
Add consul_services_max_fails, consul_services_fail_timeout, fallback_peer and upsync_weight NodeMeta #200

Open · wants to merge 5 commits into master
Conversation

eugenepaniot

This patch adds the following changes:

  1. Backport changes from the lyokha/nginx-upsync-module repository (see the config sketch after this list):
    1.1 Add consul_services_max_fails and consul_services_fail_timeout directives to control the max_fails and fail_timeout values for backends
    1.2 Add a fallback_peer directive. It allows deleting all backends in an upstream and setting the fallback_peer address, marked as down, as the only backend in the upstream. Otherwise the error "cannot delete all peers" will occur
  2. Add functionality to change backend weight based on the consul agent node NodeMeta (available in Consul 0.7.3). The tag is named "upsync_weight". It can be useful when you have different hardware configurations in your environment, and can be more flexible than the "weight" ServiceTag.
    For example: CPUs from Intel and AMD have different speeds, and we should be able to balance requests based on CPU speed (that is, based on node weight). More powerful nodes will have a higher weight and should be able to serve more requests
  3. Seed the srandom PRNG with PID, seconds, and milliseconds
  4. Add current_weight, conns, max_conns and fails to show_upstream. This is more useful for backend status monitoring
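
A minimal sketch of how an upstream block could look with the backported directives; the directive names follow the discussion below, but the exact argument forms (a count, a time value, and an address:port) are assumptions, not the final syntax:

upstream app_pool {
    # assumed syntax: defaults applied to synced backends when the service has no max_fails/fail_timeout tags
    upsync_consul_services_max_fails 3;
    upsync_consul_services_fail_timeout 10s;

    # assumed syntax: address inserted, marked down, when every real backend is removed
    fallback_peer 127.0.0.1:11111;

    # placeholder server required by nginx so the block is never empty
    server 127.0.0.1:11111 down;
}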

@xiaokai-wang
Member

hi eugenepaniot,

Thanks for your contribution, it's great.

1.
  1.1 The directives are all about the conf-server (consul, etcd); they should not configure the backend
       servers, because the conf-server's function is to store all the information about the backends.
       Reading your code, the two directives only act as default values?
  1.2 What's the purpose of fallback_peer? You don't want requests to go to the old servers? If that's the case,
        can't you just mark them down?
2.
    I understand your usage scenario, but I don't recommend adding an "upsync_weight" tag; I think it will
    confuse users. You can just tune the "weight" of servers based on the hardware environment.

The above is my opinion; discussion is welcome.

About 3 and 4 I suggest a new commit.

Thanks again.

@eugenepaniot
Author

1.1 The directives are all about the conf-server (consul, etcd); they should not configure the backend
servers, because the conf-server's function is to store all the information about the backends.
Reading your code, the two directives only act as default values?

Yes, these two directives, upsync_consul_services_max_fails and upsync_consul_services_fail_timeout, control the default values when a service hasn't been tagged.
On the other hand, these directives can be used to set the backend max_fails and fail_timeout values per IP on the LoadBalancer (in this case max_fails and fail_timeout become a property of the VIP, not a property of the application). For example: Nginx as a LoadBalancer service (see our arch diagram).

[arch diagram: hlb-and-ccs]

I understand your usage scenario, but I don't recommend adding an "upsync_weight" tag; I think it will
confuse users. You can just tune the "weight" of servers based on the hardware environment.

Ok, agreed. I'll remove it.

@eugenepaniot
Author

1.2 What's the purpose of fallback_peer? You don't want requests to go to the old servers? If that's the case,
can't you just mark them down?

When our application has 0 instances, the upstream should be zeroed too. In our environment we use Mesos/Marathon for container orchestration. When a container starts on a compute node, Mesos assigns a random port for the iptables NAT rules. In the worst case (we have hit it a few times because we have a huge container cluster) nginx proxies requests to the wrong destination because the upstream hasn't been cleared.
For example (a curl sketch follows the list):

  1. Start two container-nginx instances on container port 80:
    1.1 Mesos assigns ports 32001 and 32002 and creates NAT rules:
      1. compute node1 IP port 32001 DNAT to container1 IP port 80
      2. compute node1 IP port 32002 DNAT to container2 IP port 80
  2. Suspend the nginx application. Without this patch the LoadBalancer nginx will keep the information about the last container-nginx instance (for example: compute node1 IP port 32001)
  3. Create another Mesos/Marathon application, for example a Java one
  4. Mesos/Marathon may assign the same port (32001) that was used by the container-nginx instance, and as a result traffic for the container-nginx instances is proxied to the Java instance.
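
The scale-to-zero step can be reproduced against a local consul agent with the standard deregister endpoint; the service id "111" is just a hypothetical example. After this call the service query returns no instances, but without the patch the nginx upstream still holds the stale backend:

# hypothetical example: deregister the last remaining instance, leaving the service with 0 instances
curl -X PUT http://127.0.0.1:8500/v1/agent/service/deregister/111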


@eugenepaniot left a comment


Typo in "if (tags == NULL)"

@xiaokai-wang
Member

Yes, these two directives, upsync_consul_services_max_fails and upsync_consul_services_fail_timeout, control the default values when a service hasn't been tagged.
On the other hand, these directives can be used to set the backend max_fails and fail_timeout values per IP on the LoadBalancer (in this case max_fails and fail_timeout become a property of the VIP, not a property of the application). For example: Nginx as a LoadBalancer service (see our arch diagram).

Sorry eugenepaniot, you haven't persuaded me. The attributes of a server can be fetched from consul. If you haven't tagged a server, it suggests you don't care about those values, which is why it isn't tagged.
Your purpose can be satisfied by setting max_fails and fail_timeout in consul, right? I hope I understand your scenario.
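
For reference, a sketch of what setting those values through consul service tags could look like; the service name, id and address below are placeholders, while the tag names match the ones the module already parses:

curl -X PUT -d '{"id":"web-01","name":"app","address":"10.0.0.11","port":8080, "tags":["weight=2", "max_fails=3", "fail_timeout=10s"]}' http://127.0.0.1:8500/v1/agent/service/register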

@xiaokai-wang
Member

When our application has 0 instances, the upstream should be zeroed too.

The module does not support deleting all servers of one upstream; this is to avoid making the service unusable.
The "fallback_peer" in your solution is meant to replace the deleted servers (all the real servers), right?
That can be achieved by registering the "fallback_peer" in consul in place of all the real servers, which will sync to nginx too. Would this satisfy you?

@eugenepaniot
Author

The module does not support deleting all servers of one upstream; this is to avoid making the service unusable.

But what if this is expected behaviour? In my case I need to clear the whole backend list because I have suspended my application.

The "fallback_peer" in your solution is meant to replace the deleted servers (all the real servers), right?

Yes, correct.

That can be achieved by registering the "fallback_peer" in consul in place of all the real servers, which will sync to nginx too. Would this satisfy you?

You mean "fallback_peer" as a tag? But I won't have a service registration (and therefore no tags) because I don't have any running service instances. Or do you mean from the KV store?

@gfrankliu
Collaborator

We also have such cases where all upstream servers get unregistered from consul. That's why we put a dummy "server" in the "upstream", e.g.:

upstream upstream_pool {
  server 127.0.0.1:11111 down;
  upsync ...
  ...
}

This is to make nginx happy, since nginx requires at least one server in the pool.

Why can't you use that for your fallback peer? It will get replaced once consul returns live servers.
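
A slightly fuller version of that dummy-server pattern, for illustration only; the upsync endpoint, parameters and dump path below are drawn from the module's documented directives rather than from this thread, so adjust them to your setup:

upstream upstream_pool {
    # dummy server keeps nginx happy when consul returns no live backends
    server 127.0.0.1:11111 down;

    # illustrative upsync settings; change the consul address and service name to match your environment
    upsync 127.0.0.1:8500/v1/catalog/service/upstream_pool upsync_timeout=6m upsync_interval=500ms upsync_type=consul_services strong_dependency=off;
    upsync_dump_path /usr/local/nginx/conf/servers/servers_upstream_pool.conf;
}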

2. Add max_stale GET parameter to service discovery request
3. Add wait GET parameter to service discovery request
@cai-personal

Why is this PR needed?

Scale up: curl -X PUT -d '{"id":"111","name":"testnginx","address":"127.0.0.1","port":8001, "tags":["weight=4", "fail_timeout=13s", "max_fails=19"]}' http://127.0.0.1:8500/v1/agent/service/register

Scale down: curl -X PUT -d '{"id":"111","name":"testnginx","address":"127.0.0.1","port":8001, "tags":["weight=4", "fail_timeout=13s", "max_fails=19", "down"]}' http://127.0.0.1:8500/v1/agent/service/register, or just deregister the service
