Message ID | 20211025115910.2595-1-xingwu.yang@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | ipvs: Fix reuse connection if RS weight is 0 | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Single patches do not need cover letters |
netdev/fixes_present | success | Fixes tag not required for -next series |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Guessed tree name to be net-next |
netdev/subject_prefix | warning | Target tree name not specified in the subject |
netdev/cc_maintainers | success | CCed 13 of 13 maintainers |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | No Fixes tag |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 26 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
netdev/header_inline | success | No static functions without inline keyword in header files |
Hello, On Mon, 25 Oct 2021, yangxingwu wrote: > Since commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is > dead"), new connections to dead servers are redistributed immediately to > new servers. > > Then commit d752c3645717 ("ipvs: allow rescheduling of new connections when > port reuse is detected") disable expire_nodest_conn if conn_reuse_mode is > 0. And new connection may be distributed to a real server with weight 0. Your change does not look correct to me. At the time expire_nodest_conn was created, it was not checked when weight is 0. At different places different terms are used but in short, we have two independent states for real server: - inhibited: weight=0 and no new connections should be served, packets for existing connections can be routed to server if it is still available and packets are not dropped by expire_nodest_conn. The new feature is that port reuse detection can redirect the new TCP connection into a new IPVS conn and to expire the existing cp/ct. - unavailable (!IP_VS_DEST_F_AVAILABLE): server is removed, can be temporary, drop traffic for existing connections but on expire_nodest_conn we can select different server The new conn_reuse_mode flag allows port reuse to be detected. Only then expire_nodest_conn has the opportunity with commit dc7b3eb900aa to check weight=0 and to consider the old traffic as finished. If a new server is selected, any retrans from previous connection would be considered as part from the new connection. It is a rapid way to switch server without checking with is_new_conn_expected() because we can not have many conns/conntracks to different servers. > Signed-off-by: yangxingwu <xingwu.yang@gmail.com> > --- > Documentation/networking/ipvs-sysctl.rst | 3 +-- > net/netfilter/ipvs/ip_vs_core.c | 5 +++-- > 2 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst > index 2afccc63856e..1cfbf1add2fc 100644 > --- a/Documentation/networking/ipvs-sysctl.rst > +++ b/Documentation/networking/ipvs-sysctl.rst > @@ -37,8 +37,7 @@ conn_reuse_mode - INTEGER > > 0: disable any special handling on port reuse. The new > connection will be delivered to the same real server that was > - servicing the previous connection. This will effectively > - disable expire_nodest_conn. > + servicing the previous connection. > > bit 1: enable rescheduling of new connections when it is safe. > That is, whenever expire_nodest_conn and for TCP sockets, when > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index 128690c512df..9279aed69e23 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -2042,14 +2042,15 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int > ipvs, af, skb, &iph); > > conn_reuse_mode = sysctl_conn_reuse_mode(ipvs); > - if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) { > + if (!iph.fragoffs && is_new_conn(skb, &iph) && cp) { > bool old_ct = false, resched = false; > > if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest && > unlikely(!atomic_read(&cp->dest->weight))) { > resched = true; > old_ct = ip_vs_conn_uses_old_conntrack(cp, skb); > - } else if (is_new_conn_expected(cp, conn_reuse_mode)) { > + } else if (conn_reuse_mode && > + is_new_conn_expected(cp, conn_reuse_mode)) { > old_ct = ip_vs_conn_uses_old_conntrack(cp, skb); > if (!atomic_read(&cp->n_control)) { > resched = true; > -- > 2.30.2 Regards -- Julian Anastasov <ja@ssi.bg>
thanks julian What happens in this situation is that if we set the wait of the realserver to 0 and do NOT remove the weight zero realserver with sysctl settings (conn_reuse_mode == 0 && expire_nodest_conn == 1), and the client reuses its source ports, the kernel will constantly reuse connections and send the traffic to the weight 0 realserver. you may check the details from https://github.com/kubernetes/kubernetes/issues/81775 On Tue, Oct 26, 2021 at 2:12 AM Julian Anastasov <ja@ssi.bg> wrote: > > > Hello, > > On Mon, 25 Oct 2021, yangxingwu wrote: > > > Since commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is > > dead"), new connections to dead servers are redistributed immediately to > > new servers. > > > > Then commit d752c3645717 ("ipvs: allow rescheduling of new connections when > > port reuse is detected") disable expire_nodest_conn if conn_reuse_mode is > > 0. And new connection may be distributed to a real server with weight 0. > > Your change does not look correct to me. At the time > expire_nodest_conn was created, it was not checked when > weight is 0. At different places different terms are used > but in short, we have two independent states for real server: > > - inhibited: weight=0 and no new connections should be served, > packets for existing connections can be routed to server > if it is still available and packets are not dropped > by expire_nodest_conn. > The new feature is that port reuse detection can > redirect the new TCP connection into a new IPVS conn and > to expire the existing cp/ct. > > - unavailable (!IP_VS_DEST_F_AVAILABLE): server is removed, > can be temporary, drop traffic for existing connections > but on expire_nodest_conn we can select different server > > The new conn_reuse_mode flag allows port reuse to > be detected. Only then expire_nodest_conn has the > opportunity with commit dc7b3eb900aa to check weight=0 > and to consider the old traffic as finished. If a new > server is selected, any retrans from previous connection > would be considered as part from the new connection. It > is a rapid way to switch server without checking with > is_new_conn_expected() because we can not have many > conns/conntracks to different servers. > > > Signed-off-by: yangxingwu <xingwu.yang@gmail.com> > > --- > > Documentation/networking/ipvs-sysctl.rst | 3 +-- > > net/netfilter/ipvs/ip_vs_core.c | 5 +++-- > > 2 files changed, 4 insertions(+), 4 deletions(-) > > > > diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst > > index 2afccc63856e..1cfbf1add2fc 100644 > > --- a/Documentation/networking/ipvs-sysctl.rst > > +++ b/Documentation/networking/ipvs-sysctl.rst > > @@ -37,8 +37,7 @@ conn_reuse_mode - INTEGER > > > > 0: disable any special handling on port reuse. The new > > connection will be delivered to the same real server that was > > - servicing the previous connection. This will effectively > > - disable expire_nodest_conn. > > + servicing the previous connection. > > > > bit 1: enable rescheduling of new connections when it is safe. > > That is, whenever expire_nodest_conn and for TCP sockets, when > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > > index 128690c512df..9279aed69e23 100644 > > --- a/net/netfilter/ipvs/ip_vs_core.c > > +++ b/net/netfilter/ipvs/ip_vs_core.c > > @@ -2042,14 +2042,15 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int > > ipvs, af, skb, &iph); > > > > conn_reuse_mode = sysctl_conn_reuse_mode(ipvs); > > - if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) { > > + if (!iph.fragoffs && is_new_conn(skb, &iph) && cp) { > > bool old_ct = false, resched = false; > > > > if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest && > > unlikely(!atomic_read(&cp->dest->weight))) { > > resched = true; > > old_ct = ip_vs_conn_uses_old_conntrack(cp, skb); > > - } else if (is_new_conn_expected(cp, conn_reuse_mode)) { > > + } else if (conn_reuse_mode && > > + is_new_conn_expected(cp, conn_reuse_mode)) { > > old_ct = ip_vs_conn_uses_old_conntrack(cp, skb); > > if (!atomic_read(&cp->n_control)) { > > resched = true; > > -- > > 2.30.2 > > Regards > > -- > Julian Anastasov <ja@ssi.bg>
Hello, On Tue, 26 Oct 2021, yangxingwu wrote: > thanks julian > > What happens in this situation is that if we set the wait of the > realserver to 0 and do NOT remove the weight zero realserver with > sysctl settings (conn_reuse_mode == 0 && expire_nodest_conn == 1), and > the client reuses its source ports, the kernel will constantly > reuse connections and send the traffic to the weight 0 realserver. Yes, this is expected when conn_reuse_mode=0. > you may check the details from > https://github.com/kubernetes/kubernetes/issues/81775 What happens if you try conn_reuse_mode=1? The one-second delay in previous kernels should be corrected with commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f Date: Wed Jul 1 18:17:19 2020 +0300 ipvs: allow connection reuse for unconfirmed conntrack > On Tue, Oct 26, 2021 at 2:12 AM Julian Anastasov <ja@ssi.bg> wrote: > > > > On Mon, 25 Oct 2021, yangxingwu wrote: > > > > > Since commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is > > > dead"), new connections to dead servers are redistributed immediately to > > > new servers. > > > > > > Then commit d752c3645717 ("ipvs: allow rescheduling of new connections when > > > port reuse is detected") disable expire_nodest_conn if conn_reuse_mode is > > > 0. And new connection may be distributed to a real server with weight 0. > > > > Your change does not look correct to me. At the time > > expire_nodest_conn was created, it was not checked when > > weight is 0. At different places different terms are used > > but in short, we have two independent states for real server: > > > > - inhibited: weight=0 and no new connections should be served, > > packets for existing connections can be routed to server > > if it is still available and packets are not dropped > > by expire_nodest_conn. > > The new feature is that port reuse detection can > > redirect the new TCP connection into a new IPVS conn and > > to expire the existing cp/ct. > > > > - unavailable (!IP_VS_DEST_F_AVAILABLE): server is removed, > > can be temporary, drop traffic for existing connections > > but on expire_nodest_conn we can select different server > > > > The new conn_reuse_mode flag allows port reuse to > > be detected. Only then expire_nodest_conn has the > > opportunity with commit dc7b3eb900aa to check weight=0 > > and to consider the old traffic as finished. If a new > > server is selected, any retrans from previous connection > > would be considered as part from the new connection. It > > is a rapid way to switch server without checking with > > is_new_conn_expected() because we can not have many > > conns/conntracks to different servers. Regards -- Julian Anastasov <ja@ssi.bg>
thanks Julian yes, I know that the one-second delay issue has been fixed by commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f if we set conn_reuse_mode to 1 BUT it's still NOT what we expected with sysctl settings (conn_reuse_mode == 0 && expire_nodest_conn == 1). We run kubernetes in extremely diverse environments and this issue happens a lot. On Tue, Oct 26, 2021 at 1:44 PM Julian Anastasov <ja@ssi.bg> wrote: > > > Hello, > > On Tue, 26 Oct 2021, yangxingwu wrote: > > > thanks julian > > > > What happens in this situation is that if we set the wait of the > > realserver to 0 and do NOT remove the weight zero realserver with > > sysctl settings (conn_reuse_mode == 0 && expire_nodest_conn == 1), and > > the client reuses its source ports, the kernel will constantly > > reuse connections and send the traffic to the weight 0 realserver. > > Yes, this is expected when conn_reuse_mode=0. > > > you may check the details from > > https://github.com/kubernetes/kubernetes/issues/81775 > > What happens if you try conn_reuse_mode=1? The > one-second delay in previous kernels should be corrected with > > commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f > Date: Wed Jul 1 18:17:19 2020 +0300 > > ipvs: allow connection reuse for unconfirmed conntrack > > > On Tue, Oct 26, 2021 at 2:12 AM Julian Anastasov <ja@ssi.bg> wrote: > > > > > > On Mon, 25 Oct 2021, yangxingwu wrote: > > > > > > > Since commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is > > > > dead"), new connections to dead servers are redistributed immediately to > > > > new servers. > > > > > > > > Then commit d752c3645717 ("ipvs: allow rescheduling of new connections when > > > > port reuse is detected") disable expire_nodest_conn if conn_reuse_mode is > > > > 0. And new connection may be distributed to a real server with weight 0. > > > > > > Your change does not look correct to me. At the time > > > expire_nodest_conn was created, it was not checked when > > > weight is 0. At different places different terms are used > > > but in short, we have two independent states for real server: > > > > > > - inhibited: weight=0 and no new connections should be served, > > > packets for existing connections can be routed to server > > > if it is still available and packets are not dropped > > > by expire_nodest_conn. > > > The new feature is that port reuse detection can > > > redirect the new TCP connection into a new IPVS conn and > > > to expire the existing cp/ct. > > > > > > - unavailable (!IP_VS_DEST_F_AVAILABLE): server is removed, > > > can be temporary, drop traffic for existing connections > > > but on expire_nodest_conn we can select different server > > > > > > The new conn_reuse_mode flag allows port reuse to > > > be detected. Only then expire_nodest_conn has the > > > opportunity with commit dc7b3eb900aa to check weight=0 > > > and to consider the old traffic as finished. If a new > > > server is selected, any retrans from previous connection > > > would be considered as part from the new connection. It > > > is a rapid way to switch server without checking with > > > is_new_conn_expected() because we can not have many > > > conns/conntracks to different servers. > > Regards > > -- > Julian Anastasov <ja@ssi.bg>
Julian what we want is if RS weight is 0, then no new connections should be served even if conn_reuse_mode is 0, just as commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is dead") trying to do Pls let me know if there are any other issues of concern On Tue, Oct 26, 2021 at 2:13 PM yangxingwu <xingwu.yang@gmail.com> wrote: > > thanks Julian > > yes, I know that the one-second delay issue has been fixed by commit > f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f if we set conn_reuse_mode to > 1 > > BUT it's still NOT what we expected with sysctl settings > (conn_reuse_mode == 0 && expire_nodest_conn == 1). > > We run kubernetes in extremely diverse environments and this issue > happens a lot. > > On Tue, Oct 26, 2021 at 1:44 PM Julian Anastasov <ja@ssi.bg> wrote: > > > > > > Hello, > > > > On Tue, 26 Oct 2021, yangxingwu wrote: > > > > > thanks julian > > > > > > What happens in this situation is that if we set the wait of the > > > realserver to 0 and do NOT remove the weight zero realserver with > > > sysctl settings (conn_reuse_mode == 0 && expire_nodest_conn == 1), and > > > the client reuses its source ports, the kernel will constantly > > > reuse connections and send the traffic to the weight 0 realserver. > > > > Yes, this is expected when conn_reuse_mode=0. > > > > > you may check the details from > > > https://github.com/kubernetes/kubernetes/issues/81775 > > > > What happens if you try conn_reuse_mode=1? The > > one-second delay in previous kernels should be corrected with > > > > commit f0a5e4d7a594e0fe237d3dfafb069bb82f80f42f > > Date: Wed Jul 1 18:17:19 2020 +0300 > > > > ipvs: allow connection reuse for unconfirmed conntrack > > > > > On Tue, Oct 26, 2021 at 2:12 AM Julian Anastasov <ja@ssi.bg> wrote: > > > > > > > > On Mon, 25 Oct 2021, yangxingwu wrote: > > > > > > > > > Since commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is > > > > > dead"), new connections to dead servers are redistributed immediately to > > > > > new servers. > > > > > > > > > > Then commit d752c3645717 ("ipvs: allow rescheduling of new connections when > > > > > port reuse is detected") disable expire_nodest_conn if conn_reuse_mode is > > > > > 0. And new connection may be distributed to a real server with weight 0. > > > > > > > > Your change does not look correct to me. At the time > > > > expire_nodest_conn was created, it was not checked when > > > > weight is 0. At different places different terms are used > > > > but in short, we have two independent states for real server: > > > > > > > > - inhibited: weight=0 and no new connections should be served, > > > > packets for existing connections can be routed to server > > > > if it is still available and packets are not dropped > > > > by expire_nodest_conn. > > > > The new feature is that port reuse detection can > > > > redirect the new TCP connection into a new IPVS conn and > > > > to expire the existing cp/ct. > > > > > > > > - unavailable (!IP_VS_DEST_F_AVAILABLE): server is removed, > > > > can be temporary, drop traffic for existing connections > > > > but on expire_nodest_conn we can select different server > > > > > > > > The new conn_reuse_mode flag allows port reuse to > > > > be detected. Only then expire_nodest_conn has the > > > > opportunity with commit dc7b3eb900aa to check weight=0 > > > > and to consider the old traffic as finished. If a new > > > > server is selected, any retrans from previous connection > > > > would be considered as part from the new connection. It > > > > is a rapid way to switch server without checking with > > > > is_new_conn_expected() because we can not have many > > > > conns/conntracks to different servers. > > > > Regards > > > > -- > > Julian Anastasov <ja@ssi.bg>
Hello, On Wed, 27 Oct 2021, yangxingwu wrote: > what we want is if RS weight is 0, then no new connections should be > served even if conn_reuse_mode is 0, just as commit dc7b3eb900aa > ("ipvs: Fix reuse connection if real server is > dead") trying to do > > Pls let me know if there are any other issues of concern My concern is with the behaviour people expect from each sysctl var: conn_reuse_mode decides if port reuse is considered for rescheduling and expire_nodest_conn should have priority only for unavailable servers (nodest means No Destination), not in this case. We don't know how people use the conn_reuse_mode=0 mode, one may bind to a local port and try to send multiple connections in a row with the hope they will go to same real server, i.e. as part from same "session", even while weight=0. If they do not want such behaviour (99% of the cases), they will use the default conn_reuse_mode=1. OTOH, you have different expectations for mode 0, not sure why but you do not want to use the default mode=1 which is safer to use. May be the setups forget to stay with conn_reuse_mode=1 on kernels 5.9+ and set the var to 0 ? The problem with mentioned commit dc7b3eb900aa is that it breaks FTP and persistent connections while the goal of weight=0 is graceful inhibition of the server. We made the mistake to add priority for expire_nodest_conn when weight=0. This can be fixed with a !cp->control check. We do not want expire_nodest_conn to kill every connection during the graceful period. Regards -- Julian Anastasov <ja@ssi.bg>
hello On Thu, Oct 28, 2021 at 5:09 AM Julian Anastasov <ja@ssi.bg> wrote: > > > Hello, > > On Wed, 27 Oct 2021, yangxingwu wrote: > > > what we want is if RS weight is 0, then no new connections should be > > served even if conn_reuse_mode is 0, just as commit dc7b3eb900aa > > ("ipvs: Fix reuse connection if real server is > > dead") trying to do > > > > Pls let me know if there are any other issues of concern > > My concern is with the behaviour people expect > from each sysctl var: conn_reuse_mode decides if port reuse > is considered for rescheduling and expire_nodest_conn > should have priority only for unavailable servers (nodest means > No Destination), not in this case. > > We don't know how people use the conn_reuse_mode=0 > mode, one may bind to a local port and try to send multiple > connections in a row with the hope they will go to same real > server, i.e. as part from same "session", even while weight=0. > If they do not want such behaviour (99% of the cases), they > will use the default conn_reuse_mode=1. OTOH, you have different > expectations for mode 0, not sure why but you do not want to use > the default mode=1 which is safer to use. May be the setups > forget to stay with conn_reuse_mode=1 on kernels 5.9+ and > set the var to 0 ? The problem is we can NOT decide what the customers do, many of them run kubernetes with old versions of kube-proxy. And most importantly, upgrade to new version is a very long and painful process, that's why we want to fix this at the kernel level > The problem with mentioned commit dc7b3eb900aa is that > it breaks FTP and persistent connections while the goal of > weight=0 is graceful inhibition of the server. We made > the mistake to add priority for expire_nodest_conn when weight=0. > This can be fixed with a !cp->control check. We do not want > expire_nodest_conn to kill every connection during the > graceful period. ok, got it, I will try to fix this problem > Regards > > -- > Julian Anastasov <ja@ssi.bg>
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst index 2afccc63856e..1cfbf1add2fc 100644 --- a/Documentation/networking/ipvs-sysctl.rst +++ b/Documentation/networking/ipvs-sysctl.rst @@ -37,8 +37,7 @@ conn_reuse_mode - INTEGER 0: disable any special handling on port reuse. The new connection will be delivered to the same real server that was - servicing the previous connection. This will effectively - disable expire_nodest_conn. + servicing the previous connection. bit 1: enable rescheduling of new connections when it is safe. That is, whenever expire_nodest_conn and for TCP sockets, when diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c index 128690c512df..9279aed69e23 100644 --- a/net/netfilter/ipvs/ip_vs_core.c +++ b/net/netfilter/ipvs/ip_vs_core.c @@ -2042,14 +2042,15 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, int ipvs, af, skb, &iph); conn_reuse_mode = sysctl_conn_reuse_mode(ipvs); - if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) { + if (!iph.fragoffs && is_new_conn(skb, &iph) && cp) { bool old_ct = false, resched = false; if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest && unlikely(!atomic_read(&cp->dest->weight))) { resched = true; old_ct = ip_vs_conn_uses_old_conntrack(cp, skb); - } else if (is_new_conn_expected(cp, conn_reuse_mode)) { + } else if (conn_reuse_mode && + is_new_conn_expected(cp, conn_reuse_mode)) { old_ct = ip_vs_conn_uses_old_conntrack(cp, skb); if (!atomic_read(&cp->n_control)) { resched = true;
Since commit dc7b3eb900aa ("ipvs: Fix reuse connection if real server is dead"), new connections to dead servers are redistributed immediately to new servers. Then commit d752c3645717 ("ipvs: allow rescheduling of new connections when port reuse is detected") disable expire_nodest_conn if conn_reuse_mode is 0. And new connection may be distributed to a real server with weight 0. Signed-off-by: yangxingwu <xingwu.yang@gmail.com> --- Documentation/networking/ipvs-sysctl.rst | 3 +-- net/netfilter/ipvs/ip_vs_core.c | 5 +++-- 2 files changed, 4 insertions(+), 4 deletions(-)