diff mbox series

[Question] : TCP resets when using ECMP for load-balancing between multiple servers.

Message ID 20230815201048.1796-1-sriram.yagnaraman@est.tech (mailing list archive)
State Not Applicable
Delegated to: Netdev Maintainers
Headers show
Series [Question] : TCP resets when using ECMP for load-balancing between multiple servers. | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 3125 this patch: 3125
netdev/cc_maintainers warning 10 maintainers not CCed: kuba@kernel.org idosch@nvidia.com danieller@nvidia.com dsahern@kernel.org petrm@nvidia.com shuah@kernel.org davem@davemloft.net linux-kselftest@vger.kernel.org pabeni@redhat.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 1573 this patch: 1573
netdev/verify_signedoff fail author Signed-off-by missing
netdev/deprecated_api success None detected
netdev/check_selftest success net selftest script(s) already in Makefile
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 3288 this patch: 3288
netdev/checkpatch warning WARNING: Consider removing the code enclosed by this #if 0 and its #endif WARNING: added, moved or deleted file(s), does MAINTAINERS need updating? WARNING: line length of 81 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Sriram Yagnaraman Aug. 15, 2023, 8:10 p.m. UTC
All packets in the same flow (L3/L4 depending on multipath hash policy)
should be directed to the same target, but after [0] we see stray
packets directed towards other targets. This, for instance, causes RST
to be sent on TCP connections. This happens on a static setup, with no
changes to the nexthops, so there is no hash space reassignment.

IIUC, route hints when the next hop is part of a multipath group causes
packets in the same receive batch to be sent to the same next hop
irrespective of which nexthop the multipath hash points to. I am no
expert in this area, so please let me know if there is a simple
explanation on how to fix this problem?

Below is a patch which has a selftest that describes the problem setup
and a hack to solve the problem in ipv4. For ipv6, I have just commented
out the part the returns the route hint, just for testing.

[0]: 02b24941619f ("ipv4: use dst hint for ipv4 list receive")

---
 include/uapi/linux/in_route.h                 |   1 +
 net/ipv4/ip_input.c                           |   9 +-
 net/ipv4/route.c                              |   7 +-
 net/ipv6/ip6_input.c                          |   4 +
 .../testing/selftests/net/forwarding/Makefile |   1 +
 .../net/forwarding/router_multipath_vip.sh    | 324 ++++++++++++++++++
 6 files changed, 341 insertions(+), 5 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/router_multipath_vip.sh

Comments

Stephen Hemminger Aug. 15, 2023, 9:53 p.m. UTC | #1
On Tue, 15 Aug 2023 22:10:48 +0200
Sriram Yagnaraman <sriram.yagnaraman@est.tech> wrote:

> diff --git a/include/uapi/linux/in_route.h b/include/uapi/linux/in_route.h
> index 0cc2c23b47f8..01ae06c7743b 100644
> --- a/include/uapi/linux/in_route.h
> +++ b/include/uapi/linux/in_route.h
> @@ -15,6 +15,7 @@
>  #define RTCF_REDIRECTED	0x00040000
>  #define RTCF_TPROXY	0x00080000 /* unused */
>  
> +#define RTCF_MULTIPATH	0x00200000
>  #define RTCF_FAST	0x00200000 /* unused */

This looks like a bad idea. Reusing bits that were unused
in the past and exposed through uapi will likely cause
an ABI breakage.
Sriram Yagnaraman Aug. 16, 2023, 7:52 a.m. UTC | #2
> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, 15 August 2023 23:53
> To: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
> Cc: netdev@vger.kernel.org
> Subject: Re: [Question]: TCP resets when using ECMP for load-balancing
> between multiple servers.
> 
> On Tue, 15 Aug 2023 22:10:48 +0200
> Sriram Yagnaraman <sriram.yagnaraman@est.tech> wrote:
> 
> > diff --git a/include/uapi/linux/in_route.h
> > b/include/uapi/linux/in_route.h index 0cc2c23b47f8..01ae06c7743b
> > 100644
> > --- a/include/uapi/linux/in_route.h
> > +++ b/include/uapi/linux/in_route.h
> > @@ -15,6 +15,7 @@
> >  #define RTCF_REDIRECTED	0x00040000
> >  #define RTCF_TPROXY	0x00080000 /* unused */
> >
> > +#define RTCF_MULTIPATH	0x00200000
> >  #define RTCF_FAST	0x00200000 /* unused */
> 
> This looks like a bad idea. Reusing bits that were unused in the past and
> exposed through uapi will likely cause an ABI breakage.

Thanks, I agree, the code I posted is just a hack to test the hypothesis if route hints are the root cause.

It would be nice to get some guidance/suggestion on
- Is this really a problem, and should be fixed?
- What is the best way to solve this for ipv4 and ipv6?

I can propose a proper fix after that.
Ido Schimmel Aug. 17, 2023, 4:41 p.m. UTC | #3
+ David, Paolo

On Tue, Aug 15, 2023 at 10:10:48PM +0200, Sriram Yagnaraman wrote:
> All packets in the same flow (L3/L4 depending on multipath hash policy)
> should be directed to the same target, but after [0] we see stray
> packets directed towards other targets. This, for instance, causes RST
> to be sent on TCP connections. This happens on a static setup, with no
> changes to the nexthops, so there is no hash space reassignment.

Which multipath hash policy are you using? I guess the issue is more
visible with L4 as ip_can_use_hint() at least makes sure the destination
IP is the same before using the hint.

> 
> IIUC, route hints when the next hop is part of a multipath group causes
> packets in the same receive batch to be sent to the same next hop
> irrespective of which nexthop the multipath hash points to. I am no
> expert in this area, so please let me know if there is a simple
> explanation on how to fix this problem?
> 
> Below is a patch which has a selftest that describes the problem setup
> and a hack to solve the problem in ipv4. For ipv6, I have just commented
> out the part the returns the route hint, just for testing.

Did you consider marking the skb instead of the route? Something like
[1]. Compile tested only.

Also, are you positive that your selftest fails before the patch and
passes after? It is using VRFs, which use FIB rules, which should in
turn disable the use of hints. If this is indeed the case, then try
using namespaces instead. There are various examples outside of the
forwarding directory.

[1]
diff --git a/include/linux/ipv6.h b/include/linux/ipv6.h
index 5883551b1ee8..9ecebb8c6ffa 100644
--- a/include/linux/ipv6.h
+++ b/include/linux/ipv6.h
@@ -147,6 +147,7 @@ struct inet6_skb_parm {
 #define IP6SKB_JUMBOGRAM      128
 #define IP6SKB_SEG6          256
 #define IP6SKB_FAKEJUMBO      512
+#define IP6SKB_MULTIPATH     1024
 };
 
 #if defined(CONFIG_NET_L3_MASTER_DEV)
diff --git a/include/net/ip.h b/include/net/ip.h
index 332521170d9b..bdce572fa422 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -57,6 +57,7 @@ struct inet_skb_parm {
 #define IPSKB_FRAG_PMTU                BIT(6)
 #define IPSKB_L3SLAVE          BIT(7)
 #define IPSKB_NOPOLICY         BIT(8)
+#define IPSKB_MULTIPATH                BIT(9)
 
        u16                     frag_max_size;
 };
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index fe9ead9ee863..5e9c8156656a 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -584,7 +584,8 @@ static void ip_sublist_rcv_finish(struct list_head *head)
 static struct sk_buff *ip_extract_route_hint(const struct net *net,
                                             struct sk_buff *skb, int rt_type)
 {
-       if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST)
+       if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST ||
+           IPCB(skb)->flags & IPSKB_MULTIPATH)
                return NULL;
 
        return skb;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index a4e153dd615b..6a3f57a3fa41 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2144,6 +2144,7 @@ static int ip_mkroute_input(struct sk_buff *skb,
                int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys);
 
                fib_select_multipath(res, h);
+               IPCB(skb)->flags |= IPSKB_MULTIPATH;
        }
 #endif
 
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index d94041bb4287..b8378814532c 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -99,7 +99,8 @@ static bool ip6_can_use_hint(const struct sk_buff *skb,
 static struct sk_buff *ip6_extract_route_hint(const struct net *net,
                                              struct sk_buff *skb)
 {
-       if (fib6_routes_require_src(net) || fib6_has_custom_rules(net))
+       if (fib6_routes_require_src(net) || fib6_has_custom_rules(net) ||
+           IP6CB(skb)->flags & IP6SKB_MULTIPATH)
                return NULL;
 
        return skb;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index db10c36f34bb..4cdc82931c91 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -424,6 +424,8 @@ void fib6_select_path(const struct net *net, struct fib6_result *res,
        if (match->nh && have_oif_match && res->nh)
                return;
 
+       IP6CB(skb)->flags |= IP6SKB_MULTIPATH;
+
        /* We might have already computed the hash for ICMPv6 errors. In such
         * case it will always be non-zero. Otherwise now is the time to do it.
         */

> 
> [0]: 02b24941619f ("ipv4: use dst hint for ipv4 list receive")
> 
> ---
>  include/uapi/linux/in_route.h                 |   1 +
>  net/ipv4/ip_input.c                           |   9 +-
>  net/ipv4/route.c                              |   7 +-
>  net/ipv6/ip6_input.c                          |   4 +
>  .../testing/selftests/net/forwarding/Makefile |   1 +
>  .../net/forwarding/router_multipath_vip.sh    | 324 ++++++++++++++++++
>  6 files changed, 341 insertions(+), 5 deletions(-)
>  create mode 100755 tools/testing/selftests/net/forwarding/router_multipath_vip.sh
> 
> diff --git a/include/uapi/linux/in_route.h b/include/uapi/linux/in_route.h
> index 0cc2c23b47f8..01ae06c7743b 100644
> --- a/include/uapi/linux/in_route.h
> +++ b/include/uapi/linux/in_route.h
> @@ -15,6 +15,7 @@
>  #define RTCF_REDIRECTED	0x00040000
>  #define RTCF_TPROXY	0x00080000 /* unused */
>  
> +#define RTCF_MULTIPATH	0x00200000
>  #define RTCF_FAST	0x00200000 /* unused */
>  #define RTCF_MASQ	0x00400000 /* unused */
>  #define RTCF_SNAT	0x00800000 /* unused */
> diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
> index fe9ead9ee863..e06a1a6a4357 100644
> --- a/net/ipv4/ip_input.c
> +++ b/net/ipv4/ip_input.c
> @@ -582,9 +582,11 @@ static void ip_sublist_rcv_finish(struct list_head *head)
>  }
>  
>  static struct sk_buff *ip_extract_route_hint(const struct net *net,
> -					     struct sk_buff *skb, int rt_type)
> +					     struct sk_buff *skb, int rt_type,
> +					     unsigned int rt_flags)
>  {
> -	if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST)
> +	if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST ||
> +	    !!(rt_flags & RTCF_MULTIPATH))
>  		return NULL;
>  
>  	return skb;
> @@ -615,7 +617,8 @@ static void ip_list_rcv_finish(struct net *net, struct sock *sk,
>  		dst = skb_dst(skb);
>  		if (curr_dst != dst) {
>  			hint = ip_extract_route_hint(net, skb,
> -					       ((struct rtable *)dst)->rt_type);
> +					       ((struct rtable *)dst)->rt_type,
> +					       ((struct rtable *)dst)->rt_flags);
>  
>  			/* dispatch old sublist */
>  			if (!list_empty(&sublist))
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 92fede388d52..232b507faf04 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1786,6 +1786,7 @@ static void ip_handle_martian_source(struct net_device *dev,
>  
>  /* called in rcu_read_lock() section */
>  static int __mkroute_input(struct sk_buff *skb,
> +			   unsigned int flags,
>  			   const struct fib_result *res,
>  			   struct in_device *in_dev,
>  			   __be32 daddr, __be32 saddr, u32 tos)
> @@ -1856,7 +1857,7 @@ static int __mkroute_input(struct sk_buff *skb,
>  		}
>  	}
>  
> -	rth = rt_dst_alloc(out_dev->dev, 0, res->type,
> +	rth = rt_dst_alloc(out_dev->dev, flags, res->type,
>  			   IN_DEV_ORCONF(out_dev, NOXFRM));
>  	if (!rth) {
>  		err = -ENOBUFS;
> @@ -2139,16 +2140,18 @@ static int ip_mkroute_input(struct sk_buff *skb,
>  			    __be32 daddr, __be32 saddr, u32 tos,
>  			    struct flow_keys *hkeys)
>  {
> +	unsigned int flags = 0;
>  #ifdef CONFIG_IP_ROUTE_MULTIPATH
>  	if (res->fi && fib_info_num_path(res->fi) > 1) {
>  		int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys);
>  
>  		fib_select_multipath(res, h);
> +		flags |= RTCF_MULTIPATH;
>  	}
>  #endif
>  
>  	/* create a routing cache entry */
> -	return __mkroute_input(skb, res, in_dev, daddr, saddr, tos);
> +	return __mkroute_input(skb, flags, res, in_dev, daddr, saddr, tos);
>  }
>  
>  /* Implements all the saddr-related checks as ip_route_input_slow(),
> diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
> index d94041bb4287..1b7527a4a4bd 100644
> --- a/net/ipv6/ip6_input.c
> +++ b/net/ipv6/ip6_input.c
> @@ -99,10 +99,14 @@ static bool ip6_can_use_hint(const struct sk_buff *skb,
>  static struct sk_buff *ip6_extract_route_hint(const struct net *net,
>  					      struct sk_buff *skb)
>  {
> +#if 0
>  	if (fib6_routes_require_src(net) || fib6_has_custom_rules(net))
>  		return NULL;
>  
>  	return skb;
> +#else
> +	return NULL;
> +#endif
>  }
>  
>  static void ip6_list_rcv_finish(struct net *net, struct sock *sk,
> diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
> index 770efbe24f0d..bf4e5745fd5c 100644
> --- a/tools/testing/selftests/net/forwarding/Makefile
> +++ b/tools/testing/selftests/net/forwarding/Makefile
> @@ -70,6 +70,7 @@ TEST_PROGS = bridge_igmp.sh \
>  	router_mpath_nh.sh \
>  	router_multicast.sh \
>  	router_multipath.sh \
> +	router_multipath_vip.sh \
>  	router_nh.sh \
>  	router.sh \
>  	router_vid_1.sh \
> diff --git a/tools/testing/selftests/net/forwarding/router_multipath_vip.sh b/tools/testing/selftests/net/forwarding/router_multipath_vip.sh
> new file mode 100755
> index 000000000000..0415cf974388
> --- /dev/null
> +++ b/tools/testing/selftests/net/forwarding/router_multipath_vip.sh
> @@ -0,0 +1,324 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# +--------------------+                     +----------------------+
> +# | H1                 |                     |                   H2 |
> +# |                    |                     |                      |
> +# |              $h1 + |                     | + $h2                |
> +# |     192.0.2.2/24 | |                     | | 198.51.100.2/24    |
> +# | 2001:db8:1::2/64 | |                     | | 2001:db8:2::2/64   |
> +# |                  | |                     | |                    |
> +# +------------------|-+                     +-|--------------------+
> +#                    |                         |
> +# +------------------|-------------------------|--------------------+
> +# | SW               |                         |                    |
> +# |                  |                         |                    |
> +# |             $rp1 +                         + $rp2               |
> +# |     192.0.2.1/24                             198.51.100.1/24    |
> +# | 2001:db8:1::1/64     + vip                   2001:db8:2::1/64   |
> +# |                        198.18.0.0/24                            |
> +# |                        2001:db8:18::/64    + $rp3               |
> +# |                                            | 203.0.113.1/24     |
> +# |                                            | 2001:db8:3::1/64   |
> +# |                                            |                    |
> +# |                                            |                    |
> +# +--------------------------------------------|--------------------+
> +#                                              |
> +#                                            +-|--------------------+
> +#                                            | |                 H3 |
> +#                                            | |                    |
> +#                                            | | 203.0.113.2/24     |
> +#                                            | | 2001:db8:3::2/64   |
> +#                                            | + $h3                |
> +#                                            |                      |
> +#                                            +----------------------+
> +
> +ALL_TESTS="ping_ipv4 ping_ipv6 multipath_test"
> +NUM_NETIFS=6
> +source lib.sh
> +
> +h1_create()
> +{
> +	vrf_create "vrf-h1"
> +	ip link set dev $h1 master vrf-h1
> +
> +	ip link set dev vrf-h1 up
> +	ip link set dev $h1 up
> +
> +	ip address add 192.0.2.2/24 dev $h1
> +	ip address add 2001:db8:1::2/64 dev $h1
> +
> +	ip route add default vrf vrf-h1 via 192.0.2.1
> +	ip route add default vrf vrf-h1 via 2001:db8:1::1
> +}
> +
> +h1_destroy()
> +{
> +	ip route del default vrf vrf-h1 via 2001:db8:1::1
> +	ip route del default vrf vrf-h1 via 192.0.2.1
> +
> +	ip address del 2001:db8:1::2/64 dev $h1
> +	ip address del 192.0.2.2/24 dev $h1
> +
> +	ip link set dev $h1 down
> +	vrf_destroy "vrf-h1"
> +}
> +
> +h2_create()
> +{
> +	vrf_create "vrf-h2"
> +	ip link set dev $h2 master vrf-h2
> +
> +	ip link set dev vrf-h2 up
> +	ip link set dev $h2 up
> +
> +	ip address add 198.51.100.2/24 dev $h2
> +	ip address add 2001:db8:2::2/64 dev $h2
> +
> +	ip address add 198.18.0.0/24 dev vrf-h2
> +	ip address add 2001:db8:18::/64 dev vrf-h2
> +
> +	ip route add 192.0.2.0/24 vrf vrf-h2 via 198.51.100.1
> +	ip route add 2001:db8:1::/64 vrf vrf-h2 nexthop via 2001:db8:2::1
> +}
> +
> +h2_destroy()
> +{
> +	ip route del 2001:db8:1::/64 vrf vrf-h2 nexthop via 2001:db8:2::1
> +	ip route del 192.0.2.0/24 vrf vrf-h2 via 198.51.100.1
> +
> +	ip address del 2001:db8:18::/64 dev vrf-h2
> +	ip address del 198.18.0.0/24 dev vrf-h2
> +
> +	ip address del 2001:db8:2::2/64 dev $h2
> +	ip address del 198.51.100.2/24 dev $h2
> +
> +	ip link set dev $h2 down
> +	vrf_destroy "vrf-h2"
> +}
> +
> +h3_create()
> +{
> +	vrf_create "vrf-h3"
> +	ip link set dev $h3 master vrf-h3
> +
> +	ip link set dev vrf-h3 up
> +	ip link set dev $h3 up
> +
> +	ip address add 203.0.113.2/24 dev $h3
> +	ip address add 2001:db8:3::2/64 dev $h3
> +
> +	ip address add 198.18.0.0/24 dev vrf-h3
> +	ip address add 2001:db8:18::/64 dev vrf-h3
> +
> +	ip route add 192.0.2.0/24 vrf vrf-h3 via 203.0.113.1
> +	ip route add 2001:db8:1::/64 vrf vrf-h3 nexthop via 2001:db8:3::1
> +}
> +
> +h3_destroy()
> +{
> +	ip route del 2001:db8:1::/64 vrf vrf-h3 nexthop via 2001:db8:3::1
> +	ip route del 192.0.2.0/24 vrf vrf-h3 via 203.0.113.1
> +
> +	ip address del 198.18.0.0/24 dev vrf-h3
> +	ip address del 2001:db8:18::/64 dev vrf-h3
> +
> +	ip address del 2001:db8:3::2/64 dev $h3
> +	ip address del 203.0.113.2/24 dev $h3
> +
> +	ip link set dev $h3 down
> +	vrf_destroy "vrf-h3"
> +}
> +
> +router1_create()
> +{
> +	vrf_create "vrf-r1"
> +	ip link set dev $rp1 master vrf-r1
> +	ip link set dev $rp2 master vrf-r1
> +	ip link set dev $rp3 master vrf-r1
> +
> +	ip link set dev vrf-r1 up
> +	ip link set dev $rp1 up
> +	ip link set dev $rp2 up
> +	ip link set dev $rp3 up
> +
> +	ip address add 192.0.2.1/24 dev $rp1
> +	ip address add 2001:db8:1::1/64 dev $rp1
> +
> +	ip address add 198.51.100.1/24 dev $rp2
> +	ip address add 2001:db8:2::1/64 dev $rp2
> +
> +	ip address add 203.0.113.1/24 dev $rp3
> +	ip address add 2001:db8:3::1/64 dev $rp3
> +
> +	ip route add 198.18.0.0/24 vrf vrf-r1 \
> +		nexthop via 198.51.100.2 \
> +		nexthop via 203.0.113.2
> +	ip route add 2001:db8:18::/64 vrf vrf-r1 \
> +		nexthop via 2001:db8:2::2 \
> +		nexthop via 2001:db8:3::2
> +}
> +
> +router1_destroy()
> +{
> +	ip route del 2001:db8:18::/64 vrf vrf-r1
> +	ip route del 198.18.0.0/24 vrf vrf-r1
> +
> +	ip address del 2001:db8:3::1/64 dev $rp3
> +	ip address del 203.0.113.1/24 dev $rp3
> +
> +	ip address del 2001:db8:2::1/64 dev $rp2
> +	ip address del 198.51.100.1/24 dev $rp2
> +
> +	ip address del 2001:db8:1::1/64 dev $rp1
> +	ip address del 192.0.2.1/24 dev $rp1
> +
> +	ip link set dev $rp3 down
> +	ip link set dev $rp2 down
> +	ip link set dev $rp1 down
> +
> +	vrf_destroy "vrf-r1"
> +}
> +
> +multipath4_test()
> +{
> +	local desc="$1"
> +	local weight_rp2=$2
> +	local weight_rp3=$3
> +	local t0_rp2 t0_rp3 t1_rp2 t1_rp3
> +	local packets_rp2 packets_rp3
> +
> +	# Transmit multiple flows from h1 to h2 and make sure they are
> +	# distributed between both multipath links (rp2 and rp3)
> +	# according to the configured weights.
> +	sysctl_set net.ipv4.fib_multipath_hash_policy 1
> +	ip route replace 198.18.0.0/24 vrf vrf-r1 \
> +		nexthop via 198.51.100.2 weight $weight_rp2 \
> +		nexthop via 203.0.113.2 weight $weight_rp3
> +
> +	t0_rp2=$(link_stats_tx_packets_get $rp2)
> +	t0_rp3=$(link_stats_tx_packets_get $rp3)
> +
> +	ip vrf exec vrf-h1 $MZ $h1 -q -p 64 -A 192.0.2.2 -B 198.18.0.0 \
> +		-d 1msec -t tcp "sp=1024,dp=0-32768"
> +
> +	t1_rp2=$(link_stats_tx_packets_get $rp2)
> +	t1_rp3=$(link_stats_tx_packets_get $rp3)
> +
> +	let "packets_rp2 = $t1_rp2 - $t0_rp2"
> +	let "packets_rp3 = $t1_rp3 - $t0_rp3"
> +	multipath_eval "$desc" $weight_rp2 $weight_rp3 $packets_rp2 $packets_rp3
> +
> +	ip route replace 198.18.0.0/24 vrf vrf-r1 \
> +		nexthop via 198.51.100.2 \
> +		nexthop via 203.0.113.2
> +
> +	sysctl_restore net.ipv4.fib_multipath_hash_policy
> +}
> +
> +multipath6_l4_test()
> +{
> +	local desc="$1"
> +	local weight_rp2=$2
> +	local weight_rp3=$3
> +	local t0_rp2 t0_rp3 t1_rp2 t1_rp3
> +	local packets_rp2 packets_rp3
> +
> +	# Transmit multiple flows from h1 to h2 and make sure they are
> +	# distributed between both multipath links (rp2 and rp3)
> +	# according to the configured weights.
> +	sysctl_set net.ipv6.fib_multipath_hash_policy 1
> +	ip route replace 2001:db8:18::/64 vrf vrf-r1 \
> +		nexthop via 2001:db8:2::2 weight $weight_rp2 \
> +		nexthop via 2001:db8:3::2 weight $weight_rp3
> +
> +	t0_rp2=$(link_stats_tx_packets_get $rp2)
> +	t0_rp3=$(link_stats_tx_packets_get $rp3)
> +
> +	ip vrf exec vrf-h1 $MZ $h1 -6 -q -p 64 -A 2001:db8:1::2 -B 2001:db8:18::0 \
> +		-d 1msec -t tcp "sp=1024,dp=0-32768"
> +
> +	t1_rp2=$(link_stats_tx_packets_get $rp2)
> +	t1_rp3=$(link_stats_tx_packets_get $rp3)
> +
> +	let "packets_rp2 = $t1_rp2 - $t0_rp2"
> +	let "packets_rp3 = $t1_rp3 - $t0_rp3"
> +	multipath_eval "$desc" $weight_rp2 $weight_rp3 $packets_rp2 $packets_rp3
> +
> +	ip route replace 2001:db8:18::/64 vrf vrf-r1 \
> +		nexthop via 2001:db8:2::2 \
> +		nexthop via 2001:db8:3::2
> +
> +	sysctl_restore net.ipv6.fib_multipath_hash_policy
> +}
> +
> +multipath_test()
> +{
> +	log_info "Running IPv4 multipath tests"
> +	multipath4_test "ECMP" 1 1
> +	multipath4_test "Weighted MP 2:1" 2 1
> +	multipath4_test "Weighted MP 11:45" 11 45
> +
> +	log_info "Running IPv6 L4 hash multipath tests"
> +	multipath6_l4_test "ECMP" 1 1
> +	multipath6_l4_test "Weighted MP 2:1" 2 1
> +	multipath6_l4_test "Weighted MP 11:45" 11 45
> +}
> +
> +setup_prepare()
> +{
> +	h1=${NETIFS[p1]}
> +	rp1=${NETIFS[p2]}
> +
> +	rp2=${NETIFS[p3]}
> +	h2=${NETIFS[p4]}
> +
> +	rp3=${NETIFS[p5]}
> +	h3=${NETIFS[p6]}
> +
> +	vrf_prepare
> +
> +	h1_create
> +	h2_create
> +	h3_create
> +
> +	router1_create
> +
> +	forwarding_enable
> +}
> +
> +cleanup()
> +{
> +	pre_cleanup
> +
> +	forwarding_restore
> +
> +	router1_destroy
> +
> +	h3_destroy
> +	h2_destroy
> +	h1_destroy
> +
> +	vrf_cleanup
> +}
> +
> +ping_ipv4()
> +{
> +	ping_test $h1 198.51.100.2
> +	ping_test $h1 203.0.113.2
> +}
> +
> +ping_ipv6()
> +{
> +	ping6_test $h1 2001:db8:2::2
> +	ping6_test $h1 2001:db8:3::2
> +}
> +
> +trap cleanup EXIT
> +
> +setup_prepare
> +setup_wait
> +
> +tests_run
> +
> +exit $EXIT_STATUS
> -- 
> 2.34.1
> 
>
Sriram Yagnaraman Aug. 18, 2023, 1:17 p.m. UTC | #4
Hi Ido,

> -----Original Message-----
> From: Ido Schimmel <idosch@idosch.org>
> Sent: Thursday, 17 August 2023 18:41
> To: Sriram Yagnaraman <sriram.yagnaraman@est.tech>; dsahern@gmail.com;
> pabeni@redhat.com
> Cc: netdev@vger.kernel.org
> Subject: Re: [Question]: TCP resets when using ECMP for load-balancing
> between multiple servers.
> 
> + David, Paolo
> 
> On Tue, Aug 15, 2023 at 10:10:48PM +0200, Sriram Yagnaraman wrote:
> > All packets in the same flow (L3/L4 depending on multipath hash
> > policy) should be directed to the same target, but after [0] we see
> > stray packets directed towards other targets. This, for instance,
> > causes RST to be sent on TCP connections. This happens on a static
> > setup, with no changes to the nexthops, so there is no hash space
> reassignment.
> 
> Which multipath hash policy are you using? I guess the issue is more visible
> with L4 as ip_can_use_hint() at least makes sure the destination IP is the same
> before using the hint.

Yes, I am using L4 multipath hash policy. But I think the problem will be seen even on L3, if the source addresses are different.
 
> 
> >
> > IIUC, route hints when the next hop is part of a multipath group
> > causes packets in the same receive batch to be sent to the same next
> > hop irrespective of which nexthop the multipath hash points to. I am
> > no expert in this area, so please let me know if there is a simple
> > explanation on how to fix this problem?
> >
> > Below is a patch which has a selftest that describes the problem setup
> > and a hack to solve the problem in ipv4. For ipv6, I have just
> > commented out the part the returns the route hint, just for testing.
> 
> Did you consider marking the skb instead of the route? Something like [1].
> Compile tested only.
> 

Thank you so much for the idea/code. 
No, I didn't think of that. I will try your patch and get back with the results. 

> Also, are you positive that your selftest fails before the patch and passes after?
> It is using VRFs, which use FIB rules, which should in turn disable the use of
> hints. If this is indeed the case, then try using namespaces instead. There are
> various examples outside of the forwarding directory.

Ah, my mistake. I wrote the selftest to explain the problem and didn't test it thoroughly, sorry about that.
I will write a test with namespaces and send a proper patchset for review. Thanks once again.
diff mbox series

Patch

diff --git a/include/uapi/linux/in_route.h b/include/uapi/linux/in_route.h
index 0cc2c23b47f8..01ae06c7743b 100644
--- a/include/uapi/linux/in_route.h
+++ b/include/uapi/linux/in_route.h
@@ -15,6 +15,7 @@ 
 #define RTCF_REDIRECTED	0x00040000
 #define RTCF_TPROXY	0x00080000 /* unused */
 
+#define RTCF_MULTIPATH	0x00200000
 #define RTCF_FAST	0x00200000 /* unused */
 #define RTCF_MASQ	0x00400000 /* unused */
 #define RTCF_SNAT	0x00800000 /* unused */
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index fe9ead9ee863..e06a1a6a4357 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -582,9 +582,11 @@  static void ip_sublist_rcv_finish(struct list_head *head)
 }
 
 static struct sk_buff *ip_extract_route_hint(const struct net *net,
-					     struct sk_buff *skb, int rt_type)
+					     struct sk_buff *skb, int rt_type,
+					     unsigned int rt_flags)
 {
-	if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST)
+	if (fib4_has_custom_rules(net) || rt_type == RTN_BROADCAST ||
+	    !!(rt_flags & RTCF_MULTIPATH))
 		return NULL;
 
 	return skb;
@@ -615,7 +617,8 @@  static void ip_list_rcv_finish(struct net *net, struct sock *sk,
 		dst = skb_dst(skb);
 		if (curr_dst != dst) {
 			hint = ip_extract_route_hint(net, skb,
-					       ((struct rtable *)dst)->rt_type);
+					       ((struct rtable *)dst)->rt_type,
+					       ((struct rtable *)dst)->rt_flags);
 
 			/* dispatch old sublist */
 			if (!list_empty(&sublist))
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 92fede388d52..232b507faf04 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1786,6 +1786,7 @@  static void ip_handle_martian_source(struct net_device *dev,
 
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
+			   unsigned int flags,
 			   const struct fib_result *res,
 			   struct in_device *in_dev,
 			   __be32 daddr, __be32 saddr, u32 tos)
@@ -1856,7 +1857,7 @@  static int __mkroute_input(struct sk_buff *skb,
 		}
 	}
 
-	rth = rt_dst_alloc(out_dev->dev, 0, res->type,
+	rth = rt_dst_alloc(out_dev->dev, flags, res->type,
 			   IN_DEV_ORCONF(out_dev, NOXFRM));
 	if (!rth) {
 		err = -ENOBUFS;
@@ -2139,16 +2140,18 @@  static int ip_mkroute_input(struct sk_buff *skb,
 			    __be32 daddr, __be32 saddr, u32 tos,
 			    struct flow_keys *hkeys)
 {
+	unsigned int flags = 0;
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	if (res->fi && fib_info_num_path(res->fi) > 1) {
 		int h = fib_multipath_hash(res->fi->fib_net, NULL, skb, hkeys);
 
 		fib_select_multipath(res, h);
+		flags |= RTCF_MULTIPATH;
 	}
 #endif
 
 	/* create a routing cache entry */
-	return __mkroute_input(skb, res, in_dev, daddr, saddr, tos);
+	return __mkroute_input(skb, flags, res, in_dev, daddr, saddr, tos);
 }
 
 /* Implements all the saddr-related checks as ip_route_input_slow(),
diff --git a/net/ipv6/ip6_input.c b/net/ipv6/ip6_input.c
index d94041bb4287..1b7527a4a4bd 100644
--- a/net/ipv6/ip6_input.c
+++ b/net/ipv6/ip6_input.c
@@ -99,10 +99,14 @@  static bool ip6_can_use_hint(const struct sk_buff *skb,
 static struct sk_buff *ip6_extract_route_hint(const struct net *net,
 					      struct sk_buff *skb)
 {
+#if 0
 	if (fib6_routes_require_src(net) || fib6_has_custom_rules(net))
 		return NULL;
 
 	return skb;
+#else
+	return NULL;
+#endif
 }
 
 static void ip6_list_rcv_finish(struct net *net, struct sock *sk,
diff --git a/tools/testing/selftests/net/forwarding/Makefile b/tools/testing/selftests/net/forwarding/Makefile
index 770efbe24f0d..bf4e5745fd5c 100644
--- a/tools/testing/selftests/net/forwarding/Makefile
+++ b/tools/testing/selftests/net/forwarding/Makefile
@@ -70,6 +70,7 @@  TEST_PROGS = bridge_igmp.sh \
 	router_mpath_nh.sh \
 	router_multicast.sh \
 	router_multipath.sh \
+	router_multipath_vip.sh \
 	router_nh.sh \
 	router.sh \
 	router_vid_1.sh \
diff --git a/tools/testing/selftests/net/forwarding/router_multipath_vip.sh b/tools/testing/selftests/net/forwarding/router_multipath_vip.sh
new file mode 100755
index 000000000000..0415cf974388
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/router_multipath_vip.sh
@@ -0,0 +1,324 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# +--------------------+                     +----------------------+
+# | H1                 |                     |                   H2 |
+# |                    |                     |                      |
+# |              $h1 + |                     | + $h2                |
+# |     192.0.2.2/24 | |                     | | 198.51.100.2/24    |
+# | 2001:db8:1::2/64 | |                     | | 2001:db8:2::2/64   |
+# |                  | |                     | |                    |
+# +------------------|-+                     +-|--------------------+
+#                    |                         |
+# +------------------|-------------------------|--------------------+
+# | SW               |                         |                    |
+# |                  |                         |                    |
+# |             $rp1 +                         + $rp2               |
+# |     192.0.2.1/24                             198.51.100.1/24    |
+# | 2001:db8:1::1/64     + vip                   2001:db8:2::1/64   |
+# |                        198.18.0.0/24                            |
+# |                        2001:db8:18::/64    + $rp3               |
+# |                                            | 203.0.113.1/24     |
+# |                                            | 2001:db8:3::1/64   |
+# |                                            |                    |
+# |                                            |                    |
+# +--------------------------------------------|--------------------+
+#                                              |
+#                                            +-|--------------------+
+#                                            | |                 H3 |
+#                                            | |                    |
+#                                            | | 203.0.113.2/24     |
+#                                            | | 2001:db8:3::2/64   |
+#                                            | + $h3                |
+#                                            |                      |
+#                                            +----------------------+
+
+ALL_TESTS="ping_ipv4 ping_ipv6 multipath_test"
+NUM_NETIFS=6
+source lib.sh
+
+h1_create()
+{
+	vrf_create "vrf-h1"
+	ip link set dev $h1 master vrf-h1
+
+	ip link set dev vrf-h1 up
+	ip link set dev $h1 up
+
+	ip address add 192.0.2.2/24 dev $h1
+	ip address add 2001:db8:1::2/64 dev $h1
+
+	ip route add default vrf vrf-h1 via 192.0.2.1
+	ip route add default vrf vrf-h1 via 2001:db8:1::1
+}
+
+h1_destroy()
+{
+	ip route del default vrf vrf-h1 via 2001:db8:1::1
+	ip route del default vrf vrf-h1 via 192.0.2.1
+
+	ip address del 2001:db8:1::2/64 dev $h1
+	ip address del 192.0.2.2/24 dev $h1
+
+	ip link set dev $h1 down
+	vrf_destroy "vrf-h1"
+}
+
+h2_create()
+{
+	vrf_create "vrf-h2"
+	ip link set dev $h2 master vrf-h2
+
+	ip link set dev vrf-h2 up
+	ip link set dev $h2 up
+
+	ip address add 198.51.100.2/24 dev $h2
+	ip address add 2001:db8:2::2/64 dev $h2
+
+	ip address add 198.18.0.0/24 dev vrf-h2
+	ip address add 2001:db8:18::/64 dev vrf-h2
+
+	ip route add 192.0.2.0/24 vrf vrf-h2 via 198.51.100.1
+	ip route add 2001:db8:1::/64 vrf vrf-h2 nexthop via 2001:db8:2::1
+}
+
+h2_destroy()
+{
+	ip route del 2001:db8:1::/64 vrf vrf-h2 nexthop via 2001:db8:2::1
+	ip route del 192.0.2.0/24 vrf vrf-h2 via 198.51.100.1
+
+	ip address del 2001:db8:18::/64 dev vrf-h2
+	ip address del 198.18.0.0/24 dev vrf-h2
+
+	ip address del 2001:db8:2::2/64 dev $h2
+	ip address del 198.51.100.2/24 dev $h2
+
+	ip link set dev $h2 down
+	vrf_destroy "vrf-h2"
+}
+
+h3_create()
+{
+	vrf_create "vrf-h3"
+	ip link set dev $h3 master vrf-h3
+
+	ip link set dev vrf-h3 up
+	ip link set dev $h3 up
+
+	ip address add 203.0.113.2/24 dev $h3
+	ip address add 2001:db8:3::2/64 dev $h3
+
+	ip address add 198.18.0.0/24 dev vrf-h3
+	ip address add 2001:db8:18::/64 dev vrf-h3
+
+	ip route add 192.0.2.0/24 vrf vrf-h3 via 203.0.113.1
+	ip route add 2001:db8:1::/64 vrf vrf-h3 nexthop via 2001:db8:3::1
+}
+
+h3_destroy()
+{
+	ip route del 2001:db8:1::/64 vrf vrf-h3 nexthop via 2001:db8:3::1
+	ip route del 192.0.2.0/24 vrf vrf-h3 via 203.0.113.1
+
+	ip address del 198.18.0.0/24 dev vrf-h3
+	ip address del 2001:db8:18::/64 dev vrf-h3
+
+	ip address del 2001:db8:3::2/64 dev $h3
+	ip address del 203.0.113.2/24 dev $h3
+
+	ip link set dev $h3 down
+	vrf_destroy "vrf-h3"
+}
+
+router1_create()
+{
+	vrf_create "vrf-r1"
+	ip link set dev $rp1 master vrf-r1
+	ip link set dev $rp2 master vrf-r1
+	ip link set dev $rp3 master vrf-r1
+
+	ip link set dev vrf-r1 up
+	ip link set dev $rp1 up
+	ip link set dev $rp2 up
+	ip link set dev $rp3 up
+
+	ip address add 192.0.2.1/24 dev $rp1
+	ip address add 2001:db8:1::1/64 dev $rp1
+
+	ip address add 198.51.100.1/24 dev $rp2
+	ip address add 2001:db8:2::1/64 dev $rp2
+
+	ip address add 203.0.113.1/24 dev $rp3
+	ip address add 2001:db8:3::1/64 dev $rp3
+
+	ip route add 198.18.0.0/24 vrf vrf-r1 \
+		nexthop via 198.51.100.2 \
+		nexthop via 203.0.113.2
+	ip route add 2001:db8:18::/64 vrf vrf-r1 \
+		nexthop via 2001:db8:2::2 \
+		nexthop via 2001:db8:3::2
+}
+
+router1_destroy()
+{
+	ip route del 2001:db8:18::/64 vrf vrf-r1
+	ip route del 198.18.0.0/24 vrf vrf-r1
+
+	ip address del 2001:db8:3::1/64 dev $rp3
+	ip address del 203.0.113.1/24 dev $rp3
+
+	ip address del 2001:db8:2::1/64 dev $rp2
+	ip address del 198.51.100.1/24 dev $rp2
+
+	ip address del 2001:db8:1::1/64 dev $rp1
+	ip address del 192.0.2.1/24 dev $rp1
+
+	ip link set dev $rp3 down
+	ip link set dev $rp2 down
+	ip link set dev $rp1 down
+
+	vrf_destroy "vrf-r1"
+}
+
+multipath4_test()
+{
+	local desc="$1"
+	local weight_rp2=$2
+	local weight_rp3=$3
+	local t0_rp2 t0_rp3 t1_rp2 t1_rp3
+	local packets_rp2 packets_rp3
+
+	# Transmit multiple flows from h1 to h2 and make sure they are
+	# distributed between both multipath links (rp2 and rp3)
+	# according to the configured weights.
+	sysctl_set net.ipv4.fib_multipath_hash_policy 1
+	ip route replace 198.18.0.0/24 vrf vrf-r1 \
+		nexthop via 198.51.100.2 weight $weight_rp2 \
+		nexthop via 203.0.113.2 weight $weight_rp3
+
+	t0_rp2=$(link_stats_tx_packets_get $rp2)
+	t0_rp3=$(link_stats_tx_packets_get $rp3)
+
+	ip vrf exec vrf-h1 $MZ $h1 -q -p 64 -A 192.0.2.2 -B 198.18.0.0 \
+		-d 1msec -t tcp "sp=1024,dp=0-32768"
+
+	t1_rp2=$(link_stats_tx_packets_get $rp2)
+	t1_rp3=$(link_stats_tx_packets_get $rp3)
+
+	let "packets_rp2 = $t1_rp2 - $t0_rp2"
+	let "packets_rp3 = $t1_rp3 - $t0_rp3"
+	multipath_eval "$desc" $weight_rp2 $weight_rp3 $packets_rp2 $packets_rp3
+
+	ip route replace 198.18.0.0/24 vrf vrf-r1 \
+		nexthop via 198.51.100.2 \
+		nexthop via 203.0.113.2
+
+	sysctl_restore net.ipv4.fib_multipath_hash_policy
+}
+
+multipath6_l4_test()
+{
+	local desc="$1"
+	local weight_rp2=$2
+	local weight_rp3=$3
+	local t0_rp2 t0_rp3 t1_rp2 t1_rp3
+	local packets_rp2 packets_rp3
+
+	# Transmit multiple flows from h1 to h2 and make sure they are
+	# distributed between both multipath links (rp2 and rp3)
+	# according to the configured weights.
+	sysctl_set net.ipv6.fib_multipath_hash_policy 1
+	ip route replace 2001:db8:18::/64 vrf vrf-r1 \
+		nexthop via 2001:db8:2::2 weight $weight_rp2 \
+		nexthop via 2001:db8:3::2 weight $weight_rp3
+
+	t0_rp2=$(link_stats_tx_packets_get $rp2)
+	t0_rp3=$(link_stats_tx_packets_get $rp3)
+
+	ip vrf exec vrf-h1 $MZ $h1 -6 -q -p 64 -A 2001:db8:1::2 -B 2001:db8:18::0 \
+		-d 1msec -t tcp "sp=1024,dp=0-32768"
+
+	t1_rp2=$(link_stats_tx_packets_get $rp2)
+	t1_rp3=$(link_stats_tx_packets_get $rp3)
+
+	let "packets_rp2 = $t1_rp2 - $t0_rp2"
+	let "packets_rp3 = $t1_rp3 - $t0_rp3"
+	multipath_eval "$desc" $weight_rp2 $weight_rp3 $packets_rp2 $packets_rp3
+
+	ip route replace 2001:db8:18::/64 vrf vrf-r1 \
+		nexthop via 2001:db8:2::2 \
+		nexthop via 2001:db8:3::2
+
+	sysctl_restore net.ipv6.fib_multipath_hash_policy
+}
+
+multipath_test()
+{
+	log_info "Running IPv4 multipath tests"
+	multipath4_test "ECMP" 1 1
+	multipath4_test "Weighted MP 2:1" 2 1
+	multipath4_test "Weighted MP 11:45" 11 45
+
+	log_info "Running IPv6 L4 hash multipath tests"
+	multipath6_l4_test "ECMP" 1 1
+	multipath6_l4_test "Weighted MP 2:1" 2 1
+	multipath6_l4_test "Weighted MP 11:45" 11 45
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	rp1=${NETIFS[p2]}
+
+	rp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	rp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+	h3_create
+
+	router1_create
+
+	forwarding_enable
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	forwarding_restore
+
+	router1_destroy
+
+	h3_destroy
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+ping_ipv4()
+{
+	ping_test $h1 198.51.100.2
+	ping_test $h1 203.0.113.2
+}
+
+ping_ipv6()
+{
+	ping6_test $h1 2001:db8:2::2
+	ping6_test $h1 2001:db8:3::2
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tests_run
+
+exit $EXIT_STATUS