Message ID | 20241029152206.303004-1-deliran@verdict.gg (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | net: ipv4: Cache pmtu for all packet paths if multipath enabled |
On 10/29/24 9:21 AM, Vladimir Vdovin wrote:
> Check number of paths by fib_info_num_path(),
> and update_or_create_fnhe() for every path.
> Problem is that pmtu is cached only for the oif
> that has received icmp message "need to frag",
> other oifs will still try to use "default" iface mtu.
>
> An example topology showing the problem:
>
>                     | host1
>                 +---------+
>                 | dummy0  | 10.179.20.18/32 mtu9000
>                 +---------+
>         +-----------+----------------+
>     +---------+                     +---------+
>     | ens17f0 | 10.179.2.141/31     | ens17f1 | 10.179.2.13/31
>     +---------+                     +---------+
>         |    (all here have mtu 9000)    |
>     +------+                         +------+
>     | ro1  | 10.179.2.140/31         | ro2  | 10.179.2.12/31
>     +------+                         +------+
>         |                                |
> ---------+------------+-------------------+------
>                         |
>                     +-----+
>                     | ro3 | 10.10.10.10 mtu1500
>                     +-----+
>                         |
>     ========================================
>                 some networks
>     ========================================
>                         |
>                     +-----+
>                     | eth0| 10.10.30.30 mtu9000
>                     +-----+
>                         | host2
>
> host1 have enabled multipath and
> sysctl net.ipv4.fib_multipath_hash_policy = 1:
>
> default proto static src 10.179.20.18
>         nexthop via 10.179.2.12 dev ens17f1 weight 1
>         nexthop via 10.179.2.140 dev ens17f0 weight 1
>
> When host1 tries to do pmtud from 10.179.20.18/32 to host2,
> host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
> And host1 caches it in nexthop exceptions cache.
>
> Problem is that it is cached only for the iface that has received icmp,
> and there is no way that ro3 will send icmp msg to host1 via another path.
>
> Host1 now have this routes to host2:
>
> ip r g 10.10.30.30 sport 30000 dport 443
> 10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
>     cache expires 521sec mtu 1500
>
> ip r g 10.10.30.30 sport 30033 dport 443
> 10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
>     cache
>

well known problem, and years ago I meant to send a similar patch.

Can you add a test case under selftests; you will see many pmtu,
redirect and multipath tests.

> So when host1 tries again to reach host2 with mtu>1500,
> if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
> if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
> until lucky day when ro3 will send it through another flow to ens17f0.
>
> Signed-off-by: Vladimir Vdovin <deliran@verdict.gg>
> ---
>  net/ipv4/route.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 723ac9181558..8eac6e361388 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1027,10 +1027,23 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
>  		struct fib_nh_common *nhc;
>
>  		fib_select_path(net, &res, fl4, NULL);
> +#ifdef CONFIG_IP_ROUTE_MULTIPATH
> +		if (fib_info_num_path(res.fi) > 1) {
> +			int nhsel;
> +
> +			for (nhsel = 0; nhsel < fib_info_num_path(fi); nhsel++) {
> +				nhc = fib_info_nhc(res.fi, nhsel);
> +				update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
> +						      jiffies + net->ipv4.ip_rt_mtu_expires);
> +			}
> +			goto rcu_unlock;
> +		}
> +#endif /* CONFIG_IP_ROUTE_MULTIPATH */
>  		nhc = FIB_RES_NHC(res);
>  		update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
>  				      jiffies + net->ipv4.ip_rt_mtu_expires);
>  	}
> +rcu_unlock:

compiler error when CONFIG_IP_ROUTE_MULTIPATH is not set.

>  	rcu_read_unlock();
>  }
>
>
> base-commit: 66600fac7a984dea4ae095411f644770b2561ede
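The review comment on the `rcu_unlock:` label can be read as follows: with `CONFIG_IP_ROUTE_MULTIPATH` unset, the only `goto rcu_unlock` is compiled out, so the label is defined but never used, which GCC reports under `-Wunused-label`. That reading is an assumption; the thread does not quote the diagnostic. One way to sidestep the label altogether is to treat a single-path route as a loop of length one. The user-space sketch below shows only that control-flow shape. `model_route` and `model_update_pmtu` are hypothetical names, not kernel APIs, and whether this structure suits `__ip_rt_update_pmtu()` is not discussed in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PATHS 8

/* Hypothetical stand-in for a route: one cached PMTU slot per nexthop. */
struct model_route {
	size_t num_paths;             /* 1 for a single-path route */
	unsigned int pmtu[MAX_PATHS]; /* 0 means "no exception cached" */
};

/*
 * Record the learned PMTU on every nexthop of the route.
 * The same loop serves single-path and multipath routes, so no
 * #ifdef and no goto label are needed in this model.
 */
static void model_update_pmtu(struct model_route *rt, unsigned int mtu)
{
	size_t nhsel;

	assert(rt->num_paths >= 1 && rt->num_paths <= MAX_PATHS);
	for (nhsel = 0; nhsel < rt->num_paths; nhsel++)
		rt->pmtu[nhsel] = mtu;
}
```

Calling `model_update_pmtu()` on a two-path route fills both slots, while a one-path route only has its first slot written.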
On Tue, Oct 29, 2024 at 05:22:23PM -0600, David Ahern wrote:
> On 10/29/24 9:21 AM, Vladimir Vdovin wrote:
> > Check number of paths by fib_info_num_path(),
> > and update_or_create_fnhe() for every path.
> > Problem is that pmtu is cached only for the oif
> > that has received icmp message "need to frag",
> > other oifs will still try to use "default" iface mtu.
> >
> > [...]
> >
> > Host1 now have this routes to host2:
> >
> > ip r g 10.10.30.30 sport 30000 dport 443
> > 10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
> >     cache expires 521sec mtu 1500
> >
> > ip r g 10.10.30.30 sport 30033 dport 443
> > 10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
> >     cache
> >
>
> well known problem, and years ago I meant to send a similar patch.

Doesn't IPv6 suffer from a similar problem?

>
> Can you add a test case under selftests; you will see many pmtu,
> redirect and multipath tests.
>
> > So when host1 tries again to reach host2 with mtu>1500,
> > if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
> > if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
> > until lucky day when ro3 will send it through another flow to ens17f0.
> >
> > Signed-off-by: Vladimir Vdovin <deliran@verdict.gg>

Thanks for the detailed commit message
On Wed Oct 30, 2024 at 8:11 PM MSK, Ido Schimmel wrote:
> On Tue, Oct 29, 2024 at 05:22:23PM -0600, David Ahern wrote:
> > On 10/29/24 9:21 AM, Vladimir Vdovin wrote:
> > > [...]
> > >
> > > Host1 now have this routes to host2:
> > >
> > > ip r g 10.10.30.30 sport 30000 dport 443
> > > 10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
> > >     cache expires 521sec mtu 1500
> > >
> > > ip r g 10.10.30.30 sport 30033 dport 443
> > > 10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
> > >     cache
> > >
> >
> > well known problem, and years ago I meant to send a similar patch.
>
> Doesn't IPv6 suffer from a similar problem?

I am not very familiar with ipv6,
but I tried to reproduce same problem with my tests with same topology.

ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30003 dport 443
fc00:1001::2:2 via fc00:2::2 dev veth_A-R2 src fc00:1000::1:1 metric 1024 expires 495sec mtu 1500 pref medium

ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30013 dport 443
fc00:1001::2:2 via fc00:1::2 dev veth_A-R1 src fc00:1000::1:1 metric 1024 expires 484sec mtu 1500 pref medium

It seems that there are no problems with ipv6. We have nhce entries for both paths.

> >
> > Can you add a test case under selftests; you will see many pmtu,
> > redirect and multipath tests.
> >
> > > So when host1 tries again to reach host2 with mtu>1500,
> > > if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
> > > if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
> > > until lucky day when ro3 will send it through another flow to ens17f0.
> > >
> > > Signed-off-by: Vladimir Vdovin <deliran@verdict.gg>
>
> Thanks for the detailed commit message
On 11/2/24 10:20 AM, Vladimir Vdovin wrote:
>>
>> Doesn't IPv6 suffer from a similar problem?

I believe the answer is yes, but do not have time to find a reproducer
right now.

>
> I am not very familiar with ipv6,
> but I tried to reproduce same problem with my tests with same topology.
>
> ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30003 dport 443
> fc00:1001::2:2 via fc00:2::2 dev veth_A-R2 src fc00:1000::1:1 metric 1024 expires 495sec mtu 1500 pref medium
>
> ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30013 dport 443
> fc00:1001::2:2 via fc00:1::2 dev veth_A-R1 src fc00:1000::1:1 metric 1024 expires 484sec mtu 1500 pref medium
>
> It seems that there are no problems with ipv6. We have nhce entries for both paths.

Does rt6_cache_allowed_for_pmtu return true or false for this test?
On Tue Nov 5, 2024 at 6:52 AM MSK, David Ahern wrote:
> On 11/2/24 10:20 AM, Vladimir Vdovin wrote:
> >>
> >> Doesn't IPv6 suffer from a similar problem?
>
> I believe the answer is yes, but do not have time to find a reproducer
> right now.
>
> >
> > I am not very familiar with ipv6,
> > but I tried to reproduce same problem with my tests with same topology.
> >
> > ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30003 dport 443
> > fc00:1001::2:2 via fc00:2::2 dev veth_A-R2 src fc00:1000::1:1 metric 1024 expires 495sec mtu 1500 pref medium
> >
> > ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30013 dport 443
> > fc00:1001::2:2 via fc00:1::2 dev veth_A-R1 src fc00:1000::1:1 metric 1024 expires 484sec mtu 1500 pref medium
> >
> > It seems that there are no problems with ipv6. We have nhce entries for both paths.
>
> Does rt6_cache_allowed_for_pmtu return true or false for this test?

It returns true.
On 11/6/24 10:20 AM, Vladimir Vdovin wrote:
> On Tue Nov 5, 2024 at 6:52 AM MSK, David Ahern wrote:
>> On 11/2/24 10:20 AM, Vladimir Vdovin wrote:
>>>>
>>>> Doesn't IPv6 suffer from a similar problem?
>>
>> I believe the answer is yes, but do not have time to find a reproducer
>> right now.
>>
>>>
>>> I am not very familiar with ipv6,
>>> but I tried to reproduce same problem with my tests with same topology.
>>>
>>> ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30003 dport 443
>>> fc00:1001::2:2 via fc00:2::2 dev veth_A-R2 src fc00:1000::1:1 metric 1024 expires 495sec mtu 1500 pref medium
>>>
>>> ip netns exec ns_a-AHtoRb ip -6 r g fc00:1001::2:2 sport 30013 dport 443
>>> fc00:1001::2:2 via fc00:1::2 dev veth_A-R1 src fc00:1000::1:1 metric 1024 expires 484sec mtu 1500 pref medium

you should dump the cache to see the full exception list.

>>>
>>> It seems that there are no problems with ipv6. We have nhce entries for both paths.
>>
>> Does rt6_cache_allowed_for_pmtu return true or false for this test?
> It returns true.
>
>

Looking at the code, it is creating a single exception - not one per path.

I am fine with deferring the ipv6 patch until someone with time and
interest can work on it.
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 723ac9181558..8eac6e361388 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1027,10 +1027,23 @@ static void __ip_rt_update_pmtu(struct rtable *rt, struct flowi4 *fl4, u32 mtu)
 		struct fib_nh_common *nhc;

 		fib_select_path(net, &res, fl4, NULL);
+#ifdef CONFIG_IP_ROUTE_MULTIPATH
+		if (fib_info_num_path(res.fi) > 1) {
+			int nhsel;
+
+			for (nhsel = 0; nhsel < fib_info_num_path(fi); nhsel++) {
+				nhc = fib_info_nhc(res.fi, nhsel);
+				update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
+						      jiffies + net->ipv4.ip_rt_mtu_expires);
+			}
+			goto rcu_unlock;
+		}
+#endif /* CONFIG_IP_ROUTE_MULTIPATH */
 		nhc = FIB_RES_NHC(res);
 		update_or_create_fnhe(nhc, fl4->daddr, 0, mtu, lock,
 				      jiffies + net->ipv4.ip_rt_mtu_expires);
 	}
+rcu_unlock:
 	rcu_read_unlock();
 }
Check number of paths by fib_info_num_path(),
and update_or_create_fnhe() for every path.
Problem is that pmtu is cached only for the oif
that has received icmp message "need to frag",
other oifs will still try to use "default" iface mtu.

An example topology showing the problem:

                    | host1
                +---------+
                | dummy0  | 10.179.20.18/32 mtu9000
                +---------+
        +-----------+----------------+
    +---------+                     +---------+
    | ens17f0 | 10.179.2.141/31     | ens17f1 | 10.179.2.13/31
    +---------+                     +---------+
        |    (all here have mtu 9000)    |
    +------+                         +------+
    | ro1  | 10.179.2.140/31         | ro2  | 10.179.2.12/31
    +------+                         +------+
        |                                |
---------+------------+-------------------+------
                        |
                    +-----+
                    | ro3 | 10.10.10.10 mtu1500
                    +-----+
                        |
    ========================================
                some networks
    ========================================
                        |
                    +-----+
                    | eth0| 10.10.30.30 mtu9000
                    +-----+
                        | host2

host1 have enabled multipath and
sysctl net.ipv4.fib_multipath_hash_policy = 1:

default proto static src 10.179.20.18
        nexthop via 10.179.2.12 dev ens17f1 weight 1
        nexthop via 10.179.2.140 dev ens17f0 weight 1

When host1 tries to do pmtud from 10.179.20.18/32 to host2,
host1 receives at ens17f1 iface an icmp packet from ro3 that ro3 mtu=1500.
And host1 caches it in nexthop exceptions cache.

Problem is that it is cached only for the iface that has received icmp,
and there is no way that ro3 will send icmp msg to host1 via another path.

Host1 now have this routes to host2:

ip r g 10.10.30.30 sport 30000 dport 443
10.10.30.30 via 10.179.2.12 dev ens17f1 src 10.179.20.18 uid 0
    cache expires 521sec mtu 1500

ip r g 10.10.30.30 sport 30033 dport 443
10.10.30.30 via 10.179.2.140 dev ens17f0 src 10.179.20.18 uid 0
    cache

So when host1 tries again to reach host2 with mtu>1500,
if packet flow is lucky enough to be hashed with oif=ens17f1 its ok,
if oif=ens17f0 it blackholes and still gets icmp msgs from ro3 to ens17f1,
until lucky day when ro3 will send it through another flow to ens17f0.

Signed-off-by: Vladimir Vdovin <deliran@verdict.gg>
---
 net/ipv4/route.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

base-commit: 66600fac7a984dea4ae095411f644770b2561ede
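The blackhole described in the commit message can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code: `model_nexthop`, `model_select_path` and `model_flow_mtu` are hypothetical names, and the XOR-based path selection is a toy stand-in for the real L4 multipath hash. It models only the behaviour before the patch, where the PMTU exception sits on the one nexthop whose interface received the ICMP message.

```c
#include <assert.h>
#include <stdint.h>

#define IFACE_MTU 9000u
#define NUM_PATHS 2u

/* Hypothetical per-nexthop exception slot: 0 means no cached PMTU. */
struct model_nexthop {
	const char *dev;
	unsigned int cached_pmtu;
};

/* Toy stand-in for the L4 multipath hash; the kernel's real hash differs. */
static unsigned int model_select_path(uint16_t sport, uint16_t dport)
{
	return (unsigned int)(sport ^ dport) % NUM_PATHS;
}

/* MTU a flow ends up using: the cached exception if any, else the iface MTU. */
static unsigned int model_flow_mtu(const struct model_nexthop *nh,
				   uint16_t sport, uint16_t dport)
{
	unsigned int cached;

	assert(nh);
	cached = nh[model_select_path(sport, dport)].cached_pmtu;
	return cached ? cached : IFACE_MTU;
}
```

With the exception stored only on the `ens17f1` slot, a flow that the toy hash maps to `ens17f1` sees 1500, while a flow mapped to `ens17f0` still sees 9000 and keeps sending packets that ro3 cannot forward.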