Message ID | 1655182915-12897-2-git-send-email-quic_subashab@quicinc.com |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | net: ipv6: Update route MTU behavior |

Hi Kaustubh,

On Mon, 13 Jun 2022 23:01:54 -0600
Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> wrote:

> From: Kaustubh Pandey <quic_kapandey@quicinc.com>
>
> When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.
>
> addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
> fib6_nh_mtu_change
>
> As part of handling NETDEV_CHANGEMTU notification we land up on a
> condition where if route mtu is less than dev mtu and route mtu equals
> ipv6_devconf mtu, route mtu gets updated.
>
> Due to this v6 traffic end up using wrong MTU then configured earlier.

I read this a few times but I still fail to understand what issue
you're actually fixing -- what makes this new MTU "wrong"?

The idea behind the original implementation is that, when an interface
MTU is administratively updated, we should allow PMTU updates, if the
old PMTU was matching the interface MTU, because the old MTU setting
might have been the one limiting the MTU on the whole path.

That is, if you lower the MTU on an interface, and then increase it
back, a permanently lower PMTU is somewhat unexpected. As far as I can
see, this behaviour persists with this patch, but:

> This commit fixes this by removing comparison with ipv6_devconf
> and updating route mtu only when it is greater than incoming dev mtu.

...I'm not sure what you really mean by "incoming dev mtu". Is it the
newly configured one?

> I read this a few times but I still fail to understand what issue
> you're actually fixing -- what makes this new MTU "wrong"?
>
> The idea behind the original implementation is that, when an interface
> MTU is administratively updated, we should allow PMTU updates, if the
> old PMTU was matching the interface MTU, because the old MTU setting
> might have been the one limiting the MTU on the whole path.

Hi Stefano

Here is some additional background on the observations - consider a
case where an interface is brought up with IPv6 connection first with
PMTU 1280 (set via the MTU option of a router advertisement) followed
by an IPv4 connection with PMTU 1700. A userspace management entity
sets the link MTU to the maximum of IPv4 and IPv6 PMTUs. When the IPv4
connection is brought up, the link MTU changes to 1700 (max of IPv4 &
IPv6 PMTUs in this case) which unnecessarily causes the PMTUD to occur
for the IPv6 case.

> That is, if you lower the MTU on an interface, and then increase it
> back, a permanently lower PMTU is somewhat unexpected. As far as I can
> see, this behaviour persists with this patch, but:

I agree that this would indeed occur after the patch. The assumption
here is that there would be an update from a router via a new router
advertisement if the IPv6 PMTU has to be increased / updated.

> ...I'm not sure what you really mean by "incoming dev mtu". Is it the
> newly configured one?

Yes, this is the new MTU configured by userspace.
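
For illustration, the userspace policy described in the mail above can
be sketched as follows. This is only a sketch under stated assumptions:
the name pick_link_mtu() and the convention that 0 means "this address
family is not connected yet" are hypothetical and do not come from the
thread.

```c
#include <stdint.h>

/* Hypothetical helper (not from the thread): choose the link MTU as the
 * larger of the two per-family PMTUs. A value of 0 is assumed to mean
 * that the address family has no connection yet.
 */
static uint32_t pick_link_mtu(uint32_t ipv4_pmtu, uint32_t ipv6_pmtu)
{
	return ipv4_pmtu > ipv6_pmtu ? ipv4_pmtu : ipv6_pmtu;
}
```

With the values from the mail, pick_link_mtu(0, 1280) gives 1280 while
only IPv6 is up, and pick_link_mtu(1700, 1280) gives 1700 once IPv4
comes up. The link MTU is therefore raised, and the MTU change is
processed for the IPv6 routes as well.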

On Mon, 13 Jun 2022 23:01:54 -0600 Subash Abhinov Kasiviswanathan wrote:
> When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.
>
> addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
> fib6_nh_mtu_change
>
> As part of handling NETDEV_CHANGEMTU notification we land up on a
> condition where if route mtu is less than dev mtu and route mtu equals
> ipv6_devconf mtu, route mtu gets updated.
>
> Due to this v6 traffic end up using wrong MTU then configured earlier.
> This commit fixes this by removing comparison with ipv6_devconf
> and updating route mtu only when it is greater than incoming dev mtu.
>
> This can be easily reproduced with below script:
> pre-condition:
> device up(mtu = 1500) and route mtu for both v4 and v6 is 1500
>
> test-script:
> ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1400
> ip -6 route change 2001::/64 dev eth0 metric 256 mtu 1400
> echo 1400 > /sys/class/net/eth0/mtu
> ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1500
> echo 1500 > /sys/class/net/eth0/mtu

CC maze, please add him if there is v3

I feel like the problem is with the fact that link mtu resets protocol
MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
whatever other quantification of infinity there is) so you don't have
to touch it as you discover the MTU for v4 and v6?

My worry is that the tweaking of the route MTU update heuristic will
have no end.

Stefano, does that makes sense or you think the change is good?

On Wed, Jun 15, 2022 at 5:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 13 Jun 2022 23:01:54 -0600 Subash Abhinov Kasiviswanathan wrote:
> > When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.
> >
> > addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
> > fib6_nh_mtu_change
> >
> > As part of handling NETDEV_CHANGEMTU notification we land up on a
> > condition where if route mtu is less than dev mtu and route mtu equals
> > ipv6_devconf mtu, route mtu gets updated.
> >
> > Due to this v6 traffic end up using wrong MTU then configured earlier.
> > This commit fixes this by removing comparison with ipv6_devconf
> > and updating route mtu only when it is greater than incoming dev mtu.
> >
> > This can be easily reproduced with below script:
> > pre-condition:
> > device up(mtu = 1500) and route mtu for both v4 and v6 is 1500
> >
> > test-script:
> > ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1400
> > ip -6 route change 2001::/64 dev eth0 metric 256 mtu 1400
> > echo 1400 > /sys/class/net/eth0/mtu
> > ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1500
> > echo 1500 > /sys/class/net/eth0/mtu
>
> CC maze, please add him if there is v3
>
> I feel like the problem is with the fact that link mtu resets protocol
> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> whatever other quantification of infinity there is) so you don't have
> to touch it as you discover the MTU for v4 and v6?
>
> My worry is that the tweaking of the route MTU update heuristic will
> have no end.
>
> Stefano, does that makes sense or you think the change is good?

I vaguely recall that if you don't want device mtu changes to affect
ipv6 route mtu, then you should set 'mtu lock' on the routes.
(this meaning of 'lock' for v6 is different than for ipv4, where
'lock' means transmit IPv4/TCP with Don't Frag bit unset)

>> CC maze, please add him if there is v3
>>
>> I feel like the problem is with the fact that link mtu resets protocol
>> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
>> whatever other quantification of infinity there is) so you don't have
>> to touch it as you discover the MTU for v4 and v6?

That's a good point.

>>
>> My worry is that the tweaking of the route MTU update heuristic will
>> have no end.
>>
>> Stefano, does that makes sense or you think the change is good?

The only concern is that current behavior causes the initial packets
after interface MTU increase to get dropped as part of PMTUD if the IPv6
PMTU itself didn't increase. I am not sure if that was the intended
behavior as part of the original change. Stefano, could you please confirm?

> I vaguely recall that if you don't want device mtu changes to affect
> ipv6 route mtu, then you should set 'mtu lock' on the routes.
> (this meaning of 'lock' for v6 is different than for ipv4, where
> 'lock' means transmit IPv4/TCP with Don't Frag bit unset)

I assume 'mtu lock' here refers to setting the PMTU on the IPv6 routes
statically. The issue with that approach is that router advertisements
can no longer update PMTU once a static route is configured.

On Wed, Jun 15, 2022 at 10:36 PM Subash Abhinov Kasiviswanathan (KS)
<quic_subashab@quicinc.com> wrote:
>
> >> CC maze, please add him if there is v3
> >>
> >> I feel like the problem is with the fact that link mtu resets protocol
> >> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> >> whatever other quantification of infinity there is) so you don't have
> >> to touch it as you discover the MTU for v4 and v6?
>
> That's a good point.

Because link mtu affects rx mtu which affects nic buffer allocations.
Somewhere in the vicinity of mtu 1500..2048 your packets stop fitting
in 2kB of memory and need 4kB (or more)

> >> My worry is that the tweaking of the route MTU update heuristic will
> >> have no end.
> >>
> >> Stefano, does that makes sense or you think the change is good?
>
> The only concern is that current behavior causes the initial packets
> after interface MTU increase to get dropped as part of PMTUD if the IPv6
> PMTU itself didn't increase. I am not sure if that was the intended
> behavior as part of the original change. Stefano, could you please confirm?
>
> > I vaguely recall that if you don't want device mtu changes to affect
> > ipv6 route mtu, then you should set 'mtu lock' on the routes.
> > (this meaning of 'lock' for v6 is different than for ipv4, where
> > 'lock' means transmit IPv4/TCP with Don't Frag bit unset)
>
> I assume 'mtu lock' here refers to setting the PMTU on the IPv6 routes
> statically. The issue with that approach is that router advertisements
> can no longer update PMTU once a static route is configured.

yeah. Hmm should RA generated routes use locked mtu too?
I think the only reason an RA generated route would have mtu
information is for it to stick...
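
The buffer-size concern raised above can be made concrete with a rough
sketch. Everything here is an assumption made for illustration: the
helper rx_buffer_size(), the power-of-two rounding starting at 2 kB, and
the per-packet overhead parameter are hypothetical. Real drivers size
their receive buffers in driver-specific ways.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model (not any real driver): the receive buffer must hold
 * the MTU plus some per-packet overhead, and is rounded up to the next
 * power of two, starting from 2 kB.
 */
static uint32_t rx_buffer_size(uint32_t mtu, uint32_t overhead)
{
	uint32_t need;
	uint32_t size = 2048;

	/* Keep the arithmetic well away from 32-bit overflow. */
	assert(mtu <= 65535 && overhead <= 65535);
	need = mtu + overhead;

	while (size < need)
		size <<= 1;

	return size;
}
```

Under these assumptions and a hypothetical overhead of 64 bytes, MTUs of
1500 and 1700 both fit in a 2048-byte buffer, an MTU of 2048 needs 4096
bytes, and an MTU of 9000 needs 16384 bytes.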

[Subash, please fix quoting of replies in your client, it's stripping
email authors. I rebuilt the chain here but it's kind of painful]

On Wed, 15 Jun 2022 23:36:10 -0600
"Subash Abhinov Kasiviswanathan (KS)" <quic_subashab@quicinc.com> wrote:

> On Wed, 15 Jun 2022 18:21:07 -0700
> Maciej Żenczykowski <maze@google.com> wrote:
>
> > On Wed, Jun 15, 2022 at 5:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > CC maze, please add him if there is v3
> > >
> > > I feel like the problem is with the fact that link mtu resets protocol
> > > MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> > > whatever other quantification of infinity there is)

2^16 - 1, works for both IPv4 and IPv6.

> > > so you don't have
> > > to touch it as you discover the MTU for v4 and v6?
>
> That's a good point.
>
> > > My worry is that the tweaking of the route MTU update heuristic will
> > > have no end.
> > >
> > > Stefano, does that makes sense or you think the change is good?

It makes sense -- I'm also worried that we're introducing another small
issue to fix what, I think, is the smallest possible inconvenience.

> The only concern is that current behavior causes the initial packets
> after interface MTU increase to get dropped as part of PMTUD if the IPv6
> PMTU itself didn't increase. I am not sure if that was the intended
> behavior as part of the original change. Stefano, could you please confirm?

Correct, that was the intended behaviour, because I think one dropped
packet is the smallest possible price we can pay for, knowingly, not
having anymore a PMTU estimate that's accurate in terms of RFC 1191.

> > I vaguely recall that if you don't want device mtu changes to affect
> > ipv6 route mtu, then you should set 'mtu lock' on the routes.
> > (this meaning of 'lock' for v6 is different than for ipv4, where
> > 'lock' means transmit IPv4/TCP with Don't Frag bit unset)

"Locked" exceptions are rather what's created as a result of ICMP and
ICMPv6 messages -- I guess you can have a look or run the basic
pmtu_ipv4() and pmtu_ipv6() to get a sense of it.

With the existing implementation, if you increase the link MTU to a
value that's bigger than the locked value from PMTU discovery, it will
not increase in general: the exception is locking it. That's what's
described in the comment that this patch is removing.

It will increase only under that specific condition, namely, if the
current PMTU estimate is the same as the old link MTU, because then we
can take the reasonable assumption that our link was the limiting
factor, and not some other link on the path.

It might be wrong, but I still maintain it's a reasonable assumption,
and, most importantly, we have no way to prove it wrong without PMTU
discovery.
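
The condition described in the mail above can be written as a small
standalone predicate. This is a simplified model for illustration, not
the kernel function: the name pmtu_update_allowed() and its plain
integer parameters are hypothetical, with old_link_mtu standing in for
idev->cnf.mtu6 at the time of the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the existing (pre-patch) heuristic; not kernel code.
 *
 * route_pmtu:   current PMTU estimate held by the route or exception
 * old_link_mtu: link MTU before the change (models idev->cnf.mtu6)
 * new_link_mtu: link MTU being configured
 */
static bool pmtu_update_allowed(uint32_t route_pmtu, uint32_t old_link_mtu,
				uint32_t new_link_mtu)
{
	/* A new link MTU at or below the estimate is the lowest MTU on the
	 * path, so the estimate always follows it.
	 */
	if (route_pmtu >= new_link_mtu)
		return true;

	/* The estimate matched the old link MTU: assume the local link was
	 * the limiting factor and let the estimate grow. PMTU discovery
	 * corrects it if another hop is smaller.
	 */
	if (route_pmtu == old_link_mtu)
		return true;

	/* Otherwise some other hop set the limit: keep the estimate. */
	return false;
}
```

For example, with an estimate of 1400 and an old link MTU of 1400,
raising the link to 1500 is allowed to update the estimate, while an
estimate of 1300 learned from elsewhere on the path stays as it is.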

On Thu, 16 Jun 2022 00:33:02 -0700 Maciej Żenczykowski wrote:
> On Wed, Jun 15, 2022 at 10:36 PM Subash Abhinov Kasiviswanathan (KS) <quic_subashab@quicinc.com> wrote:
> > >> CC maze, please add him if there is v3
> > >>
> > >> I feel like the problem is with the fact that link mtu resets protocol
> > >> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> > >> whatever other quantification of infinity there is) so you don't have
> > >> to touch it as you discover the MTU for v4 and v6?
> >
> > That's a good point.
>
> Because link mtu affects rx mtu which affects nic buffer allocations.
> Somewhere in the vicinity of mtu 1500..2048 your packets stop fitting
> in 2kB of memory and need 4kB (or more)

I was afraid someone would point that out :) Luckily the values Subash
mentioned were both under 2k, and hopefully the device can do scatter?

On Thu, Jun 16, 2022 at 9:42 AM Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 16 Jun 2022 00:33:02 -0700 Maciej Żenczykowski wrote:
> > On Wed, Jun 15, 2022 at 10:36 PM Subash Abhinov Kasiviswanathan (KS) <quic_subashab@quicinc.com> wrote:
> > > >> CC maze, please add him if there is v3
> > > >>
> > > >> I feel like the problem is with the fact that link mtu resets protocol
> > > >> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> > > >> whatever other quantification of infinity there is) so you don't have
> > > >> to touch it as you discover the MTU for v4 and v6?
> > >
> > > That's a good point.
> >
> > Because link mtu affects rx mtu which affects nic buffer allocations.
> > Somewhere in the vicinity of mtu 1500..2048 your packets stop fitting
> > in 2kB of memory and need 4kB (or more)
>
> I was afraid someone would point that out :) Luckily the values Subash
> mentioned were both under 2k, and hopefully the device can do scatter?
>

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d25dc83..6f7e8c5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1991,19 +1991,11 @@ static bool rt6_mtu_change_route_allowed(struct inet6_dev *idev,
 	/* If the new MTU is lower than the route PMTU, this new MTU will be the
 	 * lowest MTU in the path: always allow updating the route PMTU to
 	 * reflect PMTU decreases.
-	 *
-	 * If the new MTU is higher, and the route PMTU is equal to the local
-	 * MTU, this means the old MTU is the lowest in the path, so allow
-	 * updating it: if other nodes now have lower MTUs, PMTU discovery will
-	 * handle this.
 	 */
 
 	if (dst_mtu(&rt->dst) >= mtu)
 		return true;
 
-	if (dst_mtu(&rt->dst) == idev->cnf.mtu6)
-		return true;
-
 	return false;
 }
 
@@ -4914,8 +4906,7 @@ static int fib6_nh_mtu_change(struct fib6_nh *nh, void *_arg)
 		struct inet6_dev *idev = __in6_dev_get(arg->dev);
 		u32 mtu = f6i->fib6_pmtu;
 
-		if (mtu >= arg->mtu ||
-		    (mtu < arg->mtu && mtu == idev->cnf.mtu6))
+		if (mtu >= arg->mtu)
 			fib6_metric_set(f6i, RTAX_MTU, arg->mtu);
 
 		spin_lock_bh(&rt6_exception_lock);
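
To see what the second hunk changes in practice, the check in
fib6_nh_mtu_change() can be modelled outside the kernel and applied to
the IPv6 steps of the test script from the commit message. This is a
sketch only: route_mtu_after_change() and its "patched" flag are
hypothetical, and old_link_mtu is assumed to stand in for
idev->cnf.mtu6 at the time of the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the condition in fib6_nh_mtu_change(); not kernel
 * code. Returns the route MTU after the link MTU changes from
 * old_link_mtu (models idev->cnf.mtu6) to new_link_mtu.
 */
static uint32_t route_mtu_after_change(uint32_t route_mtu,
				       uint32_t old_link_mtu,
				       uint32_t new_link_mtu, bool patched)
{
	/* Both versions clamp the route MTU to a lower or equal link MTU. */
	if (route_mtu >= new_link_mtu)
		return new_link_mtu;

	/* Only the pre-patch code raises a route MTU that was equal to the
	 * old link MTU.
	 */
	if (!patched && route_mtu == old_link_mtu)
		return new_link_mtu;

	return route_mtu;
}
```

Following the script: the IPv6 route MTU is set to 1400 and the link is
lowered from 1500 to 1400, which leaves the route at 1400 in both
versions. When the link is raised back to 1500, the pre-patch model
returns 1500 because the route MTU equals the old link MTU, while the
patched model keeps 1400.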