diff mbox series

[net,v2,1/2] ipv6: Honor route mtu if it is within limit of dev mtu

Message ID 1655182915-12897-2-git-send-email-quic_subashab@quicinc.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series net: ipv6: Update route MTU behavior

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/cc_maintainers warning 2 maintainers not CCed: pabeni@redhat.com edumazet@google.com
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 28 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Subash Abhinov Kasiviswanathan (KS) June 14, 2022, 5:01 a.m. UTC
From: Kaustubh Pandey <quic_kapandey@quicinc.com>

When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.

addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
fib6_nh_mtu_change

While handling the NETDEV_CHANGEMTU notification, we hit a condition
where the route mtu gets updated if it is less than the dev mtu and
equals the ipv6_devconf mtu.

As a result, v6 traffic ends up using a different MTU than the one
configured earlier. This commit fixes this by removing the comparison
with ipv6_devconf and updating the route mtu only when it is greater
than or equal to the incoming dev mtu.

This can be easily reproduced with the script below:
pre-condition:
device up(mtu = 1500) and route mtu for both v4 and v6 is 1500

test-script:
ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1400
ip -6 route change 2001::/64 dev eth0 metric 256 mtu 1400
echo 1400 > /sys/class/net/eth0/mtu
ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1500
echo 1500 > /sys/class/net/eth0/mtu

Fixes: e9fa1495d738 ("ipv6: Reflect MTU changes on PMTU of exceptions for MTU-less routes")
Signed-off-by: Kaustubh Pandey <quic_kapandey@quicinc.com>
Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
---
v1 -> v2: Update the exception route logic as mentioned by David Ahern.
Also add fixes tag.

 net/ipv6/route.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

Comments

Stefano Brivio June 14, 2022, 12:27 p.m. UTC | #1
Hi Kaustubh,

On Mon, 13 Jun 2022 23:01:54 -0600
Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com> wrote:

> From: Kaustubh Pandey <quic_kapandey@quicinc.com>
> 
> When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.
> 
> addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
> fib6_nh_mtu_change
> 
> As part of handling NETDEV_CHANGEMTU notification we land up on a
> condition where if route mtu is less than dev mtu and route mtu equals
> ipv6_devconf mtu, route mtu gets updated.
> 
> Due to this v6 traffic end up using wrong MTU then configured earlier.

I read this a few times but I still fail to understand what issue
you're actually fixing -- what makes this new MTU "wrong"?

The idea behind the original implementation is that, when an interface
MTU is administratively updated, we should allow PMTU updates, if the
old PMTU was matching the interface MTU, because the old MTU setting
might have been the one limiting the MTU on the whole path.

That is, if you lower the MTU on an interface, and then increase it
back, a permanently lower PMTU is somewhat unexpected. As far as I can
see, this behaviour persists with this patch, but:

> This commit fixes this by removing comparison with ipv6_devconf
> and updating route mtu only when it is greater than incoming dev mtu.

...I'm not sure what you really mean by "incoming dev mtu". Is it the
newly configured one?
Subash Abhinov Kasiviswanathan (KS) June 14, 2022, 6:34 p.m. UTC | #2
> I read this a few times but I still fail to understand what issue
> you're actually fixing -- what makes this new MTU "wrong"?
> 
> The idea behind the original implementation is that, when an interface
> MTU is administratively updated, we should allow PMTU updates, if the
> old PMTU was matching the interface MTU, because the old MTU setting
> might have been the one limiting the MTU on the whole path.

Hi Stefano

Here is some additional background on the observations. Consider a case
where an interface is first brought up with an IPv6 connection with PMTU
1280 (set via the MTU option of a router advertisement), followed by an
IPv4 connection with PMTU 1700. A userspace management entity sets the
link MTU to the maximum of the IPv4 and IPv6 PMTUs.

When the IPv4 connection is brought up, the link MTU changes to 1700 
(max of IPv4 & IPv6 PMTUs in this case) which unnecessarily causes the 
PMTUD to occur for the IPv6 case.


> That is, if you lower the MTU on an interface, and then increase it
> back, a permanently lower PMTU is somewhat unexpected. As far as I can
> see, this behaviour persists with this patch, but:

I agree that this would indeed occur after the patch. The assumption 
here is that there would be an update from a router via a new router
advertisement if the IPv6 PMTU has to be increased / updated.

> ...I'm not sure what you really mean by "incoming dev mtu". Is it the
> newly configured one?

Yes, this is the new MTU configured by userspace.
Jakub Kicinski June 16, 2022, 12:35 a.m. UTC | #3
On Mon, 13 Jun 2022 23:01:54 -0600 Subash Abhinov Kasiviswanathan wrote:
> When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.
> 
> addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
> fib6_nh_mtu_change
> 
> As part of handling NETDEV_CHANGEMTU notification we land up on a
> condition where if route mtu is less than dev mtu and route mtu equals
> ipv6_devconf mtu, route mtu gets updated.
> 
> Due to this v6 traffic end up using wrong MTU then configured earlier.
> This commit fixes this by removing comparison with ipv6_devconf
> and updating route mtu only when it is greater than incoming dev mtu.
> 
> This can be easily reproduced with below script:
> pre-condition:
> device up(mtu = 1500) and route mtu for both v4 and v6 is 1500
> 
> test-script:
> ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1400
> ip -6 route change 2001::/64 dev eth0 metric 256 mtu 1400
> echo 1400 > /sys/class/net/eth0/mtu
> ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1500
> echo 1500 > /sys/class/net/eth0/mtu

CC maze, please add him if there is v3

I feel like the problem is with the fact that link mtu resets protocol
MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
whatever other quantification of infinity there is) so you don't have 
to touch it as you discover the MTU for v4 and v6? 

My worry is that the tweaking of the route MTU update heuristic will
have no end. 

Stefano, does that make sense, or do you think the change is good?
Maciej Żenczykowski June 16, 2022, 1:21 a.m. UTC | #4
On Wed, Jun 15, 2022 at 5:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 13 Jun 2022 23:01:54 -0600 Subash Abhinov Kasiviswanathan wrote:
> > When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.
> >
> > addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
> > fib6_nh_mtu_change
> >
> > As part of handling NETDEV_CHANGEMTU notification we land up on a
> > condition where if route mtu is less than dev mtu and route mtu equals
> > ipv6_devconf mtu, route mtu gets updated.
> >
> > Due to this v6 traffic end up using wrong MTU then configured earlier.
> > This commit fixes this by removing comparison with ipv6_devconf
> > and updating route mtu only when it is greater than incoming dev mtu.
> >
> > This can be easily reproduced with below script:
> > pre-condition:
> > device up(mtu = 1500) and route mtu for both v4 and v6 is 1500
> >
> > test-script:
> > ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1400
> > ip -6 route change 2001::/64 dev eth0 metric 256 mtu 1400
> > echo 1400 > /sys/class/net/eth0/mtu
> > ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1500
> > echo 1500 > /sys/class/net/eth0/mtu
>
> CC maze, please add him if there is v3
>
> I feel like the problem is with the fact that link mtu resets protocol
> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> whatever other quantification of infinity there is) so you don't have
> to touch it as you discover the MTU for v4 and v6?
>
> My worry is that the tweaking of the route MTU update heuristic will
> have no end.
>
> Stefano, does that makes sense or you think the change is good?

I vaguely recall that if you don't want device mtu changes to affect
ipv6 route mtu, then you should set 'mtu lock' on the routes.
(this meaning of 'lock' for v6 is different than for ipv4, where
'lock' means transmit IPv4/TCP with Don't Frag bit unset)
Subash Abhinov Kasiviswanathan (KS) June 16, 2022, 5:36 a.m. UTC | #5
>> CC maze, please add him if there is v3
>>
>> I feel like the problem is with the fact that link mtu resets protocol
>> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
>> whatever other quantification of infinity there is) so you don't have
>> to touch it as you discover the MTU for v4 and v6?

That's a good point.

>>
>> My worry is that the tweaking of the route MTU update heuristic will
>> have no end.
>>
>> Stefano, does that makes sense or you think the change is good?

The only concern is that current behavior causes the initial packets 
after interface MTU increase to get dropped as part of PMTUD if the IPv6 
PMTU itself didn't increase. I am not sure if that was the intended 
behavior as part of the original change. Stefano, could you please confirm?

> I vaguely recall that if you don't want device mtu changes to affect
> ipv6 route mtu, then you should set 'mtu lock' on the routes.
> (this meaning of 'lock' for v6 is different than for ipv4, where
> 'lock' means transmit IPv4/TCP with Don't Frag bit unset)

I assume 'mtu lock' here refers to setting the PMTU on the IPv6 routes 
statically. The issue with that approach is that router advertisements 
can no longer update PMTU once a static route is configured.
Maciej Żenczykowski June 16, 2022, 7:33 a.m. UTC | #6
On Wed, Jun 15, 2022 at 10:36 PM Subash Abhinov Kasiviswanathan (KS)
<quic_subashab@quicinc.com> wrote:
>
> >> CC maze, please add him if there is v3
> >>
> >> I feel like the problem is with the fact that link mtu resets protocol
> >> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> >> whatever other quantification of infinity there is) so you don't have
> >> to touch it as you discover the MTU for v4 and v6?
>
> That's a good point.

Because link mtu affects rx mtu which affects nic buffer allocations.
Somewhere in the vicinity of mtu 1500..2048 your packets stop fitting
in 2kB of memory and need 4kB (or more)

> >> My worry is that the tweaking of the route MTU update heuristic will
> >> have no end.
> >>
> >> Stefano, does that makes sense or you think the change is good?
>
> The only concern is that current behavior causes the initial packets
> after interface MTU increase to get dropped as part of PMTUD if the IPv6
> PMTU itself didn't increase. I am not sure if that was the intended
> behavior as part of the original change. Stefano, could you please confirm?
>
> > I vaguely recall that if you don't want device mtu changes to affect
> > ipv6 route mtu, then you should set 'mtu lock' on the routes.
> > (this meaning of 'lock' for v6 is different than for ipv4, where
> > 'lock' means transmit IPv4/TCP with Don't Frag bit unset)
>
> I assume 'mtu lock' here refers to setting the PMTU on the IPv6 routes
> statically. The issue with that approach is that router advertisements
> can no longer update PMTU once a static route is configured.

Yeah. Hmm, should RA-generated routes use a locked mtu too?
I think the only reason an RA-generated route would have mtu information
is for it to stick...
Stefano Brivio June 16, 2022, 1:39 p.m. UTC | #7
[Subash, please fix quoting of replies in your client, it's stripping
email authors. I rebuilt the chain here but it's kind of painful]

On Wed, 15 Jun 2022 23:36:10 -0600
"Subash Abhinov Kasiviswanathan (KS)" <quic_subashab@quicinc.com> wrote:

> On Wed, 15 Jun 2022 18:21:07 -0700
> Maciej Żenczykowski <maze@google.com> wrote:
> >
> > On Wed, Jun 15, 2022 at 5:35 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > CC maze, please add him if there is v3
> > >
> > > I feel like the problem is with the fact that link mtu resets protocol
> > > MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> > > whatever other quantification of infinity there is)

2^16 - 1, works for both IPv4 and IPv6.

> > > so you don't have
> > > to touch it as you discover the MTU for v4 and v6?  
> 
> That's a good point.
> 
> > > My worry is that the tweaking of the route MTU update heuristic will
> > > have no end.
> > >
> > > Stefano, does that makes sense or you think the change is good?  

It makes sense -- I'm also worried that we're introducing another small
issue to fix what, I think, is the smallest possible inconvenience.

> The only concern is that current behavior causes the initial packets 
> after interface MTU increase to get dropped as part of PMTUD if the IPv6 
> PMTU itself didn't increase. I am not sure if that was the intended 
> behavior as part of the original change. Stefano, could you please confirm?

Correct, that was the intended behaviour, because I think one dropped
packet is the smallest possible price we can pay for knowingly no
longer having a PMTU estimate that's accurate in terms of RFC 1191.

> > I vaguely recall that if you don't want device mtu changes to affect
> > ipv6 route mtu, then you should set 'mtu lock' on the routes.
> > (this meaning of 'lock' for v6 is different than for ipv4, where
> > 'lock' means transmit IPv4/TCP with Don't Frag bit unset)  

"Locked" exceptions are rather what's created as a result of ICMP and
ICMPv6 messages -- I guess you can have a look or run the basic
pmtu_ipv4() and pmtu_ipv6() to get a sense of it.

With the existing implementation, if you increase the link MTU to a
value that's bigger than the locked value from PMTU discovery, it will
not increase in general: the exception is locking it. That's what's
described in the comment that this patch is removing.

It will increase only under that specific condition, namely, if the
current PMTU estimate is the same as the old link MTU, because then we
can take the reasonable assumption that our link was the limiting
factor, and not some other link on the path. It might be wrong, but I
still maintain it's a reasonable assumption, and, most importantly, we
have no way to prove it wrong without PMTU discovery.
Jakub Kicinski June 16, 2022, 4:42 p.m. UTC | #8
On Thu, 16 Jun 2022 00:33:02 -0700 Maciej Żenczykowski wrote:
> On Wed, Jun 15, 2022 at 10:36 PM Subash Abhinov Kasiviswanathan (KS) <quic_subashab@quicinc.com> wrote:
> > >> CC maze, please add him if there is v3
> > >>
> > >> I feel like the problem is with the fact that link mtu resets protocol
> > >> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> > >> whatever other quantification of infinity there is) so you don't have
> > >> to touch it as you discover the MTU for v4 and v6?  
> >
> > That's a good point.  
> 
> Because link mtu affects rx mtu which affects nic buffer allocations.
> Somewhere in the vicinity of mtu 1500..2048 your packets stop fitting
> in 2kB of memory and need 4kB (or more)

I was afraid someone would point that out :) Luckily the values Subash
mentioned were both under 2k, and hopefully the device can do scatter?

Maciej Żenczykowski June 16, 2022, 5:08 p.m. UTC | #9
On Thu, Jun 16, 2022 at 9:42 AM Jakub Kicinski <kuba@kernel.org> wrote:
> On Thu, 16 Jun 2022 00:33:02 -0700 Maciej Żenczykowski wrote:
> > On Wed, Jun 15, 2022 at 10:36 PM Subash Abhinov Kasiviswanathan (KS) <quic_subashab@quicinc.com> wrote:
> > > >> CC maze, please add him if there is v3
> > > >>
> > > >> I feel like the problem is with the fact that link mtu resets protocol
> > > >> MTUs. Nothing we can do about that, so why not set link MTU to 9k (or
> > > >> whatever other quantification of infinity there is) so you don't have
> > > >> to touch it as you discover the MTU for v4 and v6?
> > >
> > > That's a good point.
> >
> > Because link mtu affects rx mtu which affects nic buffer allocations.
> > Somewhere in the vicinity of mtu 1500..2048 your packets stop fitting
> > in 2kB of memory and need 4kB (or more)
>
> I was afraid someone would point that out :) Luckily the values Subash
> mentioned were both under 2k, and hope fully the device can do scatter?
> 

Patch

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d25dc83..6f7e8c5 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1991,19 +1991,11 @@  static bool rt6_mtu_change_route_allowed(struct inet6_dev *idev,
 	/* If the new MTU is lower than the route PMTU, this new MTU will be the
 	 * lowest MTU in the path: always allow updating the route PMTU to
 	 * reflect PMTU decreases.
-	 *
-	 * If the new MTU is higher, and the route PMTU is equal to the local
-	 * MTU, this means the old MTU is the lowest in the path, so allow
-	 * updating it: if other nodes now have lower MTUs, PMTU discovery will
-	 * handle this.
 	 */
 
 	if (dst_mtu(&rt->dst) >= mtu)
 		return true;
 
-	if (dst_mtu(&rt->dst) == idev->cnf.mtu6)
-		return true;
-
 	return false;
 }
 
@@ -4914,8 +4906,7 @@  static int fib6_nh_mtu_change(struct fib6_nh *nh, void *_arg)
 		struct inet6_dev *idev = __in6_dev_get(arg->dev);
 		u32 mtu = f6i->fib6_pmtu;
 
-		if (mtu >= arg->mtu ||
-		    (mtu < arg->mtu && mtu == idev->cnf.mtu6))
+		if (mtu >= arg->mtu)
 			fib6_metric_set(f6i, RTAX_MTU, arg->mtu);
 
 		spin_lock_bh(&rt6_exception_lock);