diff mbox series

Refcount mismatch when unregistering netdevice from kernel

Message ID ca64de092db5a2ac80d22eaa9d662520@codeaurora.org (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Refcount mismatch when unregistering netdevice from kernel | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

stranche@codeaurora.org Dec. 8, 2020, 3:55 a.m. UTC
Hi everyone,

We've recently been investigating a refcount problem when unregistering 
a netdevice from the kernel. It seems that the IPv6 module is still 
holding various references to the inet6_dev associated with the main 
netdevice struct that are not being released, preventing the 
unregistration from completing.

After tracing the various locations that take/release refcounts to these 
two structs, we see that there are mismatches in the refcount for the 
inet6_dev in the DST path when there are routes flushed with the 
rt6_uncached_list_flush_dev() function during rt6_disable_ip() when the 
device is unregistering. It seems that usually these references are 
freed via ip6_dst_ifdown() in the dst_ops->ifdown callback from 
dst_dev_put(), but this callback is not executed in the 
rt6_uncached_list_flush_dev() function. Instead, a reference to the 
inet6_dev is simply dropped to account for the inet6_dev_get() in 
ip6_rt_copy_init().

Unfortunately, this does not seem to be sufficient, as these uncached 
routes have an additional refcount on the inet6_device attached to them 
from the DST allocation. In the normal case, this reference from the DST 
allocation will happen in the dst_ops->destroy() callback in the 
dst_destroy() function when the DST is being freed. However, since 
rt6_uncached_list_flush_dev() changes the inet6_device stored in the DST 
to the loopback device, the dst_ops->destroy() callback doesn't 
decrement the refcount on the correct inet6_dev struct.

We're wondering if it would be appropriate to put() the refcount 
additionally for the uncached routes when flushing out the list for the 
unregistering device. Perhaps something like the following?

          if (dev == loopback_dev)
@@ -190,6 +192,10 @@ static void rt6_uncached_list_flush_dev(struct net 
*net, struct net_device *dev)
                          }

                          if (rt_idev->dev == dev) {
+                               if (rt->dst.ops->ifdown)
+                                       rt->dst.ops->ifdown(&rt->dst, 
dev,
+                                                           unreg);
+
                                  rt->rt6i_idev = 
in6_dev_get(loopback_dev);
                                  in6_dev_put(rt_idev);
                          }
@@ -4781,7 +4787,7 @@ void rt6_sync_down_dev(struct net_device *dev, 
unsigned long event)
   void rt6_disable_ip(struct net_device *dev, unsigned long event)
   {
          rt6_sync_down_dev(dev, event);
-       rt6_uncached_list_flush_dev(dev_net(dev), dev);
+       rt6_uncached_list_flush_dev(dev_net(dev), dev, event);
          neigh_ifdown(&nd_tbl, dev);
   }


For reference, here are some samples of the refcounts on the 
inet6_device. In this log we saw the inet6_device had a final refcount 
of 4 while unregistering.
holds                 puts
ip6_rt_copy_init 17   13 ip6_dst_ifdown
xfrm6_fill_dst   6     5 xfrm6_dst_destroy
                        1 rt6_disable_ip
icmp6_dst_alloc  28   25 ip6_dst_destroy
                        3 rt6_disable_ip

Thanks,
Sean

Comments

Eric Dumazet Dec. 8, 2020, 3:08 p.m. UTC | #1
On 12/8/20 4:55 AM, stranche@codeaurora.org wrote:
> Hi everyone,
> 
> We've recently been investigating a refcount problem when unregistering a netdevice from the kernel. It seems that the IPv6 module is still holding various references to the inet6_dev associated with the main netdevice struct that are not being released, preventing the unregistration from completing.
> 
> After tracing the various locations that take/release refcounts to these two structs, we see that there are mismatches in the refcount for the inet6_dev in the DST path when there are routes flushed with the rt6_uncached_list_flush_dev() function during rt6_disable_ip() when the device is unregistering. It seems that usually these references are freed via ip6_dst_ifdown() in the dst_ops->ifdown callback from dst_dev_put(), but this callback is not executed in the rt6_uncached_list_flush_dev() function. Instead, a reference to the inet6_dev is simply dropped to account for the inet6_dev_get() in ip6_rt_copy_init().
> 
> Unfortunately, this does not seem to be sufficient, as these uncached routes have an additional refcount on the inet6_device attached to them from the DST allocation. In the normal case, this reference from the DST allocation will happen in the dst_ops->destroy() callback in the dst_destroy() function when the DST is being freed. However, since rt6_uncached_list_flush_dev() changes the inet6_device stored in the DST to the loopback device, the dst_ops->destroy() callback doesn't decrement the refcount on the correct inet6_dev struct.
> 
> We're wondering if it would be appropriate to put() the refcount additionally for the uncached routes when flushing out the list for the unregistering device. Perhaps something like the following?
> 

dev refcount imbalances are quite common, particularly on old kernel versions, lacking few fixes.

First thing is to let us know on which kernel version you see this, and how you reproduce it.
Wei Wang Dec. 8, 2020, 6:09 p.m. UTC | #2
Hi Sean,

Thanks for reporting the issue. Like Eric said, if you could provide
the kernel version you are using, that would be great. And if you have
a reproducer, that would be even better.
I have a few comments inline below.

On Tue, Dec 8, 2020 at 7:08 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 12/8/20 4:55 AM, stranche@codeaurora.org wrote:
> > Hi everyone,
> >
> > We've recently been investigating a refcount problem when unregistering a netdevice from the kernel. It seems that the IPv6 module is still holding various references to the inet6_dev associated with the main netdevice struct that are not being released, preventing the unregistration from completing.
> >
> > After tracing the various locations that take/release refcounts to these two structs, we see that there are mismatches in the refcount for the inet6_dev in the DST path when there are routes flushed with the rt6_uncached_list_flush_dev() function during rt6_disable_ip() when the device is unregistering. It seems that usually these references are freed via ip6_dst_ifdown() in the dst_ops->ifdown callback from dst_dev_put(), but this callback is not executed in the rt6_uncached_list_flush_dev() function. Instead, a reference to the inet6_dev is simply dropped to account for the inet6_dev_get() in ip6_rt_copy_init().
> >

rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev
with loopback_dev, and release the reference to the previous inet6_dev
by calling in6_dev_put(), which is actually doing the same thing as
ip6_dst_ifdown(). I don't understand why you say " a reference to the
inet6_dev is simply dropped".

> > Unfortunately, this does not seem to be sufficient, as these uncached routes have an additional refcount on the inet6_device attached to them from the DST allocation. In the normal case, this reference from the DST allocation will happen in the dst_ops->destroy() callback in the dst_destroy() function when the DST is being freed. However, since rt6_uncached_list_flush_dev() changes the inet6_device stored in the DST to the loopback device, the dst_ops->destroy() callback doesn't decrement the refcount on the correct inet6_dev struct.
> >

The additional refcount to the DST is also released by doing the following:
                        if (rt_dev == dev) {
                                rt->dst.dev = blackhole_netdev;
                                dev_hold(rt->dst.dev);
                                dev_put(rt_dev);
                        }
Am I missing something?


> > We're wondering if it would be appropriate to put() the refcount additionally for the uncached routes when flushing out the list for the unregistering device. Perhaps something like the following?
> >
>
> dev refcount imbalances are quite common, particularly on old kernel versions, lacking few fixes.
>
> First thing is to let us know on which kernel version you see this, and how you reproduce it.
>
stranche@codeaurora.org Dec. 8, 2020, 7:12 p.m. UTC | #3
Hi Wei and Eric,

Thanks for the replies.

This was reported to us on the 5.4.61 kernel during a customer 
regression suite, so we don't have an exact reproducer unfortunately. 
 From the trace logs we've added it seems like this is happening during 
IPv6 transport mode XFRM data transfer and the device is unregistered in 
the middle of it, but we've been unable to reproduce it ourselves.. 
We're open to trying out and sharing debug patches if needed though.

> rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev
> with loopback_dev, and release the reference to the previous inet6_dev
> by calling in6_dev_put(), which is actually doing the same thing as
> ip6_dst_ifdown(). I don't understand why you say " a reference to the
> inet6_dev is simply dropped".

Fair. I was going off the semantics used by the dst_dev_put() function 
which calls dst_ops->ifdown() explicitly. At least in the case of 
xfrm6_dst_ifdown() this swap of the loopback device and putting the 
refcount seems like it could be missing a few things.

> The additional refcount to the DST is also released by doing the 
> following:
>                         if (rt_dev == dev) {
>                                 rt->dst.dev = blackhole_netdev;
>                                 dev_hold(rt->dst.dev);
>                                 dev_put(rt_dev);
>                         }
> Am I missing something?

That dev_put() is on the actual netdevice struct, not the inet6_dev 
associated with it. We're seeing many calls to icmp6_dst_alloc() and 
xfrm6_fill_dst() here, both of which seem to associate a reference to 
the inet6_dev struct with the DST in addition to the standard dev_hold() 
on the netdevice during the dst_alloc()/dst_init().

Thanks,
Sean
Wei Wang Dec. 8, 2020, 9:51 p.m. UTC | #4
On Tue, Dec 8, 2020 at 11:13 AM <stranche@codeaurora.org> wrote:
>
> Hi Wei and Eric,
>
> Thanks for the replies.
>
> This was reported to us on the 5.4.61 kernel during a customer
> regression suite, so we don't have an exact reproducer unfortunately.
>  From the trace logs we've added it seems like this is happening during
> IPv6 transport mode XFRM data transfer and the device is unregistered in
> the middle of it, but we've been unable to reproduce it ourselves..
> We're open to trying out and sharing debug patches if needed though.
>

I double checked 5.4.61, and I didn't find any missing fixes in this
area AFAICT.

> > rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev
> > with loopback_dev, and release the reference to the previous inet6_dev
> > by calling in6_dev_put(), which is actually doing the same thing as
> > ip6_dst_ifdown(). I don't understand why you say " a reference to the
> > inet6_dev is simply dropped".
>
> Fair. I was going off the semantics used by the dst_dev_put() function
> which calls dst_ops->ifdown() explicitly. At least in the case of
> xfrm6_dst_ifdown() this swap of the loopback device and putting the
> refcount seems like it could be missing a few things.
>

Looking more into the xfrm code, I think the major difference between
xfrm dst and non-xfrm dst is that, xfrm code creates a list of dst
entries in one dst_entry linked by xfrm_dst_child().
In xfrm_bundle_create(), which I believe is the main function to
create xfrm dst bundles, it allocates the main dst entry and its
children, and it calls xfrm_fill_dst() for each of them. So I think
each dst in the list (including all the children) are added into the
uncached_list.
The difference between the current code in
rt6_uncached_list_flush_dev() vs dst_ops->ifdown() is that, the
current code only releases the refcnt to inet6_dev associated with the
main dst, while xfrm6_dst_ifdown() tries to release inet6_dev
associated with every dst linked by xfrm_dst_child(). However, since
xfrm_bundle_create() anyway adds each child dst to the uncached list,
I don't see how the current code could miss any refcnt.
BTW, have you tried your previous proposed patch and confirmed it
would fix the issue?

> > The additional refcount to the DST is also released by doing the
> > following:
> >                         if (rt_dev == dev) {
> >                                 rt->dst.dev = blackhole_netdev;
> >                                 dev_hold(rt->dst.dev);
> >                                 dev_put(rt_dev);
> >                         }
> > Am I missing something?
>
> That dev_put() is on the actual netdevice struct, not the inet6_dev
> associated with it. We're seeing many calls to icmp6_dst_alloc() and
> xfrm6_fill_dst() here, both of which seem to associate a reference to
> the inet6_dev struct with the DST in addition to the standard dev_hold()
> on the netdevice during the dst_alloc()/dst_init().
>

Could we further distinguish between dst added to the uncached list by
icmp6_dst_alloc() and xfrm6_fill_dst(), and confirm which ones are the
ones leaking reference?
I suspect it would be the xfrm ones, but I think it is worth verifying.


> Thanks,
> Sean
David Ahern Dec. 9, 2020, 12:03 a.m. UTC | #5
On 12/8/20 2:51 PM, Wei Wang wrote:
> On Tue, Dec 8, 2020 at 11:13 AM <stranche@codeaurora.org> wrote:
>>
>> Hi Wei and Eric,
>>
>> Thanks for the replies.
>>
>> This was reported to us on the 5.4.61 kernel during a customer
>> regression suite, so we don't have an exact reproducer unfortunately.
>>  From the trace logs we've added it seems like this is happening during
>> IPv6 transport mode XFRM data transfer and the device is unregistered in
>> the middle of it, but we've been unable to reproduce it ourselves..
>> We're open to trying out and sharing debug patches if needed though.
>>
> 
> I double checked 5.4.61, and I didn't find any missing fixes in this
> area AFAICT.
> 
>>> rt6_uncached_list_flush_dev() actually tries to replace the inet6_dev
>>> with loopback_dev, and release the reference to the previous inet6_dev
>>> by calling in6_dev_put(), which is actually doing the same thing as
>>> ip6_dst_ifdown(). I don't understand why you say " a reference to the
>>> inet6_dev is simply dropped".
>>
>> Fair. I was going off the semantics used by the dst_dev_put() function
>> which calls dst_ops->ifdown() explicitly. At least in the case of
>> xfrm6_dst_ifdown() this swap of the loopback device and putting the
>> refcount seems like it could be missing a few things.
>>
> 
> Looking more into the xfrm code, I think the major difference between
> xfrm dst and non-xfrm dst is that, xfrm code creates a list of dst
> entries in one dst_entry linked by xfrm_dst_child().
> In xfrm_bundle_create(), which I believe is the main function to
> create xfrm dst bundles, it allocates the main dst entry and its
> children, and it calls xfrm_fill_dst() for each of them. So I think
> each dst in the list (including all the children) are added into the
> uncached_list.
> The difference between the current code in
> rt6_uncached_list_flush_dev() vs dst_ops->ifdown() is that, the
> current code only releases the refcnt to inet6_dev associated with the
> main dst, while xfrm6_dst_ifdown() tries to release inet6_dev
> associated with every dst linked by xfrm_dst_child(). However, since
> xfrm_bundle_create() anyway adds each child dst to the uncached list,
> I don't see how the current code could miss any refcnt.
> BTW, have you tried your previous proposed patch and confirmed it
> would fix the issue?
> 
>>> The additional refcount to the DST is also released by doing the
>>> following:
>>>                         if (rt_dev == dev) {
>>>                                 rt->dst.dev = blackhole_netdev;
>>>                                 dev_hold(rt->dst.dev);
>>>                                 dev_put(rt_dev);
>>>                         }
>>> Am I missing something?
>>
>> That dev_put() is on the actual netdevice struct, not the inet6_dev
>> associated with it. We're seeing many calls to icmp6_dst_alloc() and
>> xfrm6_fill_dst() here, both of which seem to associate a reference to
>> the inet6_dev struct with the DST in addition to the standard dev_hold()
>> on the netdevice during the dst_alloc()/dst_init().
>>
> 
> Could we further distinguish between dst added to the uncached list by
> icmp6_dst_alloc() and xfrm6_fill_dst(), and confirm which ones are the
> ones leaking reference?
> I suspect it would be the xfrm ones, but I think it is worth verifying.
> 

Finally found the reference:

tools/testing/selftests/net/l2tp.sh at one point was triggering a
refcount leak:

https://lore.kernel.org/netdev/20190801235421.8344-1-dsahern@kernel.org/

And then Colin found more problems with it:

https://lore.kernel.org/netdev/450f5abb-5fe8-158d-d267-4334e15f8e58@canonical.com/


running that on a 5.8 kernel on Ubuntu 20.10 did not trigger the
problem. Neither did Ubuntu 20.04 with 5.4.0-51-generic.

Can you run it on your 5.4 version and see?
stranche@codeaurora.org Dec. 11, 2020, 1:12 a.m. UTC | #6
>> BTW, have you tried your previous proposed patch and confirmed it
>> would fix the issue?
>> 

Yes, we shared this with the customer and the refcount mismatch still 
occurred, so this doesn't seem sufficient either.

>> Could we further distinguish between dst added to the uncached list by
>> icmp6_dst_alloc() and xfrm6_fill_dst(), and confirm which ones are the
>> ones leaking reference?
>> I suspect it would be the xfrm ones, but I think it is worth 
>> verifying.
>> 

After digging into the DST allocation/destroy a bit more, it seems that 
there are some cases where the DST's refcount does not hit zero, causing 
them to never be freed and release their references.
One case comes from here on the IPv6 packet output path (these DST 
structs would hold references to both the inet6_dev and the netdevice)
ip6_pol_route_output+0x20/0x2c -> ip6_pol_route+0x1dc/0x34c -> 
rt6_make_pcpu_route+0x18/0xf4 -> ip6_rt_pcpu_alloc+0xb4/0x19c

We also see two DSTs where they are stored as the xdst->rt entry on the 
XFRM path that do not get released. One is allocated by the same path as 
above, and the other like this
xfrm6_esp_err+0x7c/0xd4 -> esp6_err+0xc8/0x100 -> 
ip6_update_pmtu+0xc8/0x100 -> __ip6_rt_update_pmtu+0x248/0x434 -> 
ip6_rt_cache_alloc+0xa0/0x1dc

 From those alloc paths it seems like the problem might not be coming 
from the uncached list after all.

> 
> Finally found the reference:
> 
> tools/testing/selftests/net/l2tp.sh at one point was triggering a
> refcount leak:
> 
> https://lore.kernel.org/netdev/20190801235421.8344-1-dsahern@kernel.org/
> 
> And then Colin found more problems with it:
> 
> https://lore.kernel.org/netdev/450f5abb-5fe8-158d-d267-4334e15f8e58@canonical.com/
> 
> 
> running that on a 5.8 kernel on Ubuntu 20.10 did not trigger the
> problem. Neither did Ubuntu 20.04 with 5.4.0-51-generic.
> 
> Can you run it on your 5.4 version and see?

We let that run for two days on our setup and didn't see anything, 
unfortunately.

Thanks,
Sean
David Ahern Dec. 11, 2020, 4:10 p.m. UTC | #7
On 12/10/20 6:12 PM, stranche@codeaurora.org wrote:
>>> BTW, have you tried your previous proposed patch and confirmed it
>>> would fix the issue?
>>>
> 
> Yes, we shared this with the customer and the refcount mismatch still
> occurred, so this doesn't seem sufficient either.
> 
>>> Could we further distinguish between dst added to the uncached list by
>>> icmp6_dst_alloc() and xfrm6_fill_dst(), and confirm which ones are the
>>> ones leaking reference?
>>> I suspect it would be the xfrm ones, but I think it is worth verifying.
>>>
> 
> After digging into the DST allocation/destroy a bit more, it seems that
> there are some cases where the DST's refcount does not hit zero, causing
> them to never be freed and release their references.
> One case comes from here on the IPv6 packet output path (these DST
> structs would hold references to both the inet6_dev and the netdevice)
> ip6_pol_route_output+0x20/0x2c -> ip6_pol_route+0x1dc/0x34c ->
> rt6_make_pcpu_route+0x18/0xf4 -> ip6_rt_pcpu_alloc+0xb4/0x19c

This is the normal data path, and this refers to a per-cpu dst cache.
Delete the route and the cached entries get removed.

> 
> We also see two DSTs where they are stored as the xdst->rt entry on the
> XFRM path that do not get released. One is allocated by the same path as
> above, and the other like this
> xfrm6_esp_err+0x7c/0xd4 -> esp6_err+0xc8/0x100 ->
> ip6_update_pmtu+0xc8/0x100 -> __ip6_rt_update_pmtu+0x248/0x434 ->
> ip6_rt_cache_alloc+0xa0/0x1dc

This entry goes into an exception cache. I have lost track of kernel
versions and features. Try listing the route cache to see these:  ip -6
ro ls cache
stranche@codeaurora.org Jan. 5, 2021, 3:05 a.m. UTC | #8
On 2020-12-11 09:10, David Ahern wrote:

>>>> Could we further distinguish between dst added to the uncached list 
>>>> by
>>>> icmp6_dst_alloc() and xfrm6_fill_dst(), and confirm which ones are 
>>>> the
>>>> ones leaking reference?
>>>> I suspect it would be the xfrm ones, but I think it is worth 
>>>> verifying.
>>>> 
>> 
>> After digging into the DST allocation/destroy a bit more, it seems 
>> that
>> there are some cases where the DST's refcount does not hit zero, 
>> causing
>> them to never be freed and release their references.
>> One case comes from here on the IPv6 packet output path (these DST
>> structs would hold references to both the inet6_dev and the netdevice)
>> ip6_pol_route_output+0x20/0x2c -> ip6_pol_route+0x1dc/0x34c ->
>> rt6_make_pcpu_route+0x18/0xf4 -> ip6_rt_pcpu_alloc+0xb4/0x19c
> 
> This is the normal data path, and this refers to a per-cpu dst cache.
> Delete the route and the cached entries get removed.
> 

After tracing all the DST entries created by the system, we've been able 
to see
that all unfreed DST entries belong to the same route on the system. One 
is the
main rt6_info struct it references and the rest are percpu copies of it.

>> 
>> We also see two DSTs where they are stored as the xdst->rt entry on 
>> the
>> XFRM path that do not get released. One is allocated by the same path 
>> as
>> above, and the other like this
>> xfrm6_esp_err+0x7c/0xd4 -> esp6_err+0xc8/0x100 ->
>> ip6_update_pmtu+0xc8/0x100 -> __ip6_rt_update_pmtu+0x248/0x434 ->
>> ip6_rt_cache_alloc+0xa0/0x1dc
> 
> This entry goes into an exception cache. I have lost track of kernel
> versions and features. Try listing the route cache to see these:  ip -6
> ro ls cache

Thanks for the tip here. We've further seen that the route that refers 
to these
unfreed DST is always a cached exception route. After tracing the routes 
as well,
we can see that the fib6_info struct for this route is never freed 
either, thus
preventing any of the DSTs associated with it from being cleaned up and 
releasing
their refcounts on the device. In fact, we can see that the fib6_info 
struct is no
longer present in the main fib6 tree after a period of time. The last 
time we're
able to see the pointer to the route in the tree is during a route 
replace
operation from userspace, but it seems that the fib6_info is not fully 
released.
In particular, the exception cache is not flushed out for the route 
during the
replace operation like it is during a standard fib6_del_route() call.

We're able to reproduce the refcount mismatch after some experimentation 
as well.
Essentially, it consists of
1) adding a default route (ip -6 route add dev XXX default)
2) forcing the creation of an exception route via manually injecting an 
ICMPv6
Packet Too Big into the device.
3) Replace the default route (ip -6 route change dev XXX default)
4) Delete the device. (ip link del XXX)

After adding a call to flush out the exception cache for the route, the 
mismatch
is no longer seen:
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 7a0c877..95e4310 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -1215,6 +1215,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, 
struct fib6_info *rt,
                 }
                 nsiblings = iter->fib6_nsiblings;
                 iter->fib6_node = NULL;
+               rt6_flush_exceptions(iter);
                 fib6_purge_rt(iter, fn, info->nl_net);
                 if (rcu_access_pointer(fn->rr_ptr) == iter)
                         fn->rr_ptr = NULL;
David Ahern Jan. 5, 2021, 4:58 a.m. UTC | #9
On 1/4/21 8:05 PM, stranche@codeaurora.org wrote:
> 
> We're able to reproduce the refcount mismatch after some experimentation
> as well.
> Essentially, it consists of
> 1) adding a default route (ip -6 route add dev XXX default)
> 2) forcing the creation of an exception route via manually injecting an
> ICMPv6
> Packet Too Big into the device.
> 3) Replace the default route (ip -6 route change dev XXX default)
> 4) Delete the device. (ip link del XXX)
> 
> After adding a call to flush out the exception cache for the route, the
> mismatch
> is no longer seen:
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 7a0c877..95e4310 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1215,6 +1215,7 @@ static int fib6_add_rt2node(struct fib6_node *fn,
> struct fib6_info *rt,
>                 }
>                 nsiblings = iter->fib6_nsiblings;
>                 iter->fib6_node = NULL;
> +               rt6_flush_exceptions(iter);
>                 fib6_purge_rt(iter, fn, info->nl_net);
>                 if (rcu_access_pointer(fn->rr_ptr) == iter)
>                         fn->rr_ptr = NULL;

Ah, I see now. rt6_flush_exceptions is called by fib6_del_route, but
that won't handle replace.

If you look at fib6_purge_rt it already has a call to remove pcpu
entries. This call to flush exceptions should go there and the existing
one in fib6_del_route can be removed.

Also, can you add the reproducer as another test case to
tools/testing/selftests/net/pmtu.sh? We definitely need one for this
sequence (route, exceptions, replace route).
Wei Wang Jan. 5, 2021, 7:09 p.m. UTC | #10
On Mon, Jan 4, 2021 at 8:58 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 1/4/21 8:05 PM, stranche@codeaurora.org wrote:
> >
> > We're able to reproduce the refcount mismatch after some experimentation
> > as well.
> > Essentially, it consists of
> > 1) adding a default route (ip -6 route add dev XXX default)
> > 2) forcing the creation of an exception route via manually injecting an
> > ICMPv6
> > Packet Too Big into the device.
> > 3) Replace the default route (ip -6 route change dev XXX default)
> > 4) Delete the device. (ip link del XXX)
> >
> > After adding a call to flush out the exception cache for the route, the
> > mismatch
> > is no longer seen:
> > diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> > index 7a0c877..95e4310 100644
> > --- a/net/ipv6/ip6_fib.c
> > +++ b/net/ipv6/ip6_fib.c
> > @@ -1215,6 +1215,7 @@ static int fib6_add_rt2node(struct fib6_node *fn,
> > struct fib6_info *rt,
> >                 }
> >                 nsiblings = iter->fib6_nsiblings;
> >                 iter->fib6_node = NULL;
> > +               rt6_flush_exceptions(iter);
> >                 fib6_purge_rt(iter, fn, info->nl_net);
> >                 if (rcu_access_pointer(fn->rr_ptr) == iter)
> >                         fn->rr_ptr = NULL;
>
> Ah, I see now. rt6_flush_exceptions is called by fib6_del_route, but
> that won't handle replace.
>
> If you look at fib6_purge_rt it already has a call to remove pcpu
> entries. This call to flush exceptions should go there and the existing
> one in fib6_del_route can be removed.
>
Thanks for catching this!
Agree with this proposed fix.


> Also, can you add the reproducer as another test case to
> tools/testing/selftests/net/pmtu.sh? We definitely need one for this
> sequence (route, exceptions, replace route).
Alexei Starovoitov Feb. 11, 2021, 7:21 p.m. UTC | #11
On Tue, Jan 5, 2021 at 11:11 AM Wei Wang <weiwan@google.com> wrote:
>
> On Mon, Jan 4, 2021 at 8:58 PM David Ahern <dsahern@gmail.com> wrote:
> >
> > On 1/4/21 8:05 PM, stranche@codeaurora.org wrote:
> > >
> > > We're able to reproduce the refcount mismatch after some experimentation
> > > as well.
> > > Essentially, it consists of
> > > 1) adding a default route (ip -6 route add dev XXX default)
> > > 2) forcing the creation of an exception route via manually injecting an
> > > ICMPv6
> > > Packet Too Big into the device.
> > > 3) Replace the default route (ip -6 route change dev XXX default)
> > > 4) Delete the device. (ip link del XXX)
> > >
> > > After adding a call to flush out the exception cache for the route, the
> > > mismatch
> > > is no longer seen:
> > > diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> > > index 7a0c877..95e4310 100644
> > > --- a/net/ipv6/ip6_fib.c
> > > +++ b/net/ipv6/ip6_fib.c
> > > @@ -1215,6 +1215,7 @@ static int fib6_add_rt2node(struct fib6_node *fn,
> > > struct fib6_info *rt,
> > >                 }
> > >                 nsiblings = iter->fib6_nsiblings;
> > >                 iter->fib6_node = NULL;
> > > +               rt6_flush_exceptions(iter);
> > >                 fib6_purge_rt(iter, fn, info->nl_net);
> > >                 if (rcu_access_pointer(fn->rr_ptr) == iter)
> > >                         fn->rr_ptr = NULL;
> >
> > Ah, I see now. rt6_flush_exceptions is called by fib6_del_route, but
> > that won't handle replace.
> >
> > If you look at fib6_purge_rt it already has a call to remove pcpu
> > entries. This call to flush exceptions should go there and the existing
> > one in fib6_del_route can be removed.
> >
> Thanks for catching this!
> Agree with this proposed fix.

Looks like this fix never landed?
Is it still needed or there was an alternative fix merged?
Jakub Kicinski Feb. 12, 2021, 1:28 a.m. UTC | #12
On Thu, 11 Feb 2021 11:21:26 -0800 Alexei Starovoitov wrote:
> On Tue, Jan 5, 2021 at 11:11 AM Wei Wang <weiwan@google.com> wrote:
> > On Mon, Jan 4, 2021 at 8:58 PM David Ahern <dsahern@gmail.com> wrote:  
> > > On 1/4/21 8:05 PM, stranche@codeaurora.org wrote:  
> > Ah, I see now. rt6_flush_exceptions is called by fib6_del_route, but
> > > that won't handle replace.
> > >
> > > If you look at fib6_purge_rt it already has a call to remove pcpu
> > > entries. This call to flush exceptions should go there and the existing
> > > one in fib6_del_route can be removed.
> > >  
> > Thanks for catching this!
> > Agree with this proposed fix.  
> 
> Looks like this fix never landed?
> Is it still needed or there was an alternative fix merged?

Wasn't it:

d8f5c29653c3 ("net: ipv6: fib: flush exceptions when purging route")

?
Alexei Starovoitov Feb. 12, 2021, 1:44 a.m. UTC | #13
On Thu, Feb 11, 2021 at 5:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 11 Feb 2021 11:21:26 -0800 Alexei Starovoitov wrote:
> > On Tue, Jan 5, 2021 at 11:11 AM Wei Wang <weiwan@google.com> wrote:
> > > On Mon, Jan 4, 2021 at 8:58 PM David Ahern <dsahern@gmail.com> wrote:
> > > > On 1/4/21 8:05 PM, stranche@codeaurora.org wrote:
> > > Ah, I see now. rt6_flush_exceptions is called by fib6_del_route, but
> > > > that won't handle replace.
> > > >
> > > > If you look at fib6_purge_rt it already has a call to remove pcpu
> > > > entries. This call to flush exceptions should go there and the existing
> > > > one in fib6_del_route can be removed.
> > > >
> > > Thanks for catching this!
> > > Agree with this proposed fix.
> >
> > Looks like this fix never landed?
> > Is it still needed or there was an alternative fix merged?
>
> Wasn't it:
>
> d8f5c29653c3 ("net: ipv6: fib: flush exceptions when purging route")
>
> ?

Ahh. Thanks!
diff mbox series

Patch

diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 6602f43..554b07b 
100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -156,9 +156,11 @@  void rt6_uncached_list_del(struct rt6_info *rt)

   char rt6_uncached_list_flush_dev_log1[1000][512];
   int rt6_uncached_list_flush_dev_log1_iter = 0; -static void 
rt6_uncached_list_flush_dev(struct net *net, struct net_device *dev)
+static void rt6_uncached_list_flush_dev(struct net *net, struct 
net_device *dev,
+                                        unsigned long event)
   {
          struct net_device *loopback_dev = net->loopback_dev;
+       bool unreg = event == NETDEV_UNREGISTER;
          int cpu;