
[net,0/3] net: nexthop: fix refcount issues when replacing groups

Message ID 20211121152453.2580051-1-razor@blackwall.org (mailing list archive)

Message

Nikolay Aleksandrov Nov. 21, 2021, 3:24 p.m. UTC
From: Nikolay Aleksandrov <nikolay@nvidia.com>

Hi,
This set fixes a refcount bug when replacing nexthop groups and
modifying routes. It is complex because the objects look valid when
debugging memory dumps, but we end up with a refcount dependency
between unlinked objects which can never be released, so in turn they
can never free their resources or drop the references they hold. The
problem happens because we can have stale IPv6 per-cpu dsts in nexthops
which were removed from a group. Even though the IPv6 genid is bumped,
the dsts won't be released until traffic passes through them or the
nexthop is freed, which can take an arbitrarily long time; even worse,
we can create a scenario[1] where they can never be released. The fix
is to release the IPv6 per-cpu dsts of replaced nexthops after an RCU
grace period, so no new ones can be created. To do that we add a new
IPv6 stub - fib6_nh_release_dsts - which is used by the nexthop code
only when necessary. We can further optimize group replacement, but
that is better suited for net-next as these patches will have to be
backported to stable releases.
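
For illustration, such a stub boils down to walking the fib6_nh's per-cpu
dst cache and dropping every cached entry, roughly along the lines of the
sketch below (a sketch only; the exact code is in patch 01):

 void fib6_nh_release_dsts(struct fib6_nh *fib6_nh)
 {
         int cpu;

         /* no per-cpu dst cache was allocated for this nexthop */
         if (!fib6_nh->rt6i_pcpu)
                 return;

         for_each_possible_cpu(cpu) {
                 struct rt6_info *pcpu_rt, **ppcpu_rt;

                 /* detach the cached dst so nothing new can find it... */
                 ppcpu_rt = per_cpu_ptr(fib6_nh->rt6i_pcpu, cpu);
                 pcpu_rt = xchg(ppcpu_rt, NULL);
                 if (pcpu_rt) {
                         /* ...then drop its device and dst references */
                         dst_dev_put(&pcpu_rt->dst);
                         dst_release(&pcpu_rt->dst);
                 }
         }
 }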

Thanks,
 Nik

[1]
This info is also present in patch 02's commit message.
Initial state:
 $ ip nexthop list
  id 200 via 2002:db8::2 dev bridge.10 scope link onlink
  id 201 via 2002:db8::3 dev bridge scope link onlink
  id 203 group 201/200
 $ ip -6 route
  2001:db8::10 nhid 203 metric 1024 pref medium
     nexthop via 2002:db8::3 dev bridge weight 1 onlink
     nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
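
For reference, state like the above can be recreated with something along
these lines (addresses and devices as shown above; the underlying bridge
and VLAN setup is omitted):
 $ ip nexthop add id 200 via 2002:db8::2 dev bridge.10 onlink
 $ ip nexthop add id 201 via 2002:db8::3 dev bridge onlink
 $ ip nexthop add id 203 group 201/200
 $ ip -6 route add 2001:db8::10/128 nhid 203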

Create rt6_info through one of the multipath legs, e.g.:
 $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
 (pkt_inj is just a custom packet generator, nothing special)
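Any packet generator that sends towards 2001:db8::10 through that leg will
do; e.g. mausezahn pinned to a CPU in the same way - the options and the
source address below are only illustrative:
 $ taskset -a -c 1 mausezahn -6 bridge.10 -A 2002:db8::1 -B 2001:db8::10 -c 5 -t tcp "port=1,flags=syn"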

Then remove that leg from the group by replace (let's assume it is id
200 in this case):
 $ ip nexthop replace id 203 group 201

Now remove the IPv6 route:
 $ ip -6 route del 2001:db8::10/128

The route won't really be deleted due to the stale rt6_info in nexthop id
200, which holds 1 refcnt over it.
At this point we have the following reference count dependency:
 (deleted) IPv6 route holds 1 reference over nhid 203
 nh 203 holds 1 ref over id 201
 nh 200 holds 1 ref over the net device and the route due to the stale
 rt6_info

Now, to create a circular dependency between nh 200 and the IPv6 route, and
also to get a reference over nh 200, restore nhid 200 in the group:
 $ ip nexthop replace id 203 group 201/200

And now we have a permanent circular dependency because nhid 203 holds a
reference over nh 200 and 201, while the deleted route still holds a ref
over nh 203.

To trigger the bug just delete the group (nhid 203):
 $ ip nexthop del id 203

It won't really be deleted due to the IPv6 route dependency, and now we
have 2 unlinked and deleted objects that reference each other: the group
and the IPv6 route. Since the group drops the references it holds over its
entries only at free time (i.e. when its own refcount drops to 0), that
will never happen and the entries keep a permanent ref. And since one of
those entries holds a reference over the IPv6 route via its stale rt6_info,
the route will also never be released.

At this point the dependencies are:
 (deleted, only unlinked) IPv6 route holds reference over group nh 203
 (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
 nh 200 holds 1 ref over the net device and the route due to the stale
 rt6_info

This is the last point where it can be fixed by running traffic through nh
200, and specifically through the same CPU, so that the stale rt6_info (dst)
gets released due to the IPv6 genid check; that in turn will free the IPv6
route, which in turn will drop the ref it holds over the group nh 203.
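E.g. re-running the same injection from above, pinned to the same CPU, would
still untangle everything at this stage:
 $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10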

If nh 200 is deleted at this point, it will never be released due to the ref
from the unlinked group 203; it will only be unlinked:
 $ ip nexthop del id 200
 $ ip nexthop
 $

Now we can never release that stale rt6_info: the IPv6 route holds a ref
over group nh 203, group nh 203 holds refs over nh 200 and 201, and nh 200
has an rt6_info (dst) holding refs over the net device and the IPv6 route.
All of these objects are only unlinked; they cannot be released, and thus
they can't drop the ref counts they hold.

 Message from syslogd@dev at Nov 19 14:04:10 ...
  kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
 Message from syslogd@dev at Nov 19 14:04:20 ...
  kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3


Nikolay Aleksandrov (3):
  net: ipv6: add fib6_nh_release_dsts stub
  net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group
  selftests: net: fib_nexthops: add test for group refcount imbalance
    bug

 include/net/ip6_fib.h                       |  1 +
 include/net/ipv6_stubs.h                    |  1 +
 net/ipv4/nexthop.c                          | 25 ++++++++-
 net/ipv6/af_inet6.c                         |  1 +
 net/ipv6/route.c                            | 19 +++++++
 tools/testing/selftests/net/fib_nexthops.sh | 56 +++++++++++++++++++++
 6 files changed, 101 insertions(+), 2 deletions(-)

Comments

Ido Schimmel Nov. 21, 2021, 5:55 p.m. UTC | #1
On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
> From: Nikolay Aleksandrov <nikolay@nvidia.com>
> 
> Hi,
> This set fixes a refcount bug when replacing nexthop groups and
> modifying routes. It is complex because the objects look valid when
> debugging memory dumps, but we end up having refcount dependency between
> unlinked objects which can never be released, so in turn they cannot
> free their resources and refcounts. The problem happens because we can
> have stale IPv6 per-cpu dsts in nexthops which were removed from a
> group. Even though the IPv6 gen is bumped, the dsts won't be released
> until traffic passes through them or the nexthop is freed, that can take
> arbitrarily long time, and even worse we can create a scenario[1] where it
> can never be released. The fix is to release the IPv6 per-cpu dsts of
> replaced nexthops after an RCU grace period so no new ones can be
> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
> is used by the nexthop code only when necessary. We can further optimize
> group replacement, but that is more suited for net-next as these patches
> would have to be backported to stable releases.

Will run regression with these patches tonight and report tomorrow
Nikolay Aleksandrov Nov. 21, 2021, 6:17 p.m. UTC | #2
On 21/11/2021 19:55, Ido Schimmel wrote:
> On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
>> From: Nikolay Aleksandrov <nikolay@nvidia.com>
>>
>> Hi,
>> This set fixes a refcount bug when replacing nexthop groups and
>> modifying routes. It is complex because the objects look valid when
>> debugging memory dumps, but we end up having refcount dependency between
>> unlinked objects which can never be released, so in turn they cannot
>> free their resources and refcounts. The problem happens because we can
>> have stale IPv6 per-cpu dsts in nexthops which were removed from a
>> group. Even though the IPv6 gen is bumped, the dsts won't be released
>> until traffic passes through them or the nexthop is freed, that can take
>> arbitrarily long time, and even worse we can create a scenario[1] where it
>> can never be released. The fix is to release the IPv6 per-cpu dsts of
>> replaced nexthops after an RCU grace period so no new ones can be
>> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
>> is used by the nexthop code only when necessary. We can further optimize
>> group replacement, but that is more suited for net-next as these patches
>> would have to be backported to stable releases.
> 
> Will run regression with these patches tonight and report tomorrow
> 

Thank you, I've prepared v2 with the selftest mausezahn check and will hold
it back to see how the tests go, and whether any comments show up in the
meantime. :)

By the way, I've been running a torture test all day: multiple IPv6 routes
with forwarded + local traffic through different CPUs, while also replacing
multiple nh groups referencing multiple nexthops; so far it looks good.
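
The group-replace part of such a test boils down to something like the loop
below (illustrative only, reusing the ids from the example above; the real
test spans many more groups and nexthops):
 $ while :; do ip nexthop replace id 203 group 201/200; ip nexthop replace id 203 group 201; done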
Ido Schimmel Nov. 22, 2021, 9:48 a.m. UTC | #3
On Sun, Nov 21, 2021 at 08:17:49PM +0200, Nikolay Aleksandrov wrote:
> On 21/11/2021 19:55, Ido Schimmel wrote:
> > On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
> >> From: Nikolay Aleksandrov <nikolay@nvidia.com>
> >>
> >> Hi,
> >> This set fixes a refcount bug when replacing nexthop groups and
> >> modifying routes. It is complex because the objects look valid when
> >> debugging memory dumps, but we end up having refcount dependency between
> >> unlinked objects which can never be released, so in turn they cannot
> >> free their resources and refcounts. The problem happens because we can
> >> have stale IPv6 per-cpu dsts in nexthops which were removed from a
> >> group. Even though the IPv6 gen is bumped, the dsts won't be released
> >> until traffic passes through them or the nexthop is freed, that can take
> >> arbitrarily long time, and even worse we can create a scenario[1] where it
> >> can never be released. The fix is to release the IPv6 per-cpu dsts of
> >> replaced nexthops after an RCU grace period so no new ones can be
> >> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
> >> is used by the nexthop code only when necessary. We can further optimize
> >> group replacement, but that is more suited for net-next as these patches
> >> would have to be backported to stable releases.
> > 
> > Will run regression with these patches tonight and report tomorrow
> > 
> 
> Thank you, I've prepared v2 with the selftest mausezahn check and will hold
> it off to see how the tests would go. Also if any comments show up in the
> meantime. :)
> 
> By the way I've been running a torture test all day for multiple IPv6 route
> forwarding + local traffic through different CPUs while also replacing multiple
> nh groups referencing multiple nexthops, so far it looks good.

Regression looks good. Later today I will also have results from a debug
kernel, but I think it should be fine.

Regarding patch #2, can you add a comment (or edit the commit message)
to explain why the fix is only relevant for IPv6? I made this comment,
but I think it was missed:

"This problem is specific to IPv6 because IPv4 dst entries do not hold
references on routes / FIB info thereby avoiding the circular dependency
described in the commit message?"

Thanks!
Nikolay Aleksandrov Nov. 22, 2021, 9:53 a.m. UTC | #4
On 22/11/2021 11:48, Ido Schimmel wrote:
> On Sun, Nov 21, 2021 at 08:17:49PM +0200, Nikolay Aleksandrov wrote:
>> On 21/11/2021 19:55, Ido Schimmel wrote:
>>> On Sun, Nov 21, 2021 at 05:24:50PM +0200, Nikolay Aleksandrov wrote:
>>>> From: Nikolay Aleksandrov <nikolay@nvidia.com>
>>>>
>>>> Hi,
>>>> This set fixes a refcount bug when replacing nexthop groups and
>>>> modifying routes. It is complex because the objects look valid when
>>>> debugging memory dumps, but we end up having refcount dependency between
>>>> unlinked objects which can never be released, so in turn they cannot
>>>> free their resources and refcounts. The problem happens because we can
>>>> have stale IPv6 per-cpu dsts in nexthops which were removed from a
>>>> group. Even though the IPv6 gen is bumped, the dsts won't be released
>>>> until traffic passes through them or the nexthop is freed, that can take
>>>> arbitrarily long time, and even worse we can create a scenario[1] where it
>>>> can never be released. The fix is to release the IPv6 per-cpu dsts of
>>>> replaced nexthops after an RCU grace period so no new ones can be
>>>> created. To do that we add a new IPv6 stub - fib6_nh_release_dsts, which
>>>> is used by the nexthop code only when necessary. We can further optimize
>>>> group replacement, but that is more suited for net-next as these patches
>>>> would have to be backported to stable releases.
>>>
>>> Will run regression with these patches tonight and report tomorrow
>>>
>>
>> Thank you, I've prepared v2 with the selftest mausezahn check and will hold
>> it off to see how the tests would go. Also if any comments show up in the
>> meantime. :)
>>
>> By the way I've been running a torture test all day for multiple IPv6 route
>> forwarding + local traffic through different CPUs while also replacing multiple
>> nh groups referencing multiple nexthops, so far it looks good.
> 
> Regression looks good. Later today I will also have results from a debug
> kernel, but I think it should be fine.
> 
> Regarding patch #2, can you add a comment (or edit the commit message)
> to explain why the fix is only relevant for IPv4? I made this comment,
> but I think it was missed:
> 

I saw it; I've updated the commit msg to reflect why IPv4 isn't affected.

> "This problem is specific to IPv6 because IPv4 dst entries do not hold
> references on routes / FIB info thereby avoiding the circular dependency
> described in the commit message?"
> 
> Thanks!
> 

Cheers,
 Nik