Message ID | 20211121152453.2580051-3-razor@blackwall.org (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: nexthop: fix refcount issues when replacing groups | expand |
On Sun, Nov 21, 2021 at 05:24:52PM +0200, Nikolay Aleksandrov wrote: > From: Nikolay Aleksandrov <nikolay@nvidia.com> > > When replacing a nexthop group, we must release the IPv6 per-cpu dsts of > the removed nexthop entries after an RCU grace period because they > contain references to the nexthop's net device and to the fib6 info. > With specific series of events[1] we can reach net device refcount > imbalance which is unrecoverable. > > [1] > $ ip nexthop list > id 200 via 2002:db8::2 dev bridge.10 scope link onlink > id 201 via 2002:db8::3 dev bridge scope link onlink > id 203 group 201/200 > $ ip -6 route > 2001:db8::10 nhid 203 metric 1024 pref medium > nexthop via 2002:db8::3 dev bridge weight 1 onlink > nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink > > Create rt6_info through one of the multipath legs, e.g.: > $ taskset -a -c 1 ./pkt_inj 24 bridge.10 2001:db8::10 > (pkt_inj is just a custom packet generator, nothing special) > > Then remove that leg from the group by replace (let's assume it is id > 200 in this case): > $ ip nexthop replace id 203 group 201 > > Now remove the IPv6 route: > $ ip -6 route del 2001:db8::10/128 > > The route won't be really deleted due to the stale rt6_info holding 1 > refcnt in nexthop id 200. > At this point we have the following reference count dependency: > (deleted) IPv6 route holds 1 reference over nhid 203 > nh 203 holds 1 ref over id 201 > nh 200 holds 1 ref over the net device and the route due to the stale > rt6_info > > Now to create circular dependency between nh 200 and the IPv6 route, and > also to get a reference over nh 200, restore nhid 200 in the group: > $ ip nexthop replace id 203 group 201/200 > > And now we have a permanent circular dependncy because nhid 203 holds a > reference over nh 200 and 201, but the route holds a ref over nh 203 and > is deleted. > > To trigger the bug just delete the group (nhid 203): > $ ip nexthop del id 203 > > It won't really be deleted due to the IPv6 route dependency, and now we > have 2 unlinked and deleted objects that reference each other: the group > and the IPv6 route. Since the group drops the reference it holds over its > entries at free time (i.e. its own refcount needs to drop to 0) that will > never happen and we get a permanent ref on them, since one of the entries > holds a reference over the IPv6 route it will also never be released. > > At this point the dependencies are: > (deleted, only unlinked) IPv6 route holds reference over group nh 203 > (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200 > nh 200 holds 1 ref over the net device and the route due to the stale > rt6_info > > This is the last point where it can be fixed by running traffic through > nh 200, and specifically through the same CPU so the rt6_info (dst) will > get released due to the IPv6 genid, that in turn will free the IPv6 > route, which in turn will free the ref count over the group nh 203. > > If nh 200 is deleted at this point, it will never be released due to the > ref from the unlinked group 203, it will only be unlinked: > $ ip nexthop del id 200 > $ ip nexthop > $ > > Now we can never release that stale rt6_info, we have IPv6 route with ref > over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with > rt6_info (dst) with ref over the net device and the IPv6 route. All of > these objects are only unlinked, and cannot be released, thus they can't > release their ref counts. > > Message from syslogd@dev at Nov 19 14:04:10 ... > kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 > Message from syslogd@dev at Nov 19 14:04:20 ... > kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3 > > Fixes: 7bf4796dd099 ("nexthops: add support for replace") > Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> > --- > net/ipv4/nexthop.c | 25 +++++++++++++++++++++++-- > 1 file changed, 23 insertions(+), 2 deletions(-) > > diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c > index 9e8100728d46..a69a9e76f99f 100644 > --- a/net/ipv4/nexthop.c > +++ b/net/ipv4/nexthop.c > @@ -1899,15 +1899,36 @@ static void remove_nexthop(struct net *net, struct nexthop *nh, > /* if any FIB entries reference this nexthop, any dst entries > * need to be regenerated > */ > -static void nh_rt_cache_flush(struct net *net, struct nexthop *nh) > +static void nh_rt_cache_flush(struct net *net, struct nexthop *nh, > + struct nexthop *replaced_nh) > { > struct fib6_info *f6i; > + struct nh_group *nhg; > + int i; > > if (!list_empty(&nh->fi_list)) > rt_cache_flush(net); > > list_for_each_entry(f6i, &nh->f6i_list, nh_list) > ipv6_stub->fib6_update_sernum(net, f6i); > + > + /* if an IPv6 group was replaced, we have to release all old > + * dsts to make sure all refcounts are released > + */ This problem is specific to IPv6 because IPv4 dst entries do not hold references on routes / FIB info thereby avoiding the circular dependency described in the commit message? > + if (!replaced_nh->is_group) > + return; Does it also make sense to skip the part below if we don't have any IPv6 routes using the nexthop? > + > + /* new dsts must use only the new nexthop group */ > + synchronize_net(); > + > + nhg = rtnl_dereference(replaced_nh->nh_grp); In replace_nexthop_grp() that precedes this function we assign the new nh_group to the nexthop shell used by the routes: rcu_assign_pointer(old->nh_grp, newg); And the old one that you want to purge is stored in 'replaced_nh->nh_grp': rcu_assign_pointer(new->nh_grp, oldg); The need for synchronize_net() above stems from the fact that some CPUs might still be using 'oldg' via 'old->nh_grp'? If so, we already have one synchronize_net() in replace_nexthop_grp() for resilient groups. See: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=563f23b002534176f49524b5ca0e1d94d8906c40 Can we avoid two synchronize_net() per resilient group by removing the one added here and instead do: diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index a69a9e76f99f..a47ce43ab1ff 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -2002,9 +2002,10 @@ static int replace_nexthop_grp(struct net *net, struct nexthop *old, rcu_assign_pointer(old->nh_grp, newg); + /* Make sure concurrent readers are not using 'oldg' anymore. */ + synchronize_net(); + if (newg->resilient) { - /* Make sure concurrent readers are not using 'oldg' anymore. */ - synchronize_net(); rcu_assign_pointer(oldg->res_table, tmp_table); rcu_assign_pointer(oldg->spare->res_table, tmp_table); } > + for (i = 0; i < nhg->num_nh; i++) { > + struct nh_grp_entry *nhge = &nhg->nh_entries[i]; > + struct nh_info *nhi = rtnl_dereference(nhge->nh->nh_info); > + > + if (nhi->family == AF_INET6) > + ipv6_stub->fib6_nh_release_dsts(&nhi->fib6_nh); > + } > } > > static int replace_nexthop_grp(struct net *net, struct nexthop *old, > @@ -2247,7 +2268,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old, > err = replace_nexthop_single(net, old, new, extack); > > if (!err) { > - nh_rt_cache_flush(net, old); > + nh_rt_cache_flush(net, old, new); > > __remove_nexthop(net, new, NULL); > nexthop_put(new); > -- > 2.31.1 >
On Sun, Nov 21, 2021 at 07:17:41PM +0200, Ido Schimmel wrote: > On Sun, Nov 21, 2021 at 05:24:52PM +0200, Nikolay Aleksandrov wrote: > > From: Nikolay Aleksandrov <nikolay@nvidia.com> > Can we avoid two synchronize_net() per resilient group by removing the > one added here and instead do: > > diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c > index a69a9e76f99f..a47ce43ab1ff 100644 > --- a/net/ipv4/nexthop.c > +++ b/net/ipv4/nexthop.c > @@ -2002,9 +2002,10 @@ static int replace_nexthop_grp(struct net *net, struct nexthop *old, > > rcu_assign_pointer(old->nh_grp, newg); > > + /* Make sure concurrent readers are not using 'oldg' anymore. */ > + synchronize_net(); > + > if (newg->resilient) { > - /* Make sure concurrent readers are not using 'oldg' anymore. */ > - synchronize_net(); > rcu_assign_pointer(oldg->res_table, tmp_table); > rcu_assign_pointer(oldg->spare->res_table, tmp_table); > } Discussed this with Nik. It is possible and would be a good cleanup for net-next. For net it is best to leave synchronize_net() where it is so that the patch will be easier to backport. Resilient nexthop groups were only added in 5.13 whereas nexthop objects were added in 5.3
On 21/11/2021 19:35, Ido Schimmel wrote: > On Sun, Nov 21, 2021 at 07:17:41PM +0200, Ido Schimmel wrote: >> On Sun, Nov 21, 2021 at 05:24:52PM +0200, Nikolay Aleksandrov wrote: >>> From: Nikolay Aleksandrov <nikolay@nvidia.com> >> Can we avoid two synchronize_net() per resilient group by removing the >> one added here and instead do: >> >> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c >> index a69a9e76f99f..a47ce43ab1ff 100644 >> --- a/net/ipv4/nexthop.c >> +++ b/net/ipv4/nexthop.c >> @@ -2002,9 +2002,10 @@ static int replace_nexthop_grp(struct net *net, struct nexthop *old, >> >> rcu_assign_pointer(old->nh_grp, newg); >> >> + /* Make sure concurrent readers are not using 'oldg' anymore. */ >> + synchronize_net(); >> + >> if (newg->resilient) { >> - /* Make sure concurrent readers are not using 'oldg' anymore. */ >> - synchronize_net(); >> rcu_assign_pointer(oldg->res_table, tmp_table); >> rcu_assign_pointer(oldg->spare->res_table, tmp_table); >> } > > Discussed this with Nik. It is possible and would be a good cleanup for > net-next. For net it is best to leave synchronize_net() where it is so > that the patch will be easier to backport. Resilient nexthop groups were > only added in 5.13 whereas nexthop objects were added in 5.3 > Indeed, thank you for the review. I'll send patches for this suggestion and the IPv6 optimization (that is one of the optimizations I was referring to in the cover letter) for net-next after net is merged.
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c index 9e8100728d46..a69a9e76f99f 100644 --- a/net/ipv4/nexthop.c +++ b/net/ipv4/nexthop.c @@ -1899,15 +1899,36 @@ static void remove_nexthop(struct net *net, struct nexthop *nh, /* if any FIB entries reference this nexthop, any dst entries * need to be regenerated */ -static void nh_rt_cache_flush(struct net *net, struct nexthop *nh) +static void nh_rt_cache_flush(struct net *net, struct nexthop *nh, + struct nexthop *replaced_nh) { struct fib6_info *f6i; + struct nh_group *nhg; + int i; if (!list_empty(&nh->fi_list)) rt_cache_flush(net); list_for_each_entry(f6i, &nh->f6i_list, nh_list) ipv6_stub->fib6_update_sernum(net, f6i); + + /* if an IPv6 group was replaced, we have to release all old + * dsts to make sure all refcounts are released + */ + if (!replaced_nh->is_group) + return; + + /* new dsts must use only the new nexthop group */ + synchronize_net(); + + nhg = rtnl_dereference(replaced_nh->nh_grp); + for (i = 0; i < nhg->num_nh; i++) { + struct nh_grp_entry *nhge = &nhg->nh_entries[i]; + struct nh_info *nhi = rtnl_dereference(nhge->nh->nh_info); + + if (nhi->family == AF_INET6) + ipv6_stub->fib6_nh_release_dsts(&nhi->fib6_nh); + } } static int replace_nexthop_grp(struct net *net, struct nexthop *old, @@ -2247,7 +2268,7 @@ static int replace_nexthop(struct net *net, struct nexthop *old, err = replace_nexthop_single(net, old, new, extack); if (!err) { - nh_rt_cache_flush(net, old); + nh_rt_cache_flush(net, old, new); __remove_nexthop(net, new, NULL); nexthop_put(new);