diff mbox series

[net,2/2] vlan: move dev_put into vlan_dev_uninit

Message ID 76c52badfdccaa7f094d959eaf24f422ae09dec6.1644394642.git.lucien.xin@gmail.com (mailing list archive)
State Accepted
Commit d6ff94afd90b0ce8d1715f8ef77d4347d7a7f2c0
Delegated to: Netdev Maintainers
Headers show
Series vlan: fix a netdev refcnt leak for QinQ | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers fail 1 blamed authors not CCed: jgg@ziepe.ca; 2 maintainers not CCed: pablo@netfilter.org jgg@ziepe.ca
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 21 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Xin Long Feb. 9, 2022, 8:19 a.m. UTC
Shuang Li reported an QinQ issue by simply doing:

  # ip link add dummy0 type dummy
  # ip link add link dummy0 name dummy0.1 type vlan id 1
  # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
  # rmmod 8021q

 unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1

When rmmods 8021q, all vlan devs are deleted from their real_dev's vlan grp
and added into list_kill by unregister_vlan_dev(). dummy0.1 is unregistered
before dummy0.1.2, as it's using for_each_netdev() in __rtnl_kill_links().

When unregisters dummy0.1, dummy0.1.2 is not unregistered in the event of
NETDEV_UNREGISTER, as it's been deleted from dummy0.1's vlan grp. However,
due to dummy0.1.2 still holding dummy0.1, dummy0.1 will keep waiting in
netdev_wait_allrefs(), while dummy0.1.2 will never get unregistered and
release dummy0.1, as it delays dev_put until calling dev->priv_destructor,
vlan_dev_free().

This issue was introduced by Commit 563bcbae3ba2 ("net: vlan: fix a UAF in
vlan_dev_real_dev()"), and this patch is to fix it by moving dev_put() into
vlan_dev_uninit(), which is called after NETDEV_UNREGISTER event but before
netdev_wait_allrefs().

Fixes: 563bcbae3ba2 ("net: vlan: fix a UAF in vlan_dev_real_dev()")
Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/8021q/vlan_dev.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Jakub Kicinski Feb. 10, 2022, 1:55 a.m. UTC | #1
On Wed,  9 Feb 2022 03:19:56 -0500 Xin Long wrote:
> Shuang Li reported an QinQ issue by simply doing:
> 
>   # ip link add dummy0 type dummy
>   # ip link add link dummy0 name dummy0.1 type vlan id 1
>   # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
>   # rmmod 8021q
> 
>  unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1

How about we put this in a selftest under tools/testing/selftests/net/
or tools/testing/selftests/drivers/net/ ?

> When rmmods 8021q, all vlan devs are deleted from their real_dev's vlan grp
> and added into list_kill by unregister_vlan_dev(). dummy0.1 is unregistered
> before dummy0.1.2, as it's using for_each_netdev() in __rtnl_kill_links().
> 
> When unregisters dummy0.1, dummy0.1.2 is not unregistered in the event of
> NETDEV_UNREGISTER, as it's been deleted from dummy0.1's vlan grp. However,
> due to dummy0.1.2 still holding dummy0.1, dummy0.1 will keep waiting in
> netdev_wait_allrefs(), while dummy0.1.2 will never get unregistered and
> release dummy0.1, as it delays dev_put until calling dev->priv_destructor,
> vlan_dev_free().
> 
> This issue was introduced by Commit 563bcbae3ba2 ("net: vlan: fix a UAF in
> vlan_dev_real_dev()"), and this patch is to fix it by moving dev_put() into
> vlan_dev_uninit(), which is called after NETDEV_UNREGISTER event but before
> netdev_wait_allrefs().
> 
> Fixes: 563bcbae3ba2 ("net: vlan: fix a UAF in vlan_dev_real_dev()")

As far as I understand this is pretty much a revert of the previous fix.
Note that netdevice_event_work_handler() as seen in the backtrace in the
commit message of the fix in question is called from a workqueue, so the
ordering of netdev notifications saves us from nothing here. We can't
start freeing state until all refs are gone.

I think better fix would be to rewrite netdev_run_todo() to free the
netdevs in any order they become ready. That's gonna solve any
dependency problems and may even speed things up.
Xin Long Feb. 10, 2022, 3:40 a.m. UTC | #2
On Thu, Feb 10, 2022 at 9:56 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed,  9 Feb 2022 03:19:56 -0500 Xin Long wrote:
> > Shuang Li reported an QinQ issue by simply doing:
> >
> >   # ip link add dummy0 type dummy
> >   # ip link add link dummy0 name dummy0.1 type vlan id 1
> >   # ip link add link dummy0.1 name dummy0.1.2 type vlan id 2
> >   # rmmod 8021q
> >
> >  unregister_netdevice: waiting for dummy0.1 to become free. Usage count = 1
>
> How about we put this in a selftest under tools/testing/selftests/net/
> or tools/testing/selftests/drivers/net/ ?
I will try.

>
> > When rmmods 8021q, all vlan devs are deleted from their real_dev's vlan grp
> > and added into list_kill by unregister_vlan_dev(). dummy0.1 is unregistered
> > before dummy0.1.2, as it's using for_each_netdev() in __rtnl_kill_links().
> >
> > When unregisters dummy0.1, dummy0.1.2 is not unregistered in the event of
> > NETDEV_UNREGISTER, as it's been deleted from dummy0.1's vlan grp. However,
> > due to dummy0.1.2 still holding dummy0.1, dummy0.1 will keep waiting in
> > netdev_wait_allrefs(), while dummy0.1.2 will never get unregistered and
> > release dummy0.1, as it delays dev_put until calling dev->priv_destructor,
> > vlan_dev_free().
> >
> > This issue was introduced by Commit 563bcbae3ba2 ("net: vlan: fix a UAF in
> > vlan_dev_real_dev()"), and this patch is to fix it by moving dev_put() into
> > vlan_dev_uninit(), which is called after NETDEV_UNREGISTER event but before
> > netdev_wait_allrefs().
> >
> > Fixes: 563bcbae3ba2 ("net: vlan: fix a UAF in vlan_dev_real_dev()")
>
> As far as I understand this is pretty much a revert of the previous fix.
> Note that netdevice_event_work_handler() as seen in the backtrace in the
> commit message of the fix in question is called from a workqueue, so the
> ordering of netdev notifications saves us from nothing here. We can't
> start freeing state until all refs are gone.
understand.

>
> I think better fix would be to rewrite netdev_run_todo() to free the
> netdevs in any order they become ready. That's gonna solve any
> dependency problems and may even speed things up.
>
What about I keep dev_put() in dev->priv_destructor()/vlan_dev_free() for
vlan as before, and fix this problem by using for_each_netdev_reverse()
in __rtnl_kill_links()?
It will make sense as the late added dev should be deleted early when
rtnl_link_unregister a rtnl_link_ops.
Jakub Kicinski Feb. 10, 2022, 5:28 a.m. UTC | #3
On Thu, 10 Feb 2022 11:40:42 +0800 Xin Long wrote:
> > I think better fix would be to rewrite netdev_run_todo() to free the
> > netdevs in any order they become ready. That's gonna solve any
> > dependency problems and may even speed things up.
>
> What about I keep dev_put() in dev->priv_destructor()/vlan_dev_free() for
> vlan as before, and fix this problem by using for_each_netdev_reverse()
> in __rtnl_kill_links()?
> It will make sense as the late added dev should be deleted early when
> rtnl_link_unregister a rtnl_link_ops.

Feels like sooner or later we'll run into a scenario when reversing will
cause a problem. Or some data structure will stop preserving the order.

Do you reckon rewriting netdev_run_todo() will be a lot of effort or
it's too risky?
Eric Dumazet Feb. 10, 2022, 5:49 a.m. UTC | #4
On Wed, Feb 9, 2022 at 9:28 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 10 Feb 2022 11:40:42 +0800 Xin Long wrote:
> > > I think better fix would be to rewrite netdev_run_todo() to free the
> > > netdevs in any order they become ready. That's gonna solve any
> > > dependency problems and may even speed things up.
> >
> > What about I keep dev_put() in dev->priv_destructor()/vlan_dev_free() for
> > vlan as before, and fix this problem by using for_each_netdev_reverse()
> > in __rtnl_kill_links()?
> > It will make sense as the late added dev should be deleted early when
> > rtnl_link_unregister a rtnl_link_ops.
>
> Feels like sooner or later we'll run into a scenario when reversing will
> cause a problem. Or some data structure will stop preserving the order.
>
> Do you reckon rewriting netdev_run_todo() will be a lot of effort or
> it's too risky?

This is doable, and risky ;)

BTW, I have the plan of generalizing blackhole_netdev for IPv6,
meaning that we could perhaps get rid of the dependency
about loopback dev, being the last device in a netns being dismantled.
Jakub Kicinski Feb. 10, 2022, 5:59 a.m. UTC | #5
On Wed, 9 Feb 2022 21:49:51 -0800 Eric Dumazet wrote:
> > Feels like sooner or later we'll run into a scenario when reversing will
> > cause a problem. Or some data structure will stop preserving the order.
> >
> > Do you reckon rewriting netdev_run_todo() will be a lot of effort or
> > it's too risky?  
> 
> This is doable, and risky ;)
> 
> BTW, I have the plan of generalizing blackhole_netdev for IPv6,
> meaning that we could perhaps get rid of the dependency
> about loopback dev, being the last device in a netns being dismantled.

Oh, I see..

I have no great ideas then, we may need to go back to zeroing
vlan->real_dev and making sure the caller can deal with that.
At least for the time being. Xin this was discussed at some 
length in response to the patch under Fixes.
Eric Dumazet Feb. 10, 2022, 6:10 a.m. UTC | #6
On Wed, Feb 9, 2022 at 9:49 PM Eric Dumazet <edumazet@google.com> wrote:

>
> BTW, I have the plan of generalizing blackhole_netdev for IPv6,
> meaning that we could perhaps get rid of the dependency
> about loopback dev, being the last device in a netns being dismantled.

Untested patch to give more ideas:
(Main incentive is that we are still chasing a unregister_device bug,
that I now think is related to IPv6 idev leaking somewhere)


diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 4f402bc38f056e08f3761e63a7bc7a51e54e9384..02d31d4fcab3b3d529c4fe3260216ecee1108e82
100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -372,7 +372,7 @@ static struct inet6_dev *ipv6_add_dev(struct
net_device *dev)

        ASSERT_RTNL();

-       if (dev->mtu < IPV6_MIN_MTU)
+       if (dev->mtu < IPV6_MIN_MTU && dev != blackhole_netdev)
                return ERR_PTR(-EINVAL);

        ndev = kzalloc(sizeof(struct inet6_dev), GFP_KERNEL);
@@ -400,21 +400,22 @@ static struct inet6_dev *ipv6_add_dev(struct
net_device *dev)
        /* We refer to the device */
        dev_hold_track(dev, &ndev->dev_tracker, GFP_KERNEL);

-       if (snmp6_alloc_dev(ndev) < 0) {
-               netdev_dbg(dev, "%s: cannot allocate memory for statistics\n",
-                          __func__);
-               neigh_parms_release(&nd_tbl, ndev->nd_parms);
-               dev_put_track(dev, &ndev->dev_tracker);
-               kfree(ndev);
-               return ERR_PTR(err);
-       }
+       if (dev != blackhole_netdev) {
+               if (snmp6_alloc_dev(ndev) < 0) {
+                       netdev_dbg(dev, "%s: cannot allocate memory
for statistics\n",
+                                  __func__);
+                       neigh_parms_release(&nd_tbl, ndev->nd_parms);
+                       dev_put_track(dev, &ndev->dev_tracker);
+                       kfree(ndev);
+                       return ERR_PTR(err);
+               }

-       if (snmp6_register_dev(ndev) < 0) {
-               netdev_dbg(dev, "%s: cannot create /proc/net/dev_snmp6/%s\n",
-                          __func__, dev->name);
-               goto err_release;
+               if (snmp6_register_dev(ndev) < 0) {
+                       netdev_dbg(dev, "%s: cannot create
/proc/net/dev_snmp6/%s\n",
+                                  __func__, dev->name);
+                       goto err_release;
+               }
        }
-
        /* One reference from device. */
        refcount_set(&ndev->refcnt, 1);

@@ -445,25 +446,28 @@ static struct inet6_dev *ipv6_add_dev(struct
net_device *dev)

        ipv6_mc_init_dev(ndev);
        ndev->tstamp = jiffies;
-       err = addrconf_sysctl_register(ndev);
-       if (err) {
-               ipv6_mc_destroy_dev(ndev);
-               snmp6_unregister_dev(ndev);
-               goto err_release;
+       if (dev != blackhole_netdev) {
+               err = addrconf_sysctl_register(ndev);
+               if (err) {
+                       ipv6_mc_destroy_dev(ndev);
+                       snmp6_unregister_dev(ndev);
+                       goto err_release;
+               }
        }
        /* protected by rtnl_lock */
        rcu_assign_pointer(dev->ip6_ptr, ndev);

-       /* Join interface-local all-node multicast group */
-       ipv6_dev_mc_inc(dev, &in6addr_interfacelocal_allnodes);
+       if (dev != blackhole_netdev) {
+               /* Join interface-local all-node multicast group */
+               ipv6_dev_mc_inc(dev, &in6addr_interfacelocal_allnodes);

-       /* Join all-node multicast group */
-       ipv6_dev_mc_inc(dev, &in6addr_linklocal_allnodes);
-
-       /* Join all-router multicast group if forwarding is set */
-       if (ndev->cnf.forwarding && (dev->flags & IFF_MULTICAST))
-               ipv6_dev_mc_inc(dev, &in6addr_linklocal_allrouters);
+               /* Join all-node multicast group */
+               ipv6_dev_mc_inc(dev, &in6addr_linklocal_allnodes);

+               /* Join all-router multicast group if forwarding is set */
+               if (ndev->cnf.forwarding && (dev->flags & IFF_MULTICAST))
+                       ipv6_dev_mc_inc(dev, &in6addr_linklocal_allrouters);
+       }
        return ndev;

 err_release:
@@ -7233,26 +7237,8 @@ int __init addrconf_init(void)
                goto out_nowq;
        }

-       /* The addrconf netdev notifier requires that loopback_dev
-        * has it's ipv6 private information allocated and setup
-        * before it can bring up and give link-local addresses
-        * to other devices which are up.
-        *
-        * Unfortunately, loopback_dev is not necessarily the first
-        * entry in the global dev_base list of net devices.  In fact,
-        * it is likely to be the very last entry on that list.
-        * So this causes the notifier registry below to try and
-        * give link-local addresses to all devices besides loopback_dev
-        * first, then loopback_dev, which cases all the non-loopback_dev
-        * devices to fail to get a link-local address.
-        *
-        * So, as a temporary fix, allocate the ipv6 structure for
-        * loopback_dev first by hand.
-        * Longer term, all of the dependencies ipv6 has upon the loopback
-        * device and it being up should be removed.
-        */
        rtnl_lock();
-       idev = ipv6_add_dev(init_net.loopback_dev);
+       idev = ipv6_add_dev(blackhole_netdev);
        rtnl_unlock();
        if (IS_ERR(idev)) {
                err = PTR_ERR(idev);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4884cda13b92e72d041680cbabfee2e07ec0f10..dc9d26a77c48f649ec39084c539d45b378474a35
100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -160,12 +160,8 @@ void rt6_uncached_list_del(struct rt6_info *rt)

 static void rt6_uncached_list_flush_dev(struct net *net, struct
net_device *dev)
 {
-       struct net_device *loopback_dev = net->loopback_dev;
        int cpu;

-       if (dev == loopback_dev)
-               return;
-
        for_each_possible_cpu(cpu) {
                struct uncached_list *ul = per_cpu_ptr(&rt6_uncached_list, cpu);
                struct rt6_info *rt;
@@ -176,7 +172,7 @@ static void rt6_uncached_list_flush_dev(struct net
*net, struct net_device *dev)
                        struct net_device *rt_dev = rt->dst.dev;

                        if (rt_idev->dev == dev) {
-                               rt->rt6i_idev = in6_dev_get(loopback_dev);
+                               rt->rt6i_idev = in6_dev_get(blackhole_netdev);
                                in6_dev_put(rt_idev);
                        }

@@ -373,13 +369,12 @@ static void ip6_dst_ifdown(struct dst_entry
*dst, struct net_device *dev,
 {
        struct rt6_info *rt = (struct rt6_info *)dst;
        struct inet6_dev *idev = rt->rt6i_idev;
-       struct net_device *loopback_dev =
-               dev_net(dev)->loopback_dev;

-       if (idev && idev->dev != loopback_dev) {
-               struct inet6_dev *loopback_idev = in6_dev_get(loopback_dev);
-               if (loopback_idev) {
-                       rt->rt6i_idev = loopback_idev;
+       if (idev && idev->dev != blackhole_netdev) {
+               struct inet6_dev *blackhole_idev =
in6_dev_get(blackhole_netdev);
+
+               if (blackhole_idev) {
+                       rt->rt6i_idev = blackhole_idev;
                        in6_dev_put(idev);
                }
        }
Xin Long Feb. 11, 2022, 7:58 a.m. UTC | #7
On Thu, Feb 10, 2022 at 1:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 9 Feb 2022 21:49:51 -0800 Eric Dumazet wrote:
> > > Feels like sooner or later we'll run into a scenario when reversing will
> > > cause a problem. Or some data structure will stop preserving the order.
> > >
> > > Do you reckon rewriting netdev_run_todo() will be a lot of effort or
> > > it's too risky?
> >
> > This is doable, and risky ;)
> >
> > BTW, I have the plan of generalizing blackhole_netdev for IPv6,
> > meaning that we could perhaps get rid of the dependency
> > about loopback dev, being the last device in a netns being dismantled.
>
> Oh, I see..
>
> I have no great ideas then, we may need to go back to zeroing
> vlan->real_dev and making sure the caller can deal with that.
> At least for the time being. Xin this was discussed at some
> length in response to the patch under Fixes.
Hi, Jakub,

What if dev->real_dev is freed and zeroed *after* vlan_dev_real_dev()
is called? This issue can still be triggered, right? I don't see any lock
protecting this.

> Feels like sooner or later we'll run into a scenario when reversing will
> cause a problem. Or some data structure will stop preserving the order.
I was checking a few places doing such batch devices freeing, and noticed that:
In rtnl_group_dellink() and __rtnl_kill_links(), it's using for_each_netdev(),
while in default_device_exit_batch(), it's using for_each_netdev_reverse().

shouldn't be in the same order all these places? If yes, which one is the
right one to use?

Thanks.
Jakub Kicinski Feb. 11, 2022, 10:06 p.m. UTC | #8
On Fri, 11 Feb 2022 15:58:56 +0800 Xin Long wrote:
> On Thu, Feb 10, 2022 at 1:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > This is doable, and risky ;)
> > >
> > > BTW, I have the plan of generalizing blackhole_netdev for IPv6,
> > > meaning that we could perhaps get rid of the dependency
> > > about loopback dev, being the last device in a netns being dismantled.  
> >
> > Oh, I see..
> >
> > I have no great ideas then, we may need to go back to zeroing
> > vlan->real_dev and making sure the caller can deal with that.
> > At least for the time being. Xin this was discussed at some
> > length in response to the patch under Fixes.  
> 
> What if dev->real_dev is freed and zeroed *after* vlan_dev_real_dev()
> is called? This issue can still be triggered, right? I don't see any lock
> protecting this.

Maybe the suggestion in the old thread was to NULL the pointer out
before unregister is called. Which seems like a bad idea, as the 
netdev would already be impaired when unregister is called.

> > Feels like sooner or later we'll run into a scenario when reversing will
> > cause a problem. Or some data structure will stop preserving the order.  
> I was checking a few places doing such batch devices freeing, and noticed that:
> In rtnl_group_dellink() and __rtnl_kill_links(), it's using for_each_netdev(),
> while in default_device_exit_batch(), it's using for_each_netdev_reverse().
> 
> shouldn't be in the same order all these places? If yes, which one is the
> right one to use?

I don't know. Maybe this will work maybe it will cause a circular
dependency with something else.

Honestly, I don't have a simple solutions to offer. Jann Horn pointed
out recently that our per-CPU netdev refs are themselves racy...
diff mbox series

Patch

diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index e5d23e75572a..d1902828a18a 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -638,7 +638,12 @@  void vlan_dev_free_egress_priority(const struct net_device *dev)
 
 static void vlan_dev_uninit(struct net_device *dev)
 {
+	struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
+
 	vlan_dev_free_egress_priority(dev);
+
+	/* Get rid of the vlan's reference to real_dev */
+	dev_put_track(vlan->real_dev, &vlan->dev_tracker);
 }
 
 static netdev_features_t vlan_dev_fix_features(struct net_device *dev,
@@ -851,9 +856,6 @@  static void vlan_dev_free(struct net_device *dev)
 
 	free_percpu(vlan->vlan_pcpu_stats);
 	vlan->vlan_pcpu_stats = NULL;
-
-	/* Get rid of the vlan's reference to real_dev */
-	dev_put_track(vlan->real_dev, &vlan->dev_tracker);
 }
 
 void vlan_setup(struct net_device *dev)