net: linkwatch: ignore events for unregistered netdevs

Message ID	18b3541e5372bc9b9fc733d422f4e698c089077c.1650177997.git.lukas@wunner.de (mailing list archive)
State	Superseded
Delegated to:	Netdev Maintainers
Headers	show Return-Path: <netdev-owner@kernel.org> Message-Id: <18b3541e5372bc9b9fc733d422f4e698c089077c.1650177997.git.lukas@wunner.de> From: Lukas Wunner <lukas@wunner.de> Date: Sun, 17 Apr 2022 09:04:19 +0200 Subject: [PATCH] net: linkwatch: ignore events for unregistered netdevs To: Oliver Neukum <oneukum@suse.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, Jann Horn <jannh@google.com>, Oleksij Rempel <o.rempel@pengutronix.de>, Eric Dumazet <edumazet@google.com> Cc: netdev@vger.kernel.org, linux-usb@vger.kernel.org, Andrew Lunn <andrew@lunn.ch>, Jacky Chou <jackychou@asix.com.tw>, Willy Tarreau <w@1wt.eu>, Lino Sanfilippo <LinoSanfilippo@gmx.de>, Philipp Rosenberger <p.rosenberger@kunbus.com>, Heiner Kallweit <hkallweit1@gmail.com>, Greg Kroah-Hartman <gregkh@linuxfoundation.org> Precedence: bulk
Series	net: linkwatch: ignore events for unregistered netdevs \| expand net: linkwatch: ignore events for unregistered netdevs

Context	Check	Description
netdev/tree_selection	success	Guessing tree name failed - patch did not apply, async

Lukas Wunner April 17, 2022, 7:04 a.m. UTC

Jann Horn reports a use-after-free on disconnect of a USB Ethernet
(ax88179_178a.c).  Oleksij Rempel has witnessed the same issue with a
different driver (ax88172a.c).

Jann's report (linked below) explains the root cause in great detail,
but the gist is that USB Ethernet drivers call linkwatch_fire_event()
between unregister_netdev() and free_netdev().  The asynchronous work
linkwatch_event() may thus access the netdev after it's been freed.

USB Ethernet may not even be the only culprit.  To address the problem
in the most general way, ignore link events once a netdev's state has
been set to NETREG_UNREGISTERED.

That happens in netdev_run_todo() immediately before the call to
linkwatch_forget_dev().  Note that lweventlist_lock (and its implied
memory barrier) guarantees that a linkwatch_add_event() running after
linkwatch_forget_dev() will see the netdev's new state and bail out.
An unregistered netdev is therefore never added to link_watch_list
(but may have its __LINK_STATE_LINKWATCH_PENDING bit set, which should
not matter).  That obviates the need to invoke linkwatch_run_queue() in
netdev_wait_allrefs(), so drop it.

In a sense, the present commit is to *no longer* registered netdevs as
commit b47300168e77 ("net: Do not fire linkwatch events until the device
is registered.") is to *not yet* registered netdevs.

Reported-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/netdev/CAG48ez0MHBbENX5gCdHAUXZ7h7s20LnepBF-pa5M=7Bi-jZrEA@mail.gmail.com/
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/netdev/20220315113841.GA22337@pengutronix.de/
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org
Cc: Eric Dumazet <edumazet@google.com>
Cc: Oliver Neukum <oneukum@suse.com>
Cc: Andrew Lunn <andrew@lunn.ch>
---
 include/linux/netdevice.h |  2 --
 net/core/dev.c            | 17 -----------------
 net/core/link_watch.c     | 10 ++--------
 3 files changed, 2 insertions(+), 27 deletions(-)

Paolo Abeni April 21, 2022, 8:02 a.m. UTC | #1

On Sun, 2022-04-17 at 09:04 +0200, Lukas Wunner wrote:
> Jann Horn reports a use-after-free on disconnect of a USB Ethernet
> (ax88179_178a.c).  Oleksij Rempel has witnessed the same issue with a
> different driver (ax88172a.c).
> 
> Jann's report (linked below) explains the root cause in great detail,
> but the gist is that USB Ethernet drivers call linkwatch_fire_event()
> between unregister_netdev() and free_netdev().  The asynchronous work
> linkwatch_event() may thus access the netdev after it's been freed.
> 
> USB Ethernet may not even be the only culprit.  To address the problem
> in the most general way, ignore link events once a netdev's state has
> been set to NETREG_UNREGISTERED.
> 
> That happens in netdev_run_todo() immediately before the call to
> linkwatch_forget_dev().  Note that lweventlist_lock (and its implied
> memory barrier) guarantees that a linkwatch_add_event() running after
> linkwatch_forget_dev() will see the netdev's new state and bail out.
> An unregistered netdev is therefore never added to link_watch_list
> (but may have its __LINK_STATE_LINKWATCH_PENDING bit set, which should
> not matter).  That obviates the need to invoke linkwatch_run_queue() in
> netdev_wait_allrefs(), so drop it.
> 
> In a sense, the present commit is to *no longer* registered netdevs as
> commit b47300168e77 ("net: Do not fire linkwatch events until the device
> is registered.") is to *not yet* registered netdevs.
> 
> Reported-by: Jann Horn <jannh@google.com>
> Link: https://lore.kernel.org/netdev/CAG48ez0MHBbENX5gCdHAUXZ7h7s20LnepBF-pa5M=7Bi-jZrEA@mail.gmail.com/
> Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
> Link: https://lore.kernel.org/netdev/20220315113841.GA22337@pengutronix.de/
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> Cc: stable@vger.kernel.org
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Oliver Neukum <oneukum@suse.com>
> Cc: Andrew Lunn <andrew@lunn.ch>
> ---
>  include/linux/netdevice.h |  2 --
>  net/core/dev.c            | 17 -----------------
>  net/core/link_watch.c     | 10 ++--------
>  3 files changed, 2 insertions(+), 27 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 59e27a2b7bf0..5d950b45b59d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -4805,8 +4805,6 @@ extern const struct kobj_ns_type_operations net_ns_type_operations;
>  
>  const char *netdev_drivername(const struct net_device *dev);
>  
> -void linkwatch_run_queue(void);
> -
>  static inline netdev_features_t netdev_intersect_features(netdev_features_t f1,
>  							  netdev_features_t f2)
>  {
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8c6c08446556..0ee56965ff76 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -10140,23 +10140,6 @@ static struct net_device *netdev_wait_allrefs_any(struct list_head *list)
>  			list_for_each_entry(dev, list, todo_list)
>  				call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
>  
> -			__rtnl_unlock();
> -			rcu_barrier();
> -			rtnl_lock();
> -
> -			list_for_each_entry(dev, list, todo_list)
> -				if (test_bit(__LINK_STATE_LINKWATCH_PENDING,
> -					     &dev->state)) {
> -					/* We must not have linkwatch events
> -					 * pending on unregister. If this
> -					 * happens, we simply run the queue
> -					 * unscheduled, resulting in a noop
> -					 * for this device.
> -					 */
> -					linkwatch_run_queue();
> -					break;
> -				}
> -
>  			__rtnl_unlock();
>  
>  			rebroadcast_time = jiffies;
> diff --git a/net/core/link_watch.c b/net/core/link_watch.c
> index 95098d1a49bd..9a0ea7cd68e4 100644
> --- a/net/core/link_watch.c
> +++ b/net/core/link_watch.c
> @@ -107,7 +107,8 @@ static void linkwatch_add_event(struct net_device *dev)
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&lweventlist_lock, flags);
> -	if (list_empty(&dev->link_watch_list)) {
> +	if (list_empty(&dev->link_watch_list) &&
> +	    dev->reg_state < NETREG_UNREGISTERED) {
>  		list_add_tail(&dev->link_watch_list, &lweventlist);
>  		dev_hold_track(dev, &dev->linkwatch_dev_tracker, GFP_ATOMIC);
>  	

What about testing dev->reg_state in linkwatch_fire_event() before
setting the __LINK_STATE_LINKWATCH_PENDING bit, so that we don't leave
the device in an unexpected state?

Other than that, it looks good to me, but potentially quite risky. 

Looking at the original report it looks like the issue could be
resolved with a more usb-specific change: e.g. it looks like
usbnet_defer_kevent() is not acquiring a dev reference as it should.

Have you considered that path?

Thanks,

Paolo

Lukas Wunner April 23, 2022, 4:07 p.m. UTC | #2

On Thu, Apr 21, 2022 at 10:02:43AM +0200, Paolo Abeni wrote:
> On Sun, 2022-04-17 at 09:04 +0200, Lukas Wunner wrote:
> > --- a/net/core/link_watch.c
> > +++ b/net/core/link_watch.c
> > @@ -107,7 +107,8 @@ static void linkwatch_add_event(struct net_device *dev)
> >  	unsigned long flags;
> >  
> >  	spin_lock_irqsave(&lweventlist_lock, flags);
> > -	if (list_empty(&dev->link_watch_list)) {
> > +	if (list_empty(&dev->link_watch_list) &&
> > +	    dev->reg_state < NETREG_UNREGISTERED) {
> >  		list_add_tail(&dev->link_watch_list, &lweventlist);
> >  		dev_hold_track(dev, &dev->linkwatch_dev_tracker, GFP_ATOMIC);
> >  	
> 
> What about testing dev->reg_state in linkwatch_fire_event() before
> setting the __LINK_STATE_LINKWATCH_PENDING bit, so that we don't leave
> the device in an unexpected state?

That would be racy because linkwatch_fire_event() may see a reg_state of
REGISTERED or UNREGISTERING and then add the device to link_watch_list,
even though reg_state may be changed to UNREGISTERED in-between.

That race is avoided by performing the reg_state check under
lweventlist_lock:

Scenario 1:

CPU 1:                                  CPU 2:
                                        linkwatch_add_event(dev);
dev->reg_state = NETREG_UNREGISTERED;
linkwatch_forget_dev(dev);

In this scenario, CPU 2 sees the old value of dev->reg_state and
adds the device to link_watch_list, but CPU 1 will subsequently
delete it from the list.

Scenario 2:

CPU 1:                                  CPU 2:
dev->reg_state = NETREG_UNREGISTERED;
linkwatch_forget_dev(dev);
                                        linkwatch_add_event(dev);

In this scenario, CPU 2 refrains from adding the device to
link_watch_list.  It is guaranteed to see the new reg_state
due to the memory barriers implied by lweventlist_lock,
which is taken both by linkwatch_forget_dev() and
linkwatch_add_event().

Note that an unregistered netdev has been stopped, so the portion
of linkwatch_do_dev() which is constrained to the netdev being IFF_UP
is skipped.  The only portion that's executed is rfc2863_policy(),
which updates the operstate.

I believe that operstate changes are irrelevant and unnecessary after
the netdev has been unregistered.

Same for the fact that __LINK_STATE_LINKWATCH_PENDING may be set even
though the netdev is not on link_watch_list.  That should be irrelevant
for an unregistered netdev.

> Other than that, it looks good to me, but potentially quite risky.

To mitigate risk I suggest letting the patch bake in linux-next
for a couple of weeks.

However I would then have to respin it because the declaration of
linkwatch_run_queue() was moved from include/linux/netdevice.h to
net/core/dev.h by Jakub's net-next commit 6264f58ca0e5 ("net:
extract a few internals from netdevice.h").

Let me know if you want me to respin the patch based on net-next.

> Looking at the original report it looks like the issue could be
> resolved with a more usb-specific change: e.g. it looks like
> usbnet_defer_kevent() is not acquiring a dev reference as it should.
> 
> Have you considered that path?

First of all, the diffstat of the patch shows this is an opportunity
to reduce LoC as well as simplify and speed up device teardown.

Second, the approach you're proposing won't work if a driver calls
netif_carrier_on/off() after unregister_netdev().

It seems prudent to prevent such a misbehavior in *any* driver,
not just usbnet.  usbnet may not be the only one doing it wrong.
Jann pointed out that there are more syzbot reports related
to a UAF in linkwatch:

https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot

Third, I think an API which schedules work, invisibly to the driver,
is dangerous and misguided.  If it is illegal to call
netif_carrier_on/off() for an unregistered but not yet freed netdev,
catch that in core networking code and don't expect drivers to respect
a rule which isn't even documented.

Thanks,

Lukas

Lukas Wunner April 23, 2022, 7:35 p.m. UTC | #3

On Sat, Apr 23, 2022 at 06:07:23PM +0200, Lukas Wunner wrote:
> On Thu, Apr 21, 2022 at 10:02:43AM +0200, Paolo Abeni wrote:
> > On Sun, 2022-04-17 at 09:04 +0200, Lukas Wunner wrote:
> > > --- a/net/core/link_watch.c
> > > +++ b/net/core/link_watch.c
> > > @@ -107,7 +107,8 @@ static void linkwatch_add_event(struct net_device *dev)
> > >  	unsigned long flags;
> > >  
> > >  	spin_lock_irqsave(&lweventlist_lock, flags);
> > > -	if (list_empty(&dev->link_watch_list)) {
> > > +	if (list_empty(&dev->link_watch_list) &&
> > > +	    dev->reg_state < NETREG_UNREGISTERED) {
> > >  		list_add_tail(&dev->link_watch_list, &lweventlist);
> > >  		dev_hold_track(dev, &dev->linkwatch_dev_tracker, GFP_ATOMIC);
> > >  	
> > 
> > What about testing dev->reg_state in linkwatch_fire_event() before
> > setting the __LINK_STATE_LINKWATCH_PENDING bit, so that we don't leave
> > the device in an unexpected state?

About __LINK_STATE_LINKWATCH_PENDING being set even though the netdev
is not on link_watch_list:

After this patch (which removes one user of __LINK_STATE_LINKWATCH_PENDING)
the only purpose of the flag is a small speed-up of linkwatch_fire_event():
If the netdev is already on link_watch_list, the function skips acquiring
lweventlist_lock.

I don't think this is a hotpath, so the small speed-up is probably not worth
it and the flag could be removed completely in a follow-up patch.

There is a single other (somewhat oddball) user of the flag in
bond_should_notify_peers() in drivers/net/bonding/bond_main.c.
It would be possible to replace it with "!list_empty(&dev->link_watch_list)".
I don't think acquiring lweventlist_lock is necessary for that because
test_bit() is unordered (per Documentation/atomic_bitops.txt) and the
check is racy anyway.

Thanks,

Lukas

Jakub Kicinski April 25, 2022, 2:41 p.m. UTC | #4

On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > Looking at the original report it looks like the issue could be
> > resolved with a more usb-specific change: e.g. it looks like
> > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > 
> > Have you considered that path?  
> 
> First of all, the diffstat of the patch shows this is an opportunity
> to reduce LoC as well as simplify and speed up device teardown.
> 
> Second, the approach you're proposing won't work if a driver calls
> netif_carrier_on/off() after unregister_netdev().
> 
> It seems prudent to prevent such a misbehavior in *any* driver,
> not just usbnet.  usbnet may not be the only one doing it wrong.
> Jann pointed out that there are more syzbot reports related
> to a UAF in linkwatch:
> 
> https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> 
> Third, I think an API which schedules work, invisibly to the driver,
> is dangerous and misguided.  If it is illegal to call
> netif_carrier_on/off() for an unregistered but not yet freed netdev,
> catch that in core networking code and don't expect drivers to respect
> a rule which isn't even documented.

Doesn't mean we should make it legal. We can add a warning to catch
abuses.

Jann Horn April 25, 2022, 2:49 p.m. UTC | #5

On Mon, Apr 25, 2022 at 4:41 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > > Looking at the original report it looks like the issue could be
> > > resolved with a more usb-specific change: e.g. it looks like
> > > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > >
> > > Have you considered that path?
> >
> > First of all, the diffstat of the patch shows this is an opportunity
> > to reduce LoC as well as simplify and speed up device teardown.
> >
> > Second, the approach you're proposing won't work if a driver calls
> > netif_carrier_on/off() after unregister_netdev().
> >
> > It seems prudent to prevent such a misbehavior in *any* driver,
> > not just usbnet.  usbnet may not be the only one doing it wrong.
> > Jann pointed out that there are more syzbot reports related
> > to a UAF in linkwatch:
> >
> > https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> >
> > Third, I think an API which schedules work, invisibly to the driver,
> > is dangerous and misguided.  If it is illegal to call
> > netif_carrier_on/off() for an unregistered but not yet freed netdev,
> > catch that in core networking code and don't expect drivers to respect
> > a rule which isn't even documented.
>
> Doesn't mean we should make it legal. We can add a warning to catch
> abuses.

That was the idea with
https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
but I didn't get any replies when I asked what the precise semantics
of dev_hold() are supposed to be
(https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
so I don't know how to proceed...

Jakub Kicinski April 25, 2022, 3 p.m. UTC | #6

On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > Doesn't mean we should make it legal. We can add a warning to catch
> > abuses.  
> 
> That was the idea with
> https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> but I didn't get any replies when I asked what the precise semantics
> of dev_hold() are supposed to be
> (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> so I don't know how to proceed...

Yeah, I think after you pointed out that the netdev per cpu refcounting
is fundamentally broken everybody decided to hit themselves with the
obliviate spell :S

Eric Dumazet April 25, 2022, 3:13 p.m. UTC | #7

On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > Doesn't mean we should make it legal. We can add a warning to catch
> > > abuses.
> >
> > That was the idea with
> > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > but I didn't get any replies when I asked what the precise semantics
> > of dev_hold() are supposed to be
> > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > so I don't know how to proceed...
>
> Yeah, I think after you pointed out that the netdev per cpu refcounting
> is fundamentally broken everybody decided to hit themselves with the
> obliviate spell :S

dev_hold() has been an increment of a refcount, and dev_put() a decrement.

Not sure why it is fundamentally broken.

There are specific steps at device dismantles making sure no more
users can dev_hold()

It is a contract. Any buggy layer can overwrite any piece of memory,
including a refcount_t.

Traditionally we could not add a test in dev_hold() to prevent an
increment if the device is in dismantle phase.
Maybe the situation is better nowadays.

Jann Horn April 25, 2022, 3:18 p.m. UTC | #8

On Mon, Apr 25, 2022 at 5:13 PM Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > > Doesn't mean we should make it legal. We can add a warning to catch
> > > > abuses.
> > >
> > > That was the idea with
> > > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > > but I didn't get any replies when I asked what the precise semantics
> > > of dev_hold() are supposed to be
> > > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > > so I don't know how to proceed...
> >
> > Yeah, I think after you pointed out that the netdev per cpu refcounting
> > is fundamentally broken everybody decided to hit themselves with the
> > obliviate spell :S
>
> dev_hold() has been an increment of a refcount, and dev_put() a decrement.
>
> Not sure why it is fundamentally broken.

Well, it's not quite a refcount. It's a count that can be incremented
and decremented but can't be read while the device is alive, and then
at some point it turns into a count that can be read and decremented
but can't be incremented (see
https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/).
Normal refcounts allow anyone who is holding a reference to add
another reference.

> There are specific steps at device dismantles making sure no more
> users can dev_hold()

So you're saying it's intentional that even if you're already holding
a dev_hold() reference, you may not be allowed to call dev_hold()
again?

Eric Dumazet April 25, 2022, 3:23 p.m. UTC | #9

On Mon, Apr 25, 2022 at 8:19 AM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Apr 25, 2022 at 5:13 PM Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Apr 25, 2022 at 8:01 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > On Mon, 25 Apr 2022 16:49:34 +0200 Jann Horn wrote:
> > > > > Doesn't mean we should make it legal. We can add a warning to catch
> > > > > abuses.
> > > >
> > > > That was the idea with
> > > > https://lore.kernel.org/netdev/20220128014303.2334568-1-jannh@google.com/,
> > > > but I didn't get any replies when I asked what the precise semantics
> > > > of dev_hold() are supposed to be
> > > > (https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/),
> > > > so I don't know how to proceed...
> > >
> > > Yeah, I think after you pointed out that the netdev per cpu refcounting
> > > is fundamentally broken everybody decided to hit themselves with the
> > > obliviate spell :S
> >
> > dev_hold() has been an increment of a refcount, and dev_put() a decrement.
> >
> > Not sure why it is fundamentally broken.
>
> Well, it's not quite a refcount. It's a count that can be incremented
> and decremented but can't be read while the device is alive, and then
> at some point it turns into a count that can be read and decremented
> but can't be incremented (see
> https://lore.kernel.org/netdev/CAG48ez1-OyZETvrYAfaHicYW1LbrQUVp=C0EukSWqZrYMej73w@mail.gmail.com/).
> Normal refcounts allow anyone who is holding a reference to add
> another reference.

On a live netdev nothing wants to read the 'current refcount'.
We basically do not care.

>
> > There are specific steps at device dismantles making sure no more
> > users can dev_hold()
>
> So you're saying it's intentional that even if you're already holding
> a dev_hold() reference, you may not be allowed to call dev_hold()
> again?

I think you can/should not.
We might add a test in dev_hold() and catch offenders.

Then add a new api, (dev_hold() is void and can not propagate an
error), and eventually
fix offenders.

Jakub Kicinski April 25, 2022, 3:28 p.m. UTC | #10

On Mon, 25 Apr 2022 08:13:40 -0700 Eric Dumazet wrote:
> dev_hold() has been an increment of a refcount, and dev_put() a decrement.
> 
> Not sure why it is fundamentally broken.

Jann described a case where someone does

    CPU 0      CPU 1     CPU 2

  dev_hold()
   ------  #unregister -------
             dev_hold()
                         dev_put()

Our check for refcount == 0 goes over the CPUs one by one,
so if it sums up CPUs 0 and 1 at the "unregister" point above
and CPU2 after the CPU1 hold and CPU2 release it will "miss"
one refcount.

That's a problem unless doing a dev_hold() on a netdev we only have 
a reference on is illegal.

> There are specific steps at device dismantles making sure no more
> users can dev_hold()
> 
> It is a contract. Any buggy layer can overwrite any piece of memory,
> including a refcount_t.
> 
> Traditionally we could not add a test in dev_hold() to prevent an
> increment if the device is in dismantle phase.
> Maybe the situation is better nowadays.

Eric Dumazet April 25, 2022, 3:31 p.m. UTC | #11

On Mon, Apr 25, 2022 at 8:28 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 25 Apr 2022 08:13:40 -0700 Eric Dumazet wrote:
> > dev_hold() has been an increment of a refcount, and dev_put() a decrement.
> >
> > Not sure why it is fundamentally broken.
>
> Jann described a case where someone does
>
>     CPU 0      CPU 1     CPU 2
>
>   dev_hold()
>    ------  #unregister -------
>              dev_hold()
>                          dev_put()
>
> Our check for refcount == 0 goes over the CPUs one by one,
> so if it sums up CPUs 0 and 1 at the "unregister" point above
> and CPU2 after the CPU1 hold and CPU2 release it will "miss"
> one refcount.
>
> That's a problem unless doing a dev_hold() on a netdev we only have
> a reference on is illegal.

What is 'illegal' is trying to keep using the device after #unregister.

We have barriers to prevent that.

Somehow a layer does not care about the barriers and pretends the
device is still good to use.

It is of course perfectly fine to stack multiple dev_hold() from one
path (if these do not leak, but this is a different issue)

>
> > There are specific steps at device dismantles making sure no more
> > users can dev_hold()
> >
> > It is a contract. Any buggy layer can overwrite any piece of memory,
> > including a refcount_t.
> >
> > Traditionally we could not add a test in dev_hold() to prevent an
> > increment if the device is in dismantle phase.
> > Maybe the situation is better nowadays.


>

Jakub Kicinski April 25, 2022, 3:36 p.m. UTC | #12

On Mon, 25 Apr 2022 08:31:23 -0700 Eric Dumazet wrote:
> > Jann described a case where someone does
> >
> >     CPU 0      CPU 1     CPU 2
> >
> >   dev_hold()
> >    ------  #unregister -------
> >              dev_hold()
> >                          dev_put()
> >
> > Our check for refcount == 0 goes over the CPUs one by one,
> > so if it sums up CPUs 0 and 1 at the "unregister" point above
> > and CPU2 after the CPU1 hold and CPU2 release it will "miss"
> > one refcount.
> >
> > That's a problem unless doing a dev_hold() on a netdev we only have
> > a reference on is illegal.  
> 
> What is 'illegal' is trying to keep using the device after #unregister.
> 
> We have barriers to prevent that.
> 
> Somehow a layer does not care about the barriers and pretends the
> device is still good to use.
> 
> It is of course perfectly fine to stack multiple dev_hold() from one
> path (if these do not leak, but this is a different issue)

So we'd need something like

WARN_ON(dev->reg_state != NETREG_REGISTERED && !rtnl_held())

in dev_hold()?

Lukas Wunner April 25, 2022, 5:20 p.m. UTC | #13

On Mon, Apr 25, 2022 at 05:18:51PM +0200, Jann Horn wrote:
> Well, it's not quite a refcount. It's a count that can be incremented
> and decremented but can't be read while the device is alive, and then
> at some point it turns into a count that can be read and decremented
> but can't be incremented

Pardon me for being dense, but most other subsystems use the refcounting
built into struct device (or rather, its kobject) and tear it down
when the refcount reaches zero (e.g. pci_release_dev(), spidev_release()).

What's the rationale for struct net_device rolling its own refcounting?
Historic artifact?

I think a lot of these issues would solve themselves if that was done away
with and replaced with the generic kobject refcounting.  It's a pity that
the tracking infrastructure is now netdev-specific and other subsystems
cannot benefit from it.

Thanks,

Lukas

Eric Dumazet April 25, 2022, 5:24 p.m. UTC | #14

On Mon, Apr 25, 2022 at 10:20 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Mon, Apr 25, 2022 at 05:18:51PM +0200, Jann Horn wrote:
> > Well, it's not quite a refcount. It's a count that can be incremented
> > and decremented but can't be read while the device is alive, and then
> > at some point it turns into a count that can be read and decremented
> > but can't be incremented
>
> Pardon me for being dense, but most other subsystems use the refcounting
> built into struct device (or rather, its kobject) and tear it down
> when the refcount reaches zero (e.g. pci_release_dev(), spidev_release()).
>
> What's the rationale for struct net_device rolling its own refcounting?
> Historic artifact?


Yes, probably. This was there way before new fancy mechanisms were invented.

>
>
> I think a lot of these issues would solve themselves if that was done away
> with and replaced with the generic kobject refcounting.  It's a pity that
> the tracking infrastructure is now netdev-specific and other subsystems
> cannot benefit from it.

Make sure that whatever replaces it, heavy dev_hold()/dev_put() users
do not come to a crawl.

af_packet is using this stuff.

Some users want to send millions of packets per second, without having
to bypass the kernel because it is suddenly too slow.


>
> Thanks,
>
> Lukas

Lukas Wunner April 25, 2022, 9:18 p.m. UTC | #15

On Mon, Apr 25, 2022 at 07:41:46AM -0700, Jakub Kicinski wrote:
> On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > > Looking at the original report it looks like the issue could be
> > > resolved with a more usb-specific change: e.g. it looks like
> > > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > > 
> > > Have you considered that path?  
> > 
> > First of all, the diffstat of the patch shows this is an opportunity
> > to reduce LoC as well as simplify and speed up device teardown.
> > 
> > Second, the approach you're proposing won't work if a driver calls
> > netif_carrier_on/off() after unregister_netdev().
> > 
> > It seems prudent to prevent such a misbehavior in *any* driver,
> > not just usbnet.  usbnet may not be the only one doing it wrong.
> > Jann pointed out that there are more syzbot reports related
> > to a UAF in linkwatch:
> > 
> > https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> > 
> > Third, I think an API which schedules work, invisibly to the driver,
> > is dangerous and misguided.  If it is illegal to call
> > netif_carrier_on/off() for an unregistered but not yet freed netdev,
> > catch that in core networking code and don't expect drivers to respect
> > a rule which isn't even documented.
> 
> Doesn't mean we should make it legal. We can add a warning to catch
> abuses.

That would be inconsequent, considering that netif_carrier_on/off()
do not warn for a reg_state of NETREG_UNINITIALIZED.

Thanks,

Lukas

Eric Dumazet April 25, 2022, 9:39 p.m. UTC | #16

On Mon, Apr 25, 2022 at 2:18 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Mon, Apr 25, 2022 at 07:41:46AM -0700, Jakub Kicinski wrote:
> > On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > > > Looking at the original report it looks like the issue could be
> > > > resolved with a more usb-specific change: e.g. it looks like
> > > > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > > >
> > > > Have you considered that path?
> > >
> > > First of all, the diffstat of the patch shows this is an opportunity
> > > to reduce LoC as well as simplify and speed up device teardown.
> > >
> > > Second, the approach you're proposing won't work if a driver calls
> > > netif_carrier_on/off() after unregister_netdev().
> > >
> > > It seems prudent to prevent such a misbehavior in *any* driver,
> > > not just usbnet.  usbnet may not be the only one doing it wrong.
> > > Jann pointed out that there are more syzbot reports related
> > > to a UAF in linkwatch:
> > >
> > > https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> > >
> > > Third, I think an API which schedules work, invisibly to the driver,
> > > is dangerous and misguided.  If it is illegal to call
> > > netif_carrier_on/off() for an unregistered but not yet freed netdev,
> > > catch that in core networking code and don't expect drivers to respect
> > > a rule which isn't even documented.
> >
> > Doesn't mean we should make it legal. We can add a warning to catch
> > abuses.
>
> That would be inconsequent, considering that netif_carrier_on/off()
> do not warn for a reg_state of NETREG_UNINITIALIZED.
>

Yes, only 1500 calls to audit ;)

I guess we could start adding WARN_ON_ONCE(), then wait for a few
syzbot/users reports to fix offenders...

commit b47300168e770b60ab96c8924854c3b0eb4260eb
Author: David S. Miller <davem@davemloft.net>
Date:   Wed Nov 19 15:33:54 2008 -0800

    net: Do not fire linkwatch events until the device is registered.

    Several device drivers try to do things like netif_carrier_off()
    before register_netdev() is invoked.  This is bogus, but too many
    drivers do this to fix them all up in one go.

    Reported-by: Folkert van Heusden <folkert@vanheusden.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

Lukas Wunner April 30, 2022, 10:05 a.m. UTC | #17

On Mon, Apr 25, 2022 at 07:41:46AM -0700, Jakub Kicinski wrote:
> On Sat, 23 Apr 2022 18:07:23 +0200 Lukas Wunner wrote:
> > > Looking at the original report it looks like the issue could be
> > > resolved with a more usb-specific change: e.g. it looks like
> > > usbnet_defer_kevent() is not acquiring a dev reference as it should.
> > > 
> > > Have you considered that path?
> > 
> > First of all, the diffstat of the patch shows this is an opportunity
> > to reduce LoC as well as simplify and speed up device teardown.
> > 
> > Second, the approach you're proposing won't work if a driver calls
> > netif_carrier_on/off() after unregister_netdev().
> > 
> > It seems prudent to prevent such a misbehavior in *any* driver,
> > not just usbnet.  usbnet may not be the only one doing it wrong.
> > Jann pointed out that there are more syzbot reports related
> > to a UAF in linkwatch:
> > 
> > https://lore.kernel.org/netdev/?q=__linkwatch_run_queue+syzbot
> > 
> > Third, I think an API which schedules work, invisibly to the driver,
> > is dangerous and misguided.  If it is illegal to call
> > netif_carrier_on/off() for an unregistered but not yet freed netdev,
> > catch that in core networking code and don't expect drivers to respect
> > a rule which isn't even documented.
> 
> Doesn't mean we should make it legal. We can add a warning to catch
> abuses.

It turns out that no, we *cannot* add a warning to catch abuses.

I've identified all the places in USB Ethernet drivers which are
susceptible to calling linkwatch_fire_event() after unregister_netdev(),
see patch below.

I'm fixing each one like this:

-		netif_carrier_on(dev->net);
+		dev_hold(dev->net);
+		if (dev->net->reg_state < NETREG_UNREGISTERED)
+			netif_carrier_on(dev->net);
+		dev_put(dev->net);

If this is called after unregister_netdev(), it becomes a no-op.

However if it is called concurrently to unregister_netdev(),
the reg_state may change to NETREG_UNREGISTERED after the if-clause
has been evaluated and before netif_carrier_on() is called.
Then a linkwatch event *will* be fired.  There won't be a use-after-free
because of the ref I'm acquiring here.  (unregister_netdev() will spin
in netdev_wait_allrefs_any() until the linkwatch event has been handled.)

But this means that we may still call linkwatch_fire_event() after
unregister_netdev()!  So we cannot emit a WARN splat and we cannot
catch use-after-frees outside of the USB Ethernet drivers I'm fixing
in the below patch.  It may thus very well happen that a use-after-free
may still occur for such other drivers and we cannot even WARN about it.

For this reason I would strongly prefer the $SUBJECT_PATCH ("net: linkwatch:
ignore events for unregistered netdevs") instead of the patch below.
I think you are wrong to stall the patch.  It avoids UAFs in *any*
driver, not just the USB Ethernet ones, it reduces LoC and speeds up
netdev unregistration.  What more do you want?

Thanks,

Lukas

-- >8 --

diff --git a/drivers/net/usb/aqc111.c b/drivers/net/usb/aqc111.c
index ea06d10..279a7ca 100644
--- a/drivers/net/usb/aqc111.c
+++ b/drivers/net/usb/aqc111.c
@@ -962,7 +962,11 @@ static int aqc111_link_reset(struct usbnet *dev)
 		aqc111_write16_cmd(dev, AQ_ACCESS_MAC, SFR_RX_CTL,
 				   2, &aqc111_data->rxctl);
 
-		netif_carrier_on(dev->net);
+		dev_hold(dev->net);
+		if (dev->net->reg_state < NETREG_UNREGISTERED)
+			netif_carrier_on(dev->net);
+		dev_put(dev->net);
+
 	} else {
 		aqc111_read16_cmd(dev, AQ_ACCESS_MAC, SFR_MEDIUM_STATUS_MODE,
 				  2, &reg16);
@@ -981,7 +985,10 @@ static int aqc111_link_reset(struct usbnet *dev)
 		aqc111_write_cmd(dev, AQ_ACCESS_MAC, SFR_BULK_OUT_CTRL,
 				 1, 1, &reg8);
 
-		netif_carrier_off(dev->net);
+		dev_hold(dev->net);
+		if (dev->net->reg_state < NETREG_UNREGISTERED)
+			netif_carrier_off(dev->net);
+		dev_put(dev->net);
 	}
 	return 0;
 }
diff --git a/drivers/net/usb/asix_devices.c b/drivers/net/usb/asix_devices.c
index 0872ca12..1e97c0a 100644
--- a/drivers/net/usb/asix_devices.c
+++ b/drivers/net/usb/asix_devices.c
@@ -173,7 +173,11 @@ static int ax88172_link_reset(struct usbnet *dev)
 	u8 mode;
 	struct ethtool_cmd ecmd = { .cmd = ETHTOOL_GSET };
 
-	mii_check_media(&dev->mii, 1, 1);
+	dev_hold(dev->net);
+	if (dev->net->reg_state < NETREG_UNREGISTERED)
+		mii_check_media(&dev->mii, 1, 1);
+	dev_put(dev->net);
+
 	mii_ethtool_gset(&dev->mii, &ecmd);
 	mode = AX88172_MEDIUM_DEFAULT;
 
@@ -1013,7 +1017,11 @@ static int ax88178_link_reset(struct usbnet *dev)
 
 	netdev_dbg(dev->net, "ax88178_link_reset()\n");
 
-	mii_check_media(&dev->mii, 1, 1);
+	dev_hold(dev->net);
+	if (dev->net->reg_state < NETREG_UNREGISTERED)
+		mii_check_media(&dev->mii, 1, 1);
+	dev_put(dev->net);
+
 	mii_ethtool_gset(&dev->mii, &ecmd);
 	mode = AX88178_MEDIUM_DEFAULT;
 	speed = ethtool_cmd_speed(&ecmd);
diff --git a/drivers/net/usb/ax88179_178a.c b/drivers/net/usb/ax88179_178a.c
index a310989..279ddf2 100644
--- a/drivers/net/usb/ax88179_178a.c
+++ b/drivers/net/usb/ax88179_178a.c
@@ -1632,7 +1632,10 @@ static int ax88179_link_reset(struct usbnet *dev)
 
 	ax179_data->eee_enabled = ax88179_chk_eee(dev);
 
-	netif_carrier_on(dev->net);
+	dev_hold(dev->net);
+	if (dev->net->reg_state < NETREG_UNREGISTERED)
+		netif_carrier_on(dev->net);
+	dev_put(dev->net);
 
 	return 0;
 }
diff --git a/drivers/net/usb/ch9200.c b/drivers/net/usb/ch9200.c
index f69d9b9..5c7904c 100644
--- a/drivers/net/usb/ch9200.c
+++ b/drivers/net/usb/ch9200.c
@@ -214,7 +214,11 @@ static int ch9200_link_reset(struct usbnet *dev)
 {
 	struct ethtool_cmd ecmd;
 
-	mii_check_media(&dev->mii, 1, 1);
+	dev_hold(dev->net);
+	if (dev->net->reg_state < NETREG_UNREGISTERED)
+		mii_check_media(&dev->mii, 1, 1);
+	dev_put(dev->net);
+
 	mii_ethtool_gset(&dev->mii, &ecmd);
 
 	netdev_dbg(dev->net, "%s() speed:%d duplex:%d\n",
diff --git a/drivers/net/usb/sierra_net.c b/drivers/net/usb/sierra_net.c
index bb4cbe8f..9ae9359 100644
--- a/drivers/net/usb/sierra_net.c
+++ b/drivers/net/usb/sierra_net.c
@@ -427,7 +427,11 @@ static void sierra_net_handle_lsi(struct usbnet *dev, char *data,
 	} else {
 		priv->link_up = 0;
 	}
-	usbnet_link_change(dev, link_up, 0);
+
+	dev_hold(dev->net);
+	if (dev->net->reg_state < NETREG_UNREGISTERED)
+		usbnet_link_change(dev, link_up, 0);
+	dev_put(dev->net);
 }
 
 static void sierra_net_dosync(struct usbnet *dev)
@@ -758,6 +762,8 @@ static void sierra_net_unbind(struct usbnet *dev, struct usb_interface *intf)
 
 	dev_dbg(&dev->udev->dev, "%s", __func__);
 
+	usbnet_status_stop(dev);
+
 	/* kill the timer and work */
 	del_timer_sync(&priv->sync_timer);
 	cancel_work_sync(&priv->sierra_net_kevent);
@@ -769,8 +775,6 @@ static void sierra_net_unbind(struct usbnet *dev, struct usb_interface *intf)
 		netdev_err(dev->net,
 			"usb_control_msg failed, status %d\n", status);
 
-	usbnet_status_stop(dev);
-
 	sierra_net_set_private(dev, NULL);
 	kfree(priv);
 }
diff --git a/drivers/net/usb/smsc75xx.c b/drivers/net/usb/smsc75xx.c
index 95de452..b7f608a 100644
--- a/drivers/net/usb/smsc75xx.c
+++ b/drivers/net/usb/smsc75xx.c
@@ -640,7 +640,11 @@ static int smsc75xx_link_reset(struct usbnet *dev)
 		return ret;
 	}
 
-	mii_check_media(mii, 1, 1);
+	dev_hold(dev->net);
+	if (dev->net->reg_state < NETREG_UNREGISTERED)
+		mii_check_media(mii, 1, 1);
+	dev_put(dev->net);
+
 	mii_ethtool_gset(&dev->mii, &ecmd);
 	lcladv = smsc75xx_mdio_read(dev->net, mii->phy_id, MII_ADVERTISE);
 	rmtadv = smsc75xx_mdio_read(dev->net, mii->phy_id, MII_LPA);

Lukas Wunner April 30, 2022, 10:09 a.m. UTC | #18

On Sat, Apr 30, 2022 at 12:05:41PM +0200, Lukas Wunner wrote:
> But this means that we may still call linkwatch_fire_event() after
> unregister_netdev()!

I meant to say, "we may still call linkwatch_fire_event() after
the state has changed to NETREG_UNREGISTERED".

net: linkwatch: ignore events for unregistered netdevs

Checks

Commit Message

Comments

Patch