diff mbox series

[RFC,net-next,8/9] net: delay device_del until run_todo

Message ID 20210928125500.167943-9-atenart@kernel.org (mailing list archive)
State RFC
Delegated to: Netdev Maintainers
Headers show
Series Userspace spinning on net-sysfs access | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net-next
netdev/subject_prefix success Link
netdev/cc_maintainers warning 10 maintainers not CCed: weiwan@google.com ap420073@gmail.com arnd@arndb.de bjorn@kernel.org yebin10@huawei.com alobakin@pm.me memxor@gmail.com daniel@iogearbox.net edumazet@google.com alexanderduyck@fb.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 6 this patch: 6
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning WARNING: Co-developed-by and Signed-off-by: name/email do not match
netdev/build_allmodconfig_warn success Errors and warnings before: 6 this patch: 6
netdev/header_inline success Link

Commit Message

Antoine Tenart Sept. 28, 2021, 12:54 p.m. UTC
Move the deletion of the device from unregister_netdevice_many to
netdev_run_todo and move it outside the rtnl lock.

12 years ago was reported an ABBA deadlock between net-sysfs and the
netdevice unregistration[1]. The issue was the following:

              A                            B

   unregister_netdevice_many         sysfs access
   rtnl_lock                         sysfs refcount
				     rtnl_lock
   drain sysfs files
   => waits for B                    => waits for A

This was avoided thanks to two patches[2][3], which used rtnl_trylock in
net-sysfs and restarted the syscall when the rtnl lock was already
taken. This way kernfs nodes were not blocking the netdevice
unregistration anymore.

This was fine at the time but is now causing some issues: creating and
moving interfaces makes userspace (systemd, NetworkManager or others) to
spin a lot as syscalls are restarted, which has an impact on
performance. This happens for example when creating pods. While
userspace applications could be improved, fixing this in-kernel has the
benefit of fixing the root cause of the issue.

The sysfs removal is done in device_del, and moving it outside of the
rtnl lock does fix the initial deadlock. With that the trylock/restart
logic can be removed in a following-up patch.

[1] https://lore.kernel.org/netdev/49A4D5D5.5090602@trash.net/
(I'm referencing the full thread but the sysfs issue was discussed later
in the thread).
[2] 336ca57c3b4e ("net-sysfs: Use rtnl_trylock in sysfs methods.")
[3] 5a5990d3090b ("net: Avoid race between network down and sysfs")

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Antoine Tenart <atenart@kernel.org>
---
 net/core/dev.c       | 2 ++
 net/core/net-sysfs.c | 2 --
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Jakub Kicinski Sept. 29, 2021, 12:02 a.m. UTC | #1
On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:
> The sysfs removal is done in device_del, and moving it outside of the
> rtnl lock does fix the initial deadlock. With that the trylock/restart
> logic can be removed in a following-up patch.

> diff --git a/net/core/dev.c b/net/core/dev.c
> index a1eab120bb50..d774fbec5d63 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
>  			continue;
>  		}
>  
> +		device_del(&dev->dev);
> +
>  		dev->reg_state = NETREG_UNREGISTERED;
>  
>  		netdev_wait_allrefs(dev);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 21c3fdeccf20..e754f00c117b 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
>  	remove_queue_kobjects(ndev);
>  
>  	pm_runtime_set_memalloc_noio(dev, false);
> -
> -	device_del(dev);
>  }
>  
>  /* Create sysfs entries for network device. */

Doesn't this mean there may be sysfs files which are accessible 
for an unregistered netdevice? Isn't the point of having device_del()
under rtnl_lock() to make sure we sysfs handlers can't run on dead
devices?
Antoine Tenart Sept. 29, 2021, 8:26 a.m. UTC | #2
Quoting Jakub Kicinski (2021-09-29 02:02:29)
> On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:
> > The sysfs removal is done in device_del, and moving it outside of the
> > rtnl lock does fix the initial deadlock. With that the trylock/restart
> > logic can be removed in a following-up patch.
> 
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index a1eab120bb50..d774fbec5d63 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
> >                       continue;
> >               }
> >  
> > +             device_del(&dev->dev);
> > +
> >               dev->reg_state = NETREG_UNREGISTERED;
> >  
> >               netdev_wait_allrefs(dev);
> > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > index 21c3fdeccf20..e754f00c117b 100644
> > --- a/net/core/net-sysfs.c
> > +++ b/net/core/net-sysfs.c
> > @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
> >       remove_queue_kobjects(ndev);
> >  
> >       pm_runtime_set_memalloc_noio(dev, false);
> > -
> > -     device_del(dev);
> >  }
> >  
> >  /* Create sysfs entries for network device. */
> 
> Doesn't this mean there may be sysfs files which are accessible 
> for an unregistered netdevice?

It would mean having accessible sysfs files for a device in the
NETREG_UNREGISTERING state; NETREG_UNREGISTERED still comes after
device_del. It's a small difference but still important, I think.

You raise a good point. Yes, that would mean accessing attributes of net
devices being unregistered, meaning accessing or modifying unused or
obsolete parameters and data (it shouldn't be garbage data though).
Unlisting those sysfs files without removing them would be better here,
to not expose files when the device is being unregistered while still
allowing pending operations to complete. I don't know if that is doable
in sysfs.

(While I did ran stress tests reading/writing attributes while
unregistering devices, I think I missed an issue with the
netdev_queue_default attributes; which hopefully can be fixed — if the
whole idea is deemed acceptable).

> Isn't the point of having device_del() under rtnl_lock() to make sure
> we sysfs handlers can't run on dead devices?

Hard to say what was the initial point, there is a lot of history here
:) I'm not sure it was done because of a particular reason; IMHO it just
made sense to make this simple without having a good reason not to do
so. And it helped with the naming collision detection.

Thanks!
Antoine
Jakub Kicinski Sept. 29, 2021, 1:31 p.m. UTC | #3
On Wed, 29 Sep 2021 10:26:35 +0200 Antoine Tenart wrote:
> Quoting Jakub Kicinski (2021-09-29 02:02:29)
> > On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:  
> > > The sysfs removal is done in device_del, and moving it outside of the
> > > rtnl lock does fix the initial deadlock. With that the trylock/restart
> > > logic can be removed in a following-up patch.  
> >   
> > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > index a1eab120bb50..d774fbec5d63 100644
> > > --- a/net/core/dev.c
> > > +++ b/net/core/dev.c
> > > @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
> > >                       continue;
> > >               }
> > >  
> > > +             device_del(&dev->dev);
> > > +
> > >               dev->reg_state = NETREG_UNREGISTERED;
> > >  
> > >               netdev_wait_allrefs(dev);
> > > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > > index 21c3fdeccf20..e754f00c117b 100644
> > > --- a/net/core/net-sysfs.c
> > > +++ b/net/core/net-sysfs.c
> > > @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
> > >       remove_queue_kobjects(ndev);
> > >  
> > >       pm_runtime_set_memalloc_noio(dev, false);
> > > -
> > > -     device_del(dev);
> > >  }
> > >  
> > >  /* Create sysfs entries for network device. */  
> > 
> > Doesn't this mean there may be sysfs files which are accessible 
> > for an unregistered netdevice?  
> 
> It would mean having accessible sysfs files for a device in the
> NETREG_UNREGISTERING state; NETREG_UNREGISTERED still comes after
> device_del. It's a small difference but still important, I think.
> 
> You raise a good point. Yes, that would mean accessing attributes of net
> devices being unregistered, meaning accessing or modifying unused or
> obsolete parameters and data (it shouldn't be garbage data though).
> Unlisting those sysfs files without removing them would be better here,
> to not expose files when the device is being unregistered while still
> allowing pending operations to complete. I don't know if that is doable
> in sysfs.

I wonder. Do we somehow remove the queue objects without waiting or are
those also waited on when we remove the device? 'Cause XPS is the part
that jumps out to me - we reset XPS after netdev_unregister_kobject().
Does it mean user can re-instate XPS settings after we thought we
already reset them?

> (While I did ran stress tests reading/writing attributes while
> unregistering devices, I think I missed an issue with the
> netdev_queue_default attributes; which hopefully can be fixed — if the
> whole idea is deemed acceptable).

Well, it's a little wobbly but I think the direction is sane.
It wouldn't feel super clean to add

	if (dev->state != NETREG_REGISTERED)
		goto out;

to the sysfs handlers but maybe it's better than leaving potential
traps for people who are not aware of the intricacies later? Not sure.

> > Isn't the point of having device_del() under rtnl_lock() to make sure
> > we sysfs handlers can't run on dead devices?  
> 
> Hard to say what was the initial point, there is a lot of history here
> :) I'm not sure it was done because of a particular reason; IMHO it just
> made sense to make this simple without having a good reason not to do
> so. And it helped with the naming collision detection.

FWIW the other two pieces of feedback I have is try to avoid the
synchronize_net() in patch 7 and add a new helper for the name
checking, which would return bool. The callers don't have any 
business getting the struct.
Antoine Tenart Sept. 29, 2021, 5:31 p.m. UTC | #4
Quoting Jakub Kicinski (2021-09-29 15:31:26)
> On Wed, 29 Sep 2021 10:26:35 +0200 Antoine Tenart wrote:
> > Quoting Jakub Kicinski (2021-09-29 02:02:29)
> > > On Tue, 28 Sep 2021 14:54:59 +0200 Antoine Tenart wrote:  
> > > > The sysfs removal is done in device_del, and moving it outside of the
> > > > rtnl lock does fix the initial deadlock. With that the trylock/restart
> > > > logic can be removed in a following-up patch.  
> > >   
> > > > diff --git a/net/core/dev.c b/net/core/dev.c
> > > > index a1eab120bb50..d774fbec5d63 100644
> > > > --- a/net/core/dev.c
> > > > +++ b/net/core/dev.c
> > > > @@ -10593,6 +10593,8 @@ void netdev_run_todo(void)
> > > >                       continue;
> > > >               }
> > > >  
> > > > +             device_del(&dev->dev);
> > > > +
> > > >               dev->reg_state = NETREG_UNREGISTERED;
> > > >  
> > > >               netdev_wait_allrefs(dev);
> > > > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> > > > index 21c3fdeccf20..e754f00c117b 100644
> > > > --- a/net/core/net-sysfs.c
> > > > +++ b/net/core/net-sysfs.c
> > > > @@ -1955,8 +1955,6 @@ void netdev_unregister_kobject(struct net_device *ndev)
> > > >       remove_queue_kobjects(ndev);
> > > >  
> > > >       pm_runtime_set_memalloc_noio(dev, false);
> > > > -
> > > > -     device_del(dev);
> > > >  }
> > > >  
> > > >  /* Create sysfs entries for network device. */  
> > > 
> > > Doesn't this mean there may be sysfs files which are accessible 
> > > for an unregistered netdevice?  
> > 
> > It would mean having accessible sysfs files for a device in the
> > NETREG_UNREGISTERING state; NETREG_UNREGISTERED still comes after
> > device_del. It's a small difference but still important, I think.
> > 
> > You raise a good point. Yes, that would mean accessing attributes of net
> > devices being unregistered, meaning accessing or modifying unused or
> > obsolete parameters and data (it shouldn't be garbage data though).
> > Unlisting those sysfs files without removing them would be better here,
> > to not expose files when the device is being unregistered while still
> > allowing pending operations to complete. I don't know if that is doable
> > in sysfs.
> 
> I wonder. Do we somehow remove the queue objects without waiting or are
> those also waited on when we remove the device? 'Cause XPS is the part
> that jumps out to me - we reset XPS after netdev_unregister_kobject().
> Does it mean user can re-instate XPS settings after we thought we
> already reset them?

This should be possible yes (and not really wanted).

> Well, it's a little wobbly but I think the direction is sane.
> It wouldn't feel super clean to add
> 
>         if (dev->state != NETREG_REGISTERED)
>                 goto out;
> 
> to the sysfs handlers but maybe it's better than leaving potential
> traps for people who are not aware of the intricacies later? Not sure.

Agreed, that doesn't feel super clean, but would be quite nice to have
for users (and e.g. would also help in the XPS case). Having a wrapper
should be possible, to minimize the impact and make it a bit better.

> > (While I did ran stress tests reading/writing attributes while
> > unregistering devices, I think I missed an issue with the
> > netdev_queue_default attributes; which hopefully can be fixed — if the
> > whole idea is deemed acceptable).

I had a quick look about queue attributes, their removal should also be
done in run_todo (that's easy). However the queues can be updated in
flight (while holding the rtnl lock) and the error paths[1][2] do drain
sysfs files (in kobject_put).

We can't release the rtnl lock here. It should be possible to delay this
outside the rtnl lock (in the global workqueue) but as the kobject are
embedded in the queues, we might need to have them live outside to allow
async releases while a net device (and ->_rx/->_tx) is being freed[3].
That adds to the complexity...

[1] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1662
[2] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1067
[3] Or having a dedicated workqueue and draining it.

> > > Isn't the point of having device_del() under rtnl_lock() to make sure
> > > we sysfs handlers can't run on dead devices?  
> > 
> > Hard to say what was the initial point, there is a lot of history here
> > :) I'm not sure it was done because of a particular reason; IMHO it just
> > made sense to make this simple without having a good reason not to do
> > so. And it helped with the naming collision detection.
> 
> FWIW the other two pieces of feedback I have is try to avoid the
> synchronize_net() in patch 7

I wasn't too happy in adding a call to this. However the node name list
is rcu protected and to make sure all CPUs see the removal before
freeing the node name a call to synchronize_net (synchronize_rcu) is
needed. That being said I think we can just call kfree_rcu instead of
netdev_name_node_free (kfree) here.

> add a new helper for the name checking, which would return bool. The
> callers don't have any business getting the struct.

Good idea. Also I think the replacement of __dev_get_by_name by a new
wrapper might be good even outside of this series (in case the series is
delayed / reworked heavily / etc).

Thanks!
Antoine
Antoine Tenart Oct. 5, 2021, 3:21 p.m. UTC | #5
Quoting Jakub Kicinski (2021-09-29 15:31:26)
> 
> Well, it's a little wobbly but I think the direction is sane.

> FWIW the other two pieces of feedback I have is try to avoid the
> synchronize_net() in patch 7 and add a new helper for the name
> checking, which would return bool. The callers don't have any 
> business getting the struct.

I'll work on an RFC v2 including modifications discussed in this thread
(especially the ones raised about queues attributes; investigating if it
can be fixed). I might send the patches about the name checking helper
separately to reduce the size of the series, as I think they bring value
outside of it.

(In the meantime suggestions or reviews from others are still welcomed).

BTW, what are your thoughts on patch 1? It is not strictly linked to the
others (or to other solutions that might arise).

Thanks!
Antoine
Jakub Kicinski Oct. 5, 2021, 6:34 p.m. UTC | #6
On Tue, 05 Oct 2021 17:21:59 +0200 Antoine Tenart wrote:
> BTW, what are your thoughts on patch 1? It is not strictly linked to the
> others (or to other solutions that might arise).

IMO perfectly reasonable, just needs a standalone repost.
Antoine Tenart Oct. 29, 2021, 9:04 a.m. UTC | #7
Quoting Antoine Tenart (2021-09-29 19:31:56)
> Quoting Jakub Kicinski (2021-09-29 15:31:26)
> > On Wed, 29 Sep 2021 10:26:35 +0200 Antoine Tenart wrote:
> 
> > > (While I did ran stress tests reading/writing attributes while
> > > unregistering devices, I think I missed an issue with the
> > > netdev_queue_default attributes; which hopefully can be fixed — if the
> > > whole idea is deemed acceptable).
> 
> I had a quick look about queue attributes, their removal should also be
> done in run_todo (that's easy). However the queues can be updated in
> flight (while holding the rtnl lock) and the error paths[1][2] do drain
> sysfs files (in kobject_put).
> 
> We can't release the rtnl lock here. It should be possible to delay this
> outside the rtnl lock (in the global workqueue) but as the kobject are
> embedded in the queues, we might need to have them live outside to allow
> async releases while a net device (and ->_rx/->_tx) is being freed[3].
> That adds to the complexity...
> 
> [1] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1662
> [2] https://elixir.bootlin.com/linux/latest/source/net/core/net-sysfs.c#L1067
> [3] Or having a dedicated workqueue and draining it.

I got back to this and while all other suggestions where easy enough to
get right, handling the queue sysfs files was not... The explanation is
the same for Tx and Rx queues.

Sysfs queue files are special: in addition to being created at device
registration and removed at unregistration, they also can be
reconfigured (added & removed) in-flight. This is done under the rtnl
lock. So for those sysfs files the trylock/restart construction also
helps in not hitting a deadlock while removing queues in-flight.

To make this work here, I had to delay the queue removal outside the
rtnl lock. As the queues are statically allocated in net_device->_tx, I
made them dynamically allocated to allow delaying their removal outside
the rtnl lock (in a workqueue for example).

This worked for allowing the removal of queues not to hit the ABBA
deadlock. (Extra logic to drain the queues before device removal might
be needed too). But this introduced an issue as naming collision between
queues was now possible (if we tried registering new queues while old
ones weren't unregistered yet). This is because the unregistration was
delayed, and for this to work the reconfiguration of queues should be
atomic under the rtnl lock; which is exactly what the changes to not hit
the ABBA deadlock requires... There are contradicting goals here.

This is not really fixable IMHO. First we would need to add a naming
collision logic for queues only, but then we don't have the same
two-step unregistration logic as we have for net devices. And failing on
queues reconfigurations for this doesn't seem acceptable to me (this has
a lot of implications, many "users" can request queues registration &
unregistration). Plus the complexity starts to be quite noticeable,
which doesn't help maintenance.

The above looks like a dead end to me. I tried several approaches to
better handle the queues in sysfs, but couldn't find a solution not
hitting an issue one way or another.

I have however a few other ideas, that may or may not be acceptable.
I'll start a dedicated answer to this thread to discuss those.

Thanks,
Antoine
diff mbox series

Patch

diff --git a/net/core/dev.c b/net/core/dev.c
index a1eab120bb50..d774fbec5d63 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10593,6 +10593,8 @@  void netdev_run_todo(void)
 			continue;
 		}
 
+		device_del(&dev->dev);
+
 		dev->reg_state = NETREG_UNREGISTERED;
 
 		netdev_wait_allrefs(dev);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 21c3fdeccf20..e754f00c117b 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1955,8 +1955,6 @@  void netdev_unregister_kobject(struct net_device *ndev)
 	remove_queue_kobjects(ndev);
 
 	pm_runtime_set_memalloc_noio(dev, false);
-
-	device_del(dev);
 }
 
 /* Create sysfs entries for network device. */