diff mbox series

[wpan-next,19/20] ieee802154: hwsim: Do not check the rtnl

Message ID 20220701143052.1267509-20-miquel.raynal@bootlin.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series net: ieee802154: Support scanning/beaconing | expand

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Miquel Raynal July 1, 2022, 2:30 p.m. UTC
There is no need to ensure the rtnl is locked when changing a driver's
channel. This cause issues when scanning and this is the only driver
relying on it. Just drop this dependency because it does not seem
legitimate.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
---
 drivers/net/ieee802154/mac802154_hwsim.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Alexander Aring July 6, 2022, 1:23 a.m. UTC | #1
Hi,

On Fri, Jul 1, 2022 at 10:37 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> There is no need to ensure the rtnl is locked when changing a driver's
> channel. This cause issues when scanning and this is the only driver
> relying on it. Just drop this dependency because it does not seem
> legitimate.
>

switching channels relies on changing pib attributes, pib attributes
are protected by rtnl. If you experience issues here then it's
probably because you do something wrong. All drivers assuming here
that rtnl lock is held.

- Alex
Alexander Aring Aug. 1, 2022, 11:58 p.m. UTC | #2
Hi,

On Tue, Jul 5, 2022 at 9:23 PM Alexander Aring <aahringo@redhat.com> wrote:
>
> Hi,
>
> On Fri, Jul 1, 2022 at 10:37 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > There is no need to ensure the rtnl is locked when changing a driver's
> > channel. This cause issues when scanning and this is the only driver
> > relying on it. Just drop this dependency because it does not seem
> > legitimate.
> >
>
> switching channels relies on changing pib attributes, pib attributes
> are protected by rtnl. If you experience issues here then it's
> probably because you do something wrong. All drivers assuming here
> that rtnl lock is held.

especially this change could end in invalid free. Maybe we can solve
this problem in a different way, what exactly is the problem by
helding rtnl lock?

- Alex
Miquel Raynal Aug. 19, 2022, 5:09 p.m. UTC | #3
Hi Alexander,

aahringo@redhat.com wrote on Tue, 5 Jul 2022 21:23:21 -0400:

> Hi,
> 
> On Fri, Jul 1, 2022 at 10:37 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > There is no need to ensure the rtnl is locked when changing a driver's
> > channel. This cause issues when scanning and this is the only driver
> > relying on it. Just drop this dependency because it does not seem
> > legitimate.
> >  
> 
> switching channels relies on changing pib attributes, pib attributes
> are protected by rtnl. If you experience issues here then it's
> probably because you do something wrong. All drivers assuming here
> that rtnl lock is held.

---8<---
> especially this change could end in invalid free. Maybe we can solve
> this problem in a different way, what exactly is the problem by
> helding rtnl lock?
--->8---

During a scan we need to change channels. So when the background job
kicks-in, we first acquire scan_lock, then we check the internal
parameters of the structure protected by this lock, like the next
channel to use and the sdata pointer. A channel change must be
performed, preceded by an rtnl_lock(). This will again trigger a
possible circular lockdep dependency warning because the triggering path
acquires the rtnl (as part of the netlink layer) before the scan lock.

One possible solution would be to do the following:
scan_work() {
	acquire(scan_lock);
	// do some config
	release(scan_lock);
	rtnl_lock();
	perform_channel_change();
	rtnl_unlock();
	acquire(scan_lock);
	// reinit the scan struct ptr and the sdata ptr
	// do some more things
	release(scan_lock);
}

It looks highly non-elegant IMHO. Otherwise I need to stop verifying in
the drivers that the rtnl is taken. Any third option here?

BTW I don't see where the invalid free situation you mentioned can
happen here, can you explain?

Thanks,
Miquèl
Miquel Raynal Aug. 25, 2022, 10:41 p.m. UTC | #4
Hi Alexander,

miquel.raynal@bootlin.com wrote on Fri, 19 Aug 2022 19:09:44 +0200:

> Hi Alexander,
> 
> aahringo@redhat.com wrote on Tue, 5 Jul 2022 21:23:21 -0400:
> 
> > Hi,
> > 
> > On Fri, Jul 1, 2022 at 10:37 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:  
> > >
> > > There is no need to ensure the rtnl is locked when changing a driver's
> > > channel. This cause issues when scanning and this is the only driver
> > > relying on it. Just drop this dependency because it does not seem
> > > legitimate.
> > >    
> > 
> > switching channels relies on changing pib attributes, pib attributes
> > are protected by rtnl. If you experience issues here then it's
> > probably because you do something wrong. All drivers assuming here
> > that rtnl lock is held.  
> 
> ---8<---
> > especially this change could end in invalid free. Maybe we can solve
> > this problem in a different way, what exactly is the problem by
> > helding rtnl lock?
> --->8---  
> 
> During a scan we need to change channels. So when the background job
> kicks-in, we first acquire scan_lock, then we check the internal
> parameters of the structure protected by this lock, like the next
> channel to use and the sdata pointer. A channel change must be
> performed, preceded by an rtnl_lock(). This will again trigger a
> possible circular lockdep dependency warning because the triggering path
> acquires the rtnl (as part of the netlink layer) before the scan lock.
> 
> One possible solution would be to do the following:
> scan_work() {
> 	acquire(scan_lock);
> 	// do some config
> 	release(scan_lock);
> 	rtnl_lock();
> 	perform_channel_change();
> 	rtnl_unlock();
> 	acquire(scan_lock);
> 	// reinit the scan struct ptr and the sdata ptr
> 	// do some more things
> 	release(scan_lock);
> }
> 
> It looks highly non-elegant IMHO. Otherwise I need to stop verifying in
> the drivers that the rtnl is taken. Any third option here?

I've tried two other solutions.

A/ Enforcing the dependency rtnl -> scan_lock

This means always acquiring the rtnl before scan_lock, and in terms of
code requires to take the rtnl in the scan worker. Of course enclosing
the drv_change_chan() call would mean releasing the scan_lock in the
middle and re-taking it after all, which would defeat the protection of
the scan_req structure which the lock is supposed to enforce. So I went
for acquiring the lock at the top, before acquiring scan_lock, of
course.

This does not work because we need to acquire the rtnl in the worker,
while at the same time there are places like ->slave_close which need
to acquire the worker lock (during flush_workqueue()) and this can only
happen under rtnl. Lockdep then complains about a possible circular
dependency.

B/ Avoiding the rtnl in scan operations and allowing the reverse
dependency, which is scan_lock -> rtnl

I've drafted this solution because I think the scan operation do not
really need the rtnl. This idea got reinforced when I found this
wireless change: a05829a7222e ("cfg80211: avoid holding the RTNL when
calling the driver").

But unfortunately I get the same issue again, with the ->close()
implementation which needs to acquire the worker lock to flush, this
makes a rtnl -> worker_lock dependency which is incompatible with a
worker_lock -> scan_lock -> rtnl chain (this is is typically what should
happen when changing the channel during a scan).

So I looked at reducing the scope of scan_lock, in order to avoid
taking it for too long and avoid the scan_lock -> rtnl or rtnl ->
scan_lock dependency in the worker, but I think in the end it is a
truly bad idea.

Finally, I decided I could use another workqueue for the mac related
commands which is not the one for the data. We don't care about
flushing it because we _need_ the beacons/scan workers to be stopped,
which is handled in their dedicated helpers. Doing so removes a rtnl ->
worker_lock dependency, which allows to acquire the rtnl from the
worker. I've mostly implemented it, I'll clean all this up and send a
v2 tomorrow.

Thanks,
Miquèl
diff mbox series

Patch

diff --git a/drivers/net/ieee802154/mac802154_hwsim.c b/drivers/net/ieee802154/mac802154_hwsim.c
index 38c217bd7c82..a5b9fc2fb64c 100644
--- a/drivers/net/ieee802154/mac802154_hwsim.c
+++ b/drivers/net/ieee802154/mac802154_hwsim.c
@@ -100,7 +100,7 @@  static int hwsim_hw_channel(struct ieee802154_hw *hw, u8 page, u8 channel)
 	pib->page = page;
 	pib->channel = channel;
 
-	pib_old = rtnl_dereference(phy->pib);
+	rcu_assign_pointer(pib_old, phy->pib);
 	rcu_assign_pointer(phy->pib, pib);
 	kfree_rcu(pib_old, rcu);
 	return 0;