diff mbox series

[wpan-next,09/20] net: mac802154: Introduce a global device lock

Message ID 20220701143052.1267509-10-miquel.raynal@bootlin.com (mailing list archive)
State Superseded
Headers show
Series net: ieee802154: Support scanning/beaconing | expand

Commit Message

Miquel Raynal July 1, 2022, 2:30 p.m. UTC
The purpose of this device lock is to prevent the removal of the device
while an asynchronous MLME operation happens. The RTNL works well for
that but in a later series having the RTNL taken here will be
problematic and will cause lockdep to warn us about a circular
dependency. We don't really need the RTNL here, just a serialization
over this operation.

Replace the RTNL calls with this new lock.

Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
---
 net/mac802154/ieee802154_i.h |  2 ++
 net/mac802154/iface.c        |  4 ++++
 net/mac802154/main.c         |  1 +
 net/mac802154/tx.c           | 12 ++++++------
 4 files changed, 13 insertions(+), 6 deletions(-)

Comments

Alexander Aring July 4, 2022, 1:12 a.m. UTC | #1
Hi,

On Fri, Jul 1, 2022 at 10:36 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
>
> The purpose of this device lock is to prevent the removal of the device
> while an asynchronous MLME operation happens. The RTNL works well for
> that but in a later series having the RTNL taken here will be
> problematic and will cause lockdep to warn us about a circular
> dependency. We don't really need the RTNL here, just a serialization
> over this operation.
>
> Replace the RTNL calls with this new lock.

I am unhappy about this solution. Can we not interrupt the ongoing
operation "scan" here and come to an end at a stop?

The RTNL is NOT only to prevent the removal of something... If mostly
all operations are protected by and I know one which makes trouble
here... setting page/channel. I know we don't hold the rtnl lock on
other transmit functionality for phy settings which has other reasons
why we allow it... but here we offer a mac operation which delivers
wrong results if somebody does another setting e.g. set page/channel
while scan is going on and we should prevent this.

Dropping the rtnl lock, yes we can do that... I cannot think about all
the side effects which this change will bring into, one I know is the
channel setting, mostly everything that is interfering with a scan and
then ugly things which we don't want... preparing the code for the
page/channel gives us a direction on how to fix and check the other
cases if we find them. btw: we should do this on another approach
anyway because the rtnl lock is not held during a whole operation and
we don't want that.

We should also take care that we hold some references which we held
during the scan, even if it's protected by stop (just for
correctness).

- Alex
Miquel Raynal Aug. 19, 2022, 5:06 p.m. UTC | #2
Hi Alexander,

I hope you've had a wonderful summer :-)

aahringo@redhat.com wrote on Sun, 3 Jul 2022 21:12:43 -0400:

> Hi,
> 
> On Fri, Jul 1, 2022 at 10:36 AM Miquel Raynal <miquel.raynal@bootlin.com> wrote:
> >
> > The purpose of this device lock is to prevent the removal of the device
> > while an asynchronous MLME operation happens. The RTNL works well for
> > that but in a later series having the RTNL taken here will be
> > problematic and will cause lockdep to warn us about a circular
> > dependency. We don't really need the RTNL here, just a serialization
> > over this operation.
> >
> > Replace the RTNL calls with this new lock.  
> 
> I am unhappy about this solution. Can we not interrupt the ongoing
> operation "scan" here and come to an end at a stop?
> 
> The RTNL is NOT only to prevent the removal of something... If mostly
> all operations are protected by and I know one which makes trouble
> here... setting page/channel. I know we don't hold the rtnl lock on
> other transmit functionality for phy settings which has other reasons
> why we allow it... but here we offer a mac operation which delivers
> wrong results if somebody does another setting e.g. set page/channel
> while scan is going on and we should prevent this.
> 
> Dropping the rtnl lock, yes we can do that... I cannot think about all
> the side effects which this change will bring into, one I know is the
> channel setting, mostly everything that is interfering with a scan and
> then ugly things which we don't want... preparing the code for the
> page/channel gives us a direction on how to fix and check the other
> cases if we find them. btw: we should do this on another approach
> anyway because the rtnl lock is not held during a whole operation and
> we don't want that.
> 
> We should also take care that we hold some references which we held
> during the scan, even if it's protected by stop (just for
> correctness).

I was also a bit unhappy by this solution but the rtnl is a real mess
when playing with background works. At least I was not able at all to
make it fit. I'm gonna try to summarize the situation to argue in favor
of the current solution, but I am really open if you see another way.

A scan is started by the user, through a netlink command. It basically
involves stopping any other activity on the transceiver, setting a
particular filtering mode, and possibly sending beacons through the MLME
Tx API at a regular interval.

A scan command from the user then involves acquiring the rtnl just to
be sure that nothing else is requested in parallel. The rtnl is taken
and released by the netlink core, just for the time of the
configuring/triggering action.

We absolutely do not want to keep the rtnl here, I believe we are
aligned on that. This means we need to protect ourselves against a
number of user actions:
1- dropping the device (without stopping the background job/cleaning
   everything),
2- transmitting packets
3- changing internal parameters such as the page/channel to avoid
   messing with the ongoing scan.

The current implementation does the following:
1- in the ieee802154 layer we call dev_hold/dev_put to prevent device
   removal,
2- in the soft mac layer we stop the queue,
3- in the soft mac layer we refuse any channel change command coming
   from the netlink layer during scans, because this is not a nl
   constraint, but a mac state constraint, so I think it is safe to
   handle that from the soft mac layer rather than at the nl level.

This is how I planned to handle the refcount and channel change issues.

Now, let me try to argue in favor of this commit.

The problem I faced was a circular dependency on the device sending
beacons or beacons requests, ie. sending MLME frames in the background.
For the record, in both cases, I need to put some parameters in one of
the main soft mac structures. I created local->scan_lock and
local->beacon_lock to protect accesses to the scanning and beaconing
structures respectively (we don't want eg. the struct to be freed while
a job is using it).

Let's take the situation of the device sending beacons in the
background.

For starting to send beacons, the user sends a netlink command. In the
kernel first layers, the rtnl is acquired (almost) automatically, then
the callback function in the soft mac does the job. One of the first
operations is to acquire the beacons_lock.

Lockdep detects that during the background operation, the kworker will
first acquire beacons_lock (it encloses the whole operation) and after
acquiring this first lock it will perform an MLME Tx to send the
beacon. But this, unfortunately, acquires the rtnl, which triggers the
following warning:

[ 1445.105706]  Possible unsafe locking scenario:
//               -> background job            -> nl802154_send_beacons()   
[ 1445.105707]        CPU0                    CPU1
[ 1445.105708]        ----                    ----
[ 1445.105709]   lock(&local->beacon_lock);
[ 1445.105710]                                lock(rtnl_mutex);
[ 1445.105712]                                lock(&local->beacon_lock);
[ 1445.105713]   lock(rtnl_mutex);

Exactly the same happens in the scanning path during active scans:

[   52.518741]  Possible unsafe locking scenario:
//               -> background job            -> nl802154_trigger_scan()
[   52.518742]        CPU0                    CPU1
[   52.518743]        ----                    ----
[   52.518744]   lock(&local->scan_lock);
[   52.518746]                                lock(rtnl_mutex);
[   52.518748]                                lock(&local->scan_lock);
[   52.518750]   lock(rtnl_mutex);

In practice I doubt these situations can really happen because there is
no background job running if the triggering netlink command was not
yet called, but anyway, I feel too weak against locking scenarios
to disobey such a clear lockdep warning :-)

So, from my understanding it was safe not to acquire the rtnl in the
MLME Tx path, as long as the calls were serialized (with another
mutex). You seem not to agree with it, which I completely understand,
but then how do I handle those circular dependencies?

Do you think like me they are false positives?

Thanks,
Miquèl
diff mbox series

Patch

diff --git a/net/mac802154/ieee802154_i.h b/net/mac802154/ieee802154_i.h
index 010365a6364e..b8775bcc9003 100644
--- a/net/mac802154/ieee802154_i.h
+++ b/net/mac802154/ieee802154_i.h
@@ -29,7 +29,9 @@  struct ieee802154_local {
 	/* ieee802154 phy */
 	struct wpan_phy *phy;
 
+	/* Open/close counter and lock */
 	int open_count;
+	struct mutex device_lock;
 
 	/* As in mac80211 slaves list is modified:
 	 * 1) under the RTNL
diff --git a/net/mac802154/iface.c b/net/mac802154/iface.c
index 7ac0c5685d3f..7715e17d9ba1 100644
--- a/net/mac802154/iface.c
+++ b/net/mac802154/iface.c
@@ -315,11 +315,15 @@  static int mac802154_slave_close(struct net_device *dev)
 
 	ASSERT_RTNL();
 
+	mutex_lock(&local->device_lock);
+
 	netif_stop_queue(dev);
 	local->open_count--;
 
 	clear_bit(SDATA_STATE_RUNNING, &sdata->state);
 
+	mutex_unlock(&local->device_lock);
+
 	if (!local->open_count)
 		ieee802154_stop_device(local);
 
diff --git a/net/mac802154/main.c b/net/mac802154/main.c
index a2da9d4c5273..e5fb7ed73663 100644
--- a/net/mac802154/main.c
+++ b/net/mac802154/main.c
@@ -90,6 +90,7 @@  ieee802154_alloc_hw(size_t priv_data_len, const struct ieee802154_ops *ops)
 
 	INIT_LIST_HEAD(&local->interfaces);
 	mutex_init(&local->iflist_mtx);
+	mutex_init(&local->device_lock);
 
 	tasklet_setup(&local->tasklet, ieee802154_tasklet_handler);
 
diff --git a/net/mac802154/tx.c b/net/mac802154/tx.c
index 9d8d43cf1e64..fb555797f326 100644
--- a/net/mac802154/tx.c
+++ b/net/mac802154/tx.c
@@ -143,14 +143,14 @@  int ieee802154_mlme_tx(struct ieee802154_local *local,
 {
 	int ret;
 
-	/* Avoid possible calls to ->ndo_stop() when we asynchronously perform
-	 * MLME transmissions.
+	/* Serialize possible calls to ->ndo_stop() when we asynchronously
+	 * perform MLME transmissions.
 	 */
-	rtnl_lock();
+	mutex_lock(&local->device_lock);
 
 	/* Ensure the device was not stopped, otherwise error out */
 	if (!local->open_count) {
-		rtnl_unlock();
+		mutex_unlock(&local->device_lock);
 		return -ENETDOWN;
 	}
 
@@ -158,14 +158,14 @@  int ieee802154_mlme_tx(struct ieee802154_local *local,
 	 * net interface expects this cannot happen.
 	 */
 	if (WARN_ON_ONCE(!netif_running(sdata->dev))) {
-		rtnl_unlock();
+		mutex_unlock(&local->device_lock);
 		return -ENETDOWN;
 	}
 
 	ieee802154_tx(local, skb);
 	ret = ieee802154_sync_queue(local);
 
-	rtnl_unlock();
+	mutex_unlock(&local->device_lock);
 
 	return ret;
 }