diff mbox

mac80211: handle hw roc cancel vs. expiration race

Message ID 1425902001-3163-1-git-send-email-eliad@wizery.com (mailing list archive)
State Changes Requested
Delegated to: Johannes Berg
Headers show

Commit Message

Eliad Peller March 9, 2015, 11:53 a.m. UTC
mac80211 handles roc expirations (indicated by
ieee80211_remain_on_channel_expired()) with a separate
work (hw_roc_done).

If userspace asks to cancel the ongoing roc before
hw_roc_done was scheduled, mac80211 will cancel
the ongoing roc, while leaving the hw_roc_done work
pending.

This might end up in the next roc being cancelled
by the stale work, causing mac80211 and the low-level
driver to go out of sync.

Fix it by cancelling hw_roc_done on roc cancel.
Since hw_roc_done takes local->mtx, temporarily
release it (before re-acquiring it and starting
the next roc if needed).

Signed-off-by: Eliad Peller <eliad@wizery.com>
---
 net/mac80211/cfg.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

Comments

Johannes Berg March 9, 2015, 12:52 p.m. UTC | #1
On Mon, 2015-03-09 at 13:53 +0200, Eliad Peller wrote:

> Fix it by cancelling hw_roc_done on roc cancel.
> Since hw_roc_done takes local->mtx, temporarily
> release it (before re-acquiring it and starting
> the next roc if needed).

I can't say I like this, and I'm not even sure it doesn't leave a race?
After all, the work will not take the RTNL.

Also, I think the whole concept is racy because you acquire "found"
before the unlock, but use it after the unlock, so if the work actually
ran (_sync, perhaps it already started) it may have freed the "found"
pointer.

I think the only way to really fix that is to make the driver return the
roc pointer or so and then we can either queue a work struct per roc
struct, or at least mark the roc struct as driver-destroyed and
double-check that in the work function?

Or perhaps we could flush the work before we even take the lock, but
then it might still race with the driver trying to cancel just after we
flushed I guess.

johannes


--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eliad Peller March 9, 2015, 1:59 p.m. UTC | #2
On Mon, Mar 9, 2015 at 2:52 PM, Johannes Berg <johannes@sipsolutions.net> wrote:
> On Mon, 2015-03-09 at 13:53 +0200, Eliad Peller wrote:
>
>> Fix it by cancelling hw_roc_done on roc cancel.
>> Since hw_roc_done takes local->mtx, temporarily
>> release it (before re-acquiring it and starting
>> the next roc if needed).
>
> I can't say I like this, and I'm not even sure it doesn't leave a race?
> After all, the work will not take the RTNL.
>
the other way (cancel while hw_roc_done) is protected, since
ieee80211_cancel_roc() looks for a specific roc cookie.

> Also, I think the whole concept is racy because you acquire "found"
> before the unlock, but use it after the unlock, so if the work actually
> ran (_sync, perhaps it already started) it may have freed the "found"
> pointer.
>
hw_roc_done work bails out in case of no started roc, so i think we're
safe here.

> I think the only way to really fix that is to make the driver return the
> roc pointer or so and then we can either queue a work struct per roc
> struct, or at least mark the roc struct as driver-destroyed and
> double-check that in the work function?
>
that was my initial implementation. but then i thought it will be
nicer to implement it in mac80211 and avoid changing all the drivers
(with the need to save the cookie, etc.)
do you prefer adding such cookie/pointer param?

> Or perhaps we could flush the work before we even take the lock, but
> then it might still race with the driver trying to cancel just after we
> flushed I guess.
right.

Eliad.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Johannes Berg May 6, 2015, 1:15 p.m. UTC | #3
To revive an old thread - sorry...

> >> Fix it by cancelling hw_roc_done on roc cancel.
> >> Since hw_roc_done takes local->mtx, temporarily
> >> release it (before re-acquiring it and starting
> >> the next roc if needed).
> >
> > I can't say I like this, and I'm not even sure it doesn't leave a race?
> > After all, the work will not take the RTNL.
> >
> the other way (cancel while hw_roc_done) is protected, since
> ieee80211_cancel_roc() looks for a specific roc cookie.

But you're saying that we rely on RTNL to not modify the list, and the
hw_roc_done work doesn't use the RTNL? Hmm. Are you saying now that it
also checks that it was started? But then why do we even have this
issue?

Or I can see why we have this issue, but perhaps we could fix it by
flushing the work *after* we cancel, so in case the driver was
cancelling at the same time we'd not yet have started the next one?

> > I think the only way to really fix that is to make the driver return the
> > roc pointer or so and then we can either queue a work struct per roc
> > struct, or at least mark the roc struct as driver-destroyed and
> > double-check that in the work function?
> >
> that was my initial implementation. but then i thought it will be
> nicer to implement it in mac80211 and avoid changing all the drivers
> (with the need to save the cookie, etc.)
> do you prefer adding such cookie/pointer param?

How difficult was it for the driver(s)? It seems it would make the code
far easier to reason about, which can only be a good thing. We'd save
the whole race discussion above :)

johannes

--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eliad Peller May 7, 2015, 9:58 a.m. UTC | #4
On Wed, May 6, 2015 at 4:15 PM, Johannes Berg <johannes@sipsolutions.net> wrote:
> To revive an old thread - sorry...
>
>> >> Fix it by cancelling hw_roc_done on roc cancel.
>> >> Since hw_roc_done takes local->mtx, temporarily
>> >> release it (before re-acquiring it and starting
>> >> the next roc if needed).
>> >
>> > I can't say I like this, and I'm not even sure it doesn't leave a race?
>> > After all, the work will not take the RTNL.
>> >
>> the other way (cancel while hw_roc_done) is protected, since
>> ieee80211_cancel_roc() looks for a specific roc cookie.
>
> But you're saying that we rely on RTNL to not modify the list, and the
> hw_roc_done work doesn't use the RTNL? Hmm. Are you saying now that it
> also checks that it was started? But then why do we even have this
> issue?
>
sorry, i'll have to recall the exact details...

> Or I can see why we have this issue, but perhaps we could fix it by
> flushing the work *after* we cancel, so in case the driver was
> cancelling at the same time we'd not yet have started the next one?
>
that's basically what i did - canceled the work after
drv_cancel_remain_on_channel() was called.

>> > I think the only way to really fix that is to make the driver return the
>> > roc pointer or so and then we can either queue a work struct per roc
>> > struct, or at least mark the roc struct as driver-destroyed and
>> > double-check that in the work function?
>> >
>> that was my initial implementation. but then i thought it will be
>> nicer to implement it in mac80211 and avoid changing all the drivers
>> (with the need to save the cookie, etc.)
>> do you prefer adding such cookie/pointer param?
>
> How difficult was it for the driver(s)? It seems it would make the code
> far easier to reason about, which can only be a good thing. We'd save
> the whole race discussion above :)
>
sure. let's forget about this discussion and go with the clearer solution :)

Eliad.
--
To unsubscribe from this list: send the line "unsubscribe linux-wireless" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index dd4ff36..a80f7b3 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -2782,7 +2782,22 @@  static int ieee80211_cancel_roc(struct ieee80211_local *local,
 		}
 
 		list_del(&found->list);
+		mutex_unlock(&local->mtx);
+
+		/*
+		 * The driver might have already expired the ROC (by
+		 * calling ieee80211_remain_on_channel_expired()), so
+		 * cancel the pending roc_done work (otherwise, it might
+		 * cancel the upcoming roc).
+		 */
+		cancel_work_sync(&local->hw_roc_done);
 
+		mutex_lock(&local->mtx);
+		/*
+		 * We hold rtnl, so nothing should have been able to
+		 * manipulate the roc_list during the temporary
+		 * lock release.
+		 */
 		if (found->started)
 			ieee80211_start_next_roc(local);
 		mutex_unlock(&local->mtx);