diff mbox series

Missing wiphy lock in ieee80211_color_collision_detection_work

Message ID D4A40Q44OAY2.W3SIF6UEPBUN@freebox.fr (mailing list archive)
State Not Applicable
Delegated to: Johannes Berg
Headers show
Series Missing wiphy lock in ieee80211_color_collision_detection_work | expand

Commit Message

Nicolas Escande Sept. 19, 2024, 8:15 a.m. UTC
Hi guys,

Running a pretty hacked up kernel under lockdep, I've had a few splats like this

	[   75.453180] Call trace:
	[   75.455595]  cfg80211_bss_color_notify+0x220/0x260
	[   75.460341]  ieee80211_color_collision_detection_work+0x4c/0x5c
	[   75.466205]  process_one_work+0x434/0x6ec
	[   75.470169]  worker_thread+0x9c/0x624
	[   75.473794]  kthread+0x1a4/0x1ac
	[   75.476987]  ret_from_fork+0x10/0x20

Which shows we are calling cfg80211_obss_color_collision_notify and thus
cfg80211_bss_color_notify from ieee80211_color_collision_detection_work without
holding the wiphy's mtx.

It seems that we should either explicitely lock the phy using something like


Or switch to using the wiphy_work_queue infrastructure.

Did I miss something ? Which one should we do ? I'm not sure of all the
implications of switching to the wiphy work queue and why it did not get
converted like the color_change_finalize_work stuff ?

Comments

Nicolas Cavallari Sept. 19, 2024, 10:02 a.m. UTC | #1
On 19/09/2024 10:15, Nicolas Escande wrote:
> Hi guys,
> 
> Running a pretty hacked up kernel under lockdep, I've had a few splats like this
> 
> 	[   75.453180] Call trace:
> 	[   75.455595]  cfg80211_bss_color_notify+0x220/0x260
> 	[   75.460341]  ieee80211_color_collision_detection_work+0x4c/0x5c
> 	[   75.466205]  process_one_work+0x434/0x6ec
> 	[   75.470169]  worker_thread+0x9c/0x624
> 	[   75.473794]  kthread+0x1a4/0x1ac
> 	[   75.476987]  ret_from_fork+0x10/0x20
> 
> Which shows we are calling cfg80211_obss_color_collision_notify and thus
> cfg80211_bss_color_notify from ieee80211_color_collision_detection_work without
> holding the wiphy's mtx.
> 
> It seems that we should either explicitely lock the phy using something like
> 
> diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
> index 847304a3a29a..481e1550cb21 100644
> --- a/net/mac80211/cfg.c
> +++ b/net/mac80211/cfg.c
> @@ -4833,9 +4833,12 @@ void ieee80211_color_collision_detection_work(struct work_struct *work)
>   		container_of(delayed_work, struct ieee80211_link_data,
>   			     color_collision_detect_work);
>   	struct ieee80211_sub_if_data *sdata = link->sdata;
> +	struct ieee80211_local *local = sdata->local;
>   
> +	wiphy_lock(local->hw.wiphy);
>   	cfg80211_obss_color_collision_notify(sdata->dev, link->color_bitmap,
>   					     link->link_id);
> +	wiphy_unlock(local->hw.wiphy);
>   }
>   
>   void ieee80211_color_change_finish(struct ieee80211_vif *vif, u8 link_id)
> 
> Or switch to using the wiphy_work_queue infrastructure.
> 
> Did I miss something ? Which one should we do ? I'm not sure of all the
> implications of switching to the wiphy work queue and why it did not get
> converted like the color_change_finalize_work stuff ?

ieee80211_color_collision_detection_work() used to lock the wdev mutex, now it 
does not hold anything since 076fc8775da("wifi: cfg80211: remove wdev mutex")

Also the rate limiting uses delayed_work_pending(), There is no wiphy work queue 
equivalent AFAIK, so the explicit lock is probably the way to go.
Johannes Berg Sept. 19, 2024, 10:22 a.m. UTC | #2
On Thu, 2024-09-19 at 12:02 +0200, Nicolas Cavallari wrote:
> 
> > Did I miss something ? Which one should we do ? I'm not sure of all the
> > implications of switching to the wiphy work queue and why it did not get
> > converted like the color_change_finalize_work stuff ?
> 
> ieee80211_color_collision_detection_work() used to lock the wdev mutex, now it 
> does not hold anything since 076fc8775da("wifi: cfg80211: remove wdev mutex")
> 
> Also the rate limiting uses delayed_work_pending(), There is no wiphy work queue 
> equivalent AFAIK, so the explicit lock is probably the way to go.

That won't work, it's cancel_delayed_work_sync() under the wiphy mutex,
so that'll cause deadlocks (and should cause lockdep complaints about
them.)

johannes
Nicolas Escande Sept. 20, 2024, 7:05 a.m. UTC | #3
On Thu Sep 19, 2024 at 12:22 PM CEST, Johannes Berg wrote:
> > 
> > Also the rate limiting uses delayed_work_pending(), There is no wiphy work queue 
> > equivalent AFAIK, so the explicit lock is probably the way to go.
>
> That won't work, it's cancel_delayed_work_sync() under the wiphy mutex,
> so that'll cause deadlocks (and should cause lockdep complaints about
> them.)
>
> johannes

Ok then we'll go the wiphy work queue way.
Remi Pommarel Sept. 20, 2024, 8:28 a.m. UTC | #4
On Thu, Sep 19, 2024 at 12:22:10PM +0200, Johannes Berg wrote:
> On Thu, 2024-09-19 at 12:02 +0200, Nicolas Cavallari wrote:
> > 
> > > Did I miss something ? Which one should we do ? I'm not sure of all the
> > > implications of switching to the wiphy work queue and why it did not get
> > > converted like the color_change_finalize_work stuff ?
> > 
> > ieee80211_color_collision_detection_work() used to lock the wdev mutex, now it 
> > does not hold anything since 076fc8775da("wifi: cfg80211: remove wdev mutex")
> > 
> > Also the rate limiting uses delayed_work_pending(), There is no wiphy work queue 
> > equivalent AFAIK, so the explicit lock is probably the way to go.
> 
> That won't work, it's cancel_delayed_work_sync() under the wiphy mutex,
> so that'll cause deadlocks (and should cause lockdep complaints about
> them.)

Yes you are right, and AFAIU that is this kind of issue using wiphy work
queue would prevent. With wiphy work queue if wiphy_delayed_work_cancel
is called with wiphy lock held, work cannot be running after it returns;
making it handy to replace cancel_delayed_work_sync() with.

So, in my opinion, switching to wiphy work queue seems to be the
prefered solution here.

While there is no wiphy work queue equivalent of delayed_work_pending(),
I think using timer_pending(&link->color_collision_detect_work->timer)
to replace delayed_work_pending(), even if the semantic is a bit
different, would be ok to fulfill the rate limiting purpose. Having the
same delayed_work_pending() semantics on wiphy work queue would require
to take wiphy lock which seem a bit superfluous here.

Does that make sense ?

Thanks
Remi Pommarel Sept. 20, 2024, 8:41 a.m. UTC | #5
On Fri, Sep 20, 2024 at 10:28:55AM +0200, Remi Pommarel wrote:
> On Thu, Sep 19, 2024 at 12:22:10PM +0200, Johannes Berg wrote:
> > On Thu, 2024-09-19 at 12:02 +0200, Nicolas Cavallari wrote:
> > > 
> > > > Did I miss something ? Which one should we do ? I'm not sure of all the
> > > > implications of switching to the wiphy work queue and why it did not get
> > > > converted like the color_change_finalize_work stuff ?
> > > 
> > > ieee80211_color_collision_detection_work() used to lock the wdev mutex, now it 
> > > does not hold anything since 076fc8775da("wifi: cfg80211: remove wdev mutex")
> > > 
> > > Also the rate limiting uses delayed_work_pending(), There is no wiphy work queue 
> > > equivalent AFAIK, so the explicit lock is probably the way to go.
> > 
> > That won't work, it's cancel_delayed_work_sync() under the wiphy mutex,
> > so that'll cause deadlocks (and should cause lockdep complaints about
> > them.)
> 
> Yes you are right, and AFAIU that is this kind of issue using wiphy work
> queue would prevent. With wiphy work queue if wiphy_delayed_work_cancel
> is called with wiphy lock held, work cannot be running after it returns;
> making it handy to replace cancel_delayed_work_sync() with.
> 
> So, in my opinion, switching to wiphy work queue seems to be the
> prefered solution here.
> 
> While there is no wiphy work queue equivalent of delayed_work_pending(),
> I think using timer_pending(&link->color_collision_detect_work->timer)
> to replace delayed_work_pending(), even if the semantic is a bit
> different, would be ok to fulfill the rate limiting purpose. Having the
> same delayed_work_pending() semantics on wiphy work queue would require
> to take wiphy lock which seem a bit superfluous here.

Forgot to mention that I am currently running a wiphy work version of
ieee80211_color_collision_detection_work() using timer_pending() as a
delayed_work_pending() substitute since a couple of hours and lockdep
did not complain so far.
Johannes Berg Sept. 20, 2024, 10 a.m. UTC | #6
On Fri, 2024-09-20 at 10:41 +0200, Remi Pommarel wrote:
> > > That won't work, it's cancel_delayed_work_sync() under the wiphy mutex,
> > > so that'll cause deadlocks (and should cause lockdep complaints about
> > > them.)
> > 
> > Yes you are right, and AFAIU that is this kind of issue using wiphy work
> > queue would prevent. With wiphy work queue if wiphy_delayed_work_cancel
> > is called with wiphy lock held, work cannot be running after it returns;
> > making it handy to replace cancel_delayed_work_sync() with.

Right, and safe under the lock, since the lock is taken externally to
the actual work function. It either is completed or still on the list,
both cases are handled safely (do nothing and remove, respectively.)

> > So, in my opinion, switching to wiphy work queue seems to be the
> > prefered solution here.

Yes.

> > While there is no wiphy work queue equivalent of delayed_work_pending(),

Maybe we should add it?

> > I think using timer_pending(&link->color_collision_detect_work->timer)
> > to replace delayed_work_pending(), even if the semantic is a bit
> > different, would be ok to fulfill the rate limiting purpose.

I think you're right. We could as well check list_empty() or so, but it
wouldn't make a big difference. As you say:

> > Having the
> > same delayed_work_pending() semantics on wiphy work queue would require
> > to take wiphy lock

To really have it known _precisely_, that's true.

> >  which seem a bit superfluous here.

It's actually simply also not possible - if we could sleep in
ieee80211_obss_color_collision_notify() and take the wiphy mutex, we'd
probably have done this completely differently :)

And a hypothetical wiphy_delayed_work_pending() API should therefore not
be required to be called with the wiphy mutex held.

I think that perhaps we should add such a trivial inline instead of
open-coding the timer_pending() check, but I'm not sure whether or not
it should also check the list (i.e. check if timer has expired, but work
hasn't started yet): on the one hand it seems more appropriate, and if
actually holding the wiphy mutex it would in fact be completely correct,
on the other hand maybe it encourages being sloppily thinking the return
value is always perfect?

Right now I'd tend to have the check and document that it's only
guaranteed when under the wiphy mutex.


> Forgot to mention that I am currently running a wiphy work version of
> ieee80211_color_collision_detection_work() using timer_pending() as a
> delayed_work_pending() substitute since a couple of hours and lockdep
> did not complain so far.

:)

johannes
Nicolas Cavallari Sept. 20, 2024, 11:35 a.m. UTC | #7
On 20/09/2024 10:28, Remi Pommarel wrote:
> On Thu, Sep 19, 2024 at 12:22:10PM +0200, Johannes Berg wrote:
>> On Thu, 2024-09-19 at 12:02 +0200, Nicolas Cavallari wrote:
>>>
>>>> Did I miss something ? Which one should we do ? I'm not sure of all the
>>>> implications of switching to the wiphy work queue and why it did not get
>>>> converted like the color_change_finalize_work stuff ?
>>>
>>> ieee80211_color_collision_detection_work() used to lock the wdev mutex, now it
>>> does not hold anything since 076fc8775da("wifi: cfg80211: remove wdev mutex")
>>>
>>> Also the rate limiting uses delayed_work_pending(), There is no wiphy work queue
>>> equivalent AFAIK, so the explicit lock is probably the way to go.
>>
>> That won't work, it's cancel_delayed_work_sync() under the wiphy mutex,
>> so that'll cause deadlocks (and should cause lockdep complaints about
>> them.)
> 
> Yes you are right, and AFAIU that is this kind of issue using wiphy work
> queue would prevent. With wiphy work queue if wiphy_delayed_work_cancel
> is called with wiphy lock held, work cannot be running after it returns;
> making it handy to replace cancel_delayed_work_sync() with.
> 
> So, in my opinion, switching to wiphy work queue seems to be the
> prefered solution here.
> 
> While there is no wiphy work queue equivalent of delayed_work_pending(),
> I think using timer_pending(&link->color_collision_detect_work->timer)
> to replace delayed_work_pending(), even if the semantic is a bit
> different, would be ok to fulfill the rate limiting purpose. Having the
> same delayed_work_pending() semantics on wiphy work queue would require
> to take wiphy lock which seem a bit superfluous here.
> 
> Does that make sense ?

At a cost of a spin_trylock, we can also simply take the __ratelimit() option 
(or code an equivalent) and schedule a wiphy_work immediately.

The goal is just to rate limit the work (otherwise netlink is flooded and the 
system is hosed). queue_delayed_work() and delayed_work_pending() was just an 
easy way of implementing it since it had to be a work anyway.
Remi Pommarel Sept. 20, 2024, 12:07 p.m. UTC | #8
On Fri, Sep 20, 2024 at 12:00:08PM +0200, Johannes Berg wrote:
> On Fri, 2024-09-20 at 10:41 +0200, Remi Pommarel wrote:
> > > I think using timer_pending(&link->color_collision_detect_work->timer)
> > > to replace delayed_work_pending(), even if the semantic is a bit
> > > different, would be ok to fulfill the rate limiting purpose.
> 
> I think you're right. We could as well check list_empty() or so, but it
> wouldn't make a big difference. As you say:
> 
> > > Having the
> > > same delayed_work_pending() semantics on wiphy work queue would require
> > > to take wiphy lock
> 
> To really have it known _precisely_, that's true.
> 
> > >  which seem a bit superfluous here.
> 
> It's actually simply also not possible - if we could sleep in
> ieee80211_obss_color_collision_notify() and take the wiphy mutex, we'd
> probably have done this completely differently :)
> 
> And a hypothetical wiphy_delayed_work_pending() API should therefore not
> be required to be called with the wiphy mutex held.
> 
> I think that perhaps we should add such a trivial inline instead of
> open-coding the timer_pending() check, but I'm not sure whether or not
> it should also check the list (i.e. check if timer has expired, but work
> hasn't started yet): on the one hand it seems more appropriate, and if
> actually holding the wiphy mutex it would in fact be completely correct,
> on the other hand maybe it encourages being sloppily thinking the return
> value is always perfect?
> 
> Right now I'd tend to have the check and document that it's only
> guaranteed when under the wiphy mutex.

I had this exact train of thought and was replying that I did agree on
that, and then I checked the other *_pending semantics for which I tend
to forget the details. IIUC timer_pending() and work_pending() can both
return false if callback is still running (or even not yet started).
Thus the hypothetical wiphy_work_pending() semantics could allow to
implement it using timer_pending().

Adding a list_empty() check while making it more "precise" does not make
it "perfect" (even if there is no clear notion of what perfection should
be here) even with wiphy lock held. Indeed if wiphy_work_pending() is
called precisely after delayed work timer has cleared its pending state
but before the callback (i.e wiphy_delayed_work_timer()) has added the
work in the list both !timer_pending() and list_empty(work->entry) would
be true. So there is a small window where wiphy_work_pending() wouldn't
be more precise than just checking timer_pending() as show below:

      CPU0                               CPU1
 expire_timers
   detach_timer
                                    wiphy_work_pending() -> return false
   timer->function
     wiphy_work_queue
       list_add_tail()
                                    wiphy_work_pending() -> return false
Remi Pommarel Sept. 20, 2024, 12:10 p.m. UTC | #9
On Fri, Sep 20, 2024 at 02:07:19PM +0200, Remi Pommarel wrote:
> On Fri, Sep 20, 2024 at 12:00:08PM +0200, Johannes Berg wrote:
> > On Fri, 2024-09-20 at 10:41 +0200, Remi Pommarel wrote:
> > > > I think using timer_pending(&link->color_collision_detect_work->timer)
> > > > to replace delayed_work_pending(), even if the semantic is a bit
> > > > different, would be ok to fulfill the rate limiting purpose.
> > 
> > I think you're right. We could as well check list_empty() or so, but it
> > wouldn't make a big difference. As you say:
> > 
> > > > Having the
> > > > same delayed_work_pending() semantics on wiphy work queue would require
> > > > to take wiphy lock
> > 
> > To really have it known _precisely_, that's true.
> > 
> > > >  which seem a bit superfluous here.
> > 
> > It's actually simply also not possible - if we could sleep in
> > ieee80211_obss_color_collision_notify() and take the wiphy mutex, we'd
> > probably have done this completely differently :)
> > 
> > And a hypothetical wiphy_delayed_work_pending() API should therefore not
> > be required to be called with the wiphy mutex held.
> > 
> > I think that perhaps we should add such a trivial inline instead of
> > open-coding the timer_pending() check, but I'm not sure whether or not
> > it should also check the list (i.e. check if timer has expired, but work
> > hasn't started yet): on the one hand it seems more appropriate, and if
> > actually holding the wiphy mutex it would in fact be completely correct,
> > on the other hand maybe it encourages being sloppily thinking the return
> > value is always perfect?
> > 
> > Right now I'd tend to have the check and document that it's only
> > guaranteed when under the wiphy mutex.
> 
> I had this exact train of thought and was replying that I did agree on
> that, and then I checked the other *_pending semantics for which I tend
> to forget the details. IIUC timer_pending() and work_pending() can both
> return false if callback is still running (or even not yet started).
> Thus the hypothetical wiphy_work_pending() semantics could allow to
> implement it using timer_pending().
> 
> Adding a list_empty() check while making it more "precise" does not make
> it "perfect" (even if there is no clear notion of what perfection should
> be here) even with wiphy lock held. Indeed if wiphy_work_pending() is
> called precisely after delayed work timer has cleared its pending state
> but before the callback (i.e wiphy_delayed_work_timer()) has added the
> work in the list both !timer_pending() and list_empty(work->entry) would
> be true. So there is a small window where wiphy_work_pending() wouldn't
> be more precise than just checking timer_pending() as show below:
> 
>       CPU0                               CPU1
>  expire_timers
>    detach_timer
>                                     wiphy_work_pending() -> return false
>    timer->function
>      wiphy_work_queue
>        list_add_tail()
>                                     wiphy_work_pending() -> return false

I meant wiphy_work_pending() -> return true here ^.
Johannes Berg Sept. 20, 2024, 12:12 p.m. UTC | #10
On Fri, 2024-09-20 at 14:10 +0200, Remi Pommarel wrote:
> On Fri, Sep 20, 2024 at 02:07:19PM +0200, Remi Pommarel wrote:
> > On Fri, Sep 20, 2024 at 12:00:08PM +0200, Johannes Berg wrote:
> > > On Fri, 2024-09-20 at 10:41 +0200, Remi Pommarel wrote:
> > > > > I think using timer_pending(&link->color_collision_detect_work->timer)
> > > > > to replace delayed_work_pending(), even if the semantic is a bit
> > > > > different, would be ok to fulfill the rate limiting purpose.
> > > 
> > > I think you're right. We could as well check list_empty() or so, but it
> > > wouldn't make a big difference. As you say:
> > > 
> > > > > Having the
> > > > > same delayed_work_pending() semantics on wiphy work queue would require
> > > > > to take wiphy lock
> > > 
> > > To really have it known _precisely_, that's true.
> > > 
> > > > >  which seem a bit superfluous here.
> > > 
> > > It's actually simply also not possible - if we could sleep in
> > > ieee80211_obss_color_collision_notify() and take the wiphy mutex, we'd
> > > probably have done this completely differently :)
> > > 
> > > And a hypothetical wiphy_delayed_work_pending() API should therefore not
> > > be required to be called with the wiphy mutex held.
> > > 
> > > I think that perhaps we should add such a trivial inline instead of
> > > open-coding the timer_pending() check, but I'm not sure whether or not
> > > it should also check the list (i.e. check if timer has expired, but work
> > > hasn't started yet): on the one hand it seems more appropriate, and if
> > > actually holding the wiphy mutex it would in fact be completely correct,
> > > on the other hand maybe it encourages being sloppily thinking the return
> > > value is always perfect?
> > > 
> > > Right now I'd tend to have the check and document that it's only
> > > guaranteed when under the wiphy mutex.
> > 
> > I had this exact train of thought and was replying that I did agree on
> > that, and then I checked the other *_pending semantics for which I tend
> > to forget the details.

Haha, yes, no kidding.

> > IIUC timer_pending() and work_pending() can both
> > return false if callback is still running (or even not yet started).
> > Thus the hypothetical wiphy_work_pending() semantics could allow to
> > implement it using timer_pending().
> > 
> > Adding a list_empty() check while making it more "precise" does not make
> > it "perfect" (even if there is no clear notion of what perfection should
> > be here) even with wiphy lock held. Indeed if wiphy_work_pending() is
> > called precisely after delayed work timer has cleared its pending state
> > but before the callback (i.e wiphy_delayed_work_timer()) has added the
> > work in the list both !timer_pending() and list_empty(work->entry) would
> > be true. So there is a small window where wiphy_work_pending() wouldn't
> > be more precise than just checking timer_pending() as show below:
> > 
> >       CPU0                               CPU1
> >  expire_timers
> >    detach_timer
> >                                     wiphy_work_pending() -> return false
> >    timer->function
> >      wiphy_work_queue
> >        list_add_tail()
> >                                     wiphy_work_pending() -> return false
> 
> I meant wiphy_work_pending() -> return true here ^.
> 
Good point! The second call could even be before the list_add_tail and
still return false, so yeah, wiphy lock doesn't do anything.

But I guess we can live with both of these too, given sufficient
documentation :)

Agree also with Nicolas though that we could just ratelimit this
differently too, this was just a convenient way of achieving it with the
existing infrastructure.

johannes
Remi Pommarel Sept. 20, 2024, 1:37 p.m. UTC | #11
On Fri, Sep 20, 2024 at 02:12:23PM +0200, Johannes Berg wrote:
> On Fri, 2024-09-20 at 14:10 +0200, Remi Pommarel wrote:
> > On Fri, Sep 20, 2024 at 02:07:19PM +0200, Remi Pommarel wrote:
> > > On Fri, Sep 20, 2024 at 12:00:08PM +0200, Johannes Berg wrote:
> > > > On Fri, 2024-09-20 at 10:41 +0200, Remi Pommarel wrote:
> > > > > > I think using timer_pending(&link->color_collision_detect_work->timer)
> > > > > > to replace delayed_work_pending(), even if the semantic is a bit
> > > > > > different, would be ok to fulfill the rate limiting purpose.
> > > > 
> > > > I think you're right. We could as well check list_empty() or so, but it
> > > > wouldn't make a big difference. As you say:
> > > > 
> > > > > > Having the
> > > > > > same delayed_work_pending() semantics on wiphy work queue would require
> > > > > > to take wiphy lock
> > > > 
> > > > To really have it known _precisely_, that's true.
> > > > 
> > > > > >  which seem a bit superfluous here.
> > > > 
> > > > It's actually simply also not possible - if we could sleep in
> > > > ieee80211_obss_color_collision_notify() and take the wiphy mutex, we'd
> > > > probably have done this completely differently :)
> > > > 
> > > > And a hypothetical wiphy_delayed_work_pending() API should therefore not
> > > > be required to be called with the wiphy mutex held.
> > > > 
> > > > I think that perhaps we should add such a trivial inline instead of
> > > > open-coding the timer_pending() check, but I'm not sure whether or not
> > > > it should also check the list (i.e. check if timer has expired, but work
> > > > hasn't started yet): on the one hand it seems more appropriate, and if
> > > > actually holding the wiphy mutex it would in fact be completely correct,
> > > > on the other hand maybe it encourages being sloppily thinking the return
> > > > value is always perfect?
> > > > 
> > > > Right now I'd tend to have the check and document that it's only
> > > > guaranteed when under the wiphy mutex.
> > > 
> > > I had this exact train of thought and was replying that I did agree on
> > > that, and then I checked the other *_pending semantics for which I tend
> > > to forget the details.
> 
> Haha, yes, no kidding.
> 
> > > IIUC timer_pending() and work_pending() can both
> > > return false if callback is still running (or even not yet started).
> > > Thus the hypothetical wiphy_work_pending() semantics could allow to
> > > implement it using timer_pending().
> > > 
> > > Adding a list_empty() check while making it more "precise" does not make
> > > it "perfect" (even if there is no clear notion of what perfection should
> > > be here) even with wiphy lock held. Indeed if wiphy_work_pending() is
> > > called precisely after delayed work timer has cleared its pending state
> > > but before the callback (i.e wiphy_delayed_work_timer()) has added the
> > > work in the list both !timer_pending() and list_empty(work->entry) would
> > > be true. So there is a small window where wiphy_work_pending() wouldn't
> > > be more precise than just checking timer_pending() as show below:
> > > 
> > >       CPU0                               CPU1
> > >  expire_timers
> > >    detach_timer
> > >                                     wiphy_work_pending() -> return false
> > >    timer->function
> > >      wiphy_work_queue
> > >        list_add_tail()
> > >                                     wiphy_work_pending() -> return false
> > 
> > I meant wiphy_work_pending() -> return true here ^.
> > 
> Good point! The second call could even be before the list_add_tail and
> still return false, so yeah, wiphy lock doesn't do anything.
> 
> But I guess we can live with both of these too, given sufficient
> documentation :)

Also, sorry for being bothersome, wiphy_delayed_work_queue() is not a
genuine equivalent of queue_delayed_work() as the latter does not
requeue work if already scheduled while the former does so effectively
changing the delayed work deadline (very mod_delayed_work() alike).

If we wanted to fix that, ieee80211_obss_color_collision_notify() would
simply use wiphy_delayed_work_queue() directly. But other calls would
have to be rechecked and switched to use wiphy_delayed_work_mod() if
needed be.

> 
> Agree also with Nicolas though that we could just ratelimit this
> differently too, this was just a convenient way of achieving it with the
> existing infrastructure.

Yes that would work also but still need to convert into wiphy work the
color collision work to avoid deadlock.
diff mbox series

Patch

diff --git a/net/mac80211/cfg.c b/net/mac80211/cfg.c
index 847304a3a29a..481e1550cb21 100644
--- a/net/mac80211/cfg.c
+++ b/net/mac80211/cfg.c
@@ -4833,9 +4833,12 @@  void ieee80211_color_collision_detection_work(struct work_struct *work)
 		container_of(delayed_work, struct ieee80211_link_data,
 			     color_collision_detect_work);
 	struct ieee80211_sub_if_data *sdata = link->sdata;
+	struct ieee80211_local *local = sdata->local;
 
+	wiphy_lock(local->hw.wiphy);
 	cfg80211_obss_color_collision_notify(sdata->dev, link->color_bitmap,
 					     link->link_id);
+	wiphy_unlock(local->hw.wiphy);
 }
 
 void ieee80211_color_change_finish(struct ieee80211_vif *vif, u8 link_id)