[PATCHv2] wlcore: fix race for WL1271_FLAG_IRQ_RUNNING

Message ID 20191007172800.64249-1-tony@atomide.com (mailing list archive)
State New, archived

Commit Message

Tony Lindgren Oct. 7, 2019, 5:28 p.m. UTC
We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
calling it. And this will race against wlcore_runtime_resume() testing it.

Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
can rely on it. And let's remove the old comment about hardirq, which no
longer applies as we're using request_threaded_irq().

This fixes occasional annoying wlcore firmware reboots that start with
"wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
interrupt that won't be coming.

Note that I also suspect some form of this issue was the root cause why
the wlcore GPIO interrupt has been often configured as a level interrupt
instead of edge as an attempt to work around the ELP wake timeout errors.

Fixes: fa2648a34e73 ("wlcore: Add support for runtime PM")
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Eyal Reizer <eyalr@ti.com>
Cc: Guy Mishol <guym@ti.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
---

Changes since v1:

- Add locking around clear_bit like we do elsewhere in the driver

 drivers/net/wireless/ti/wlcore/main.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Tony Lindgren Oct. 8, 2019, 2:05 p.m. UTC | #1
* Tony Lindgren <tony@atomide.com> [191007 17:29]:
> We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
> for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
> gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
> calling it. And this will race against wlcore_runtime_resume() testing it.
> 
> Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
> can rely on it. And let's remove old comments about hardirq, that's no
> longer the case as we're using request_threaded_irq().
> 
> This fixes occasional annoying wlcore firmware reboots that start with
> "wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
> when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
> interrupt that won't be coming.
> 
> Note that I also suspect some form of this issue was the root cause why
> the wlcore GPIO interrupt has been often configured as a level interrupt
> instead of edge as an attempt to work around the ELP wake timeout errors.

So this fixed a reproducible test case where loading some webpages
often produced ELP timeout errors. But it looks like I'm still seeing
ELP timeouts elsewhere. So best to wait on this one. Something is still
wrong with the ELP timeout handling.

Regards,

Tony

> Fixes: fa2648a34e73 ("wlcore: Add support for runtime PM")
> Cc: Anders Roxell <anders.roxell@linaro.org>
> Cc: Eyal Reizer <eyalr@ti.com>
> Cc: Guy Mishol <guym@ti.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Signed-off-by: Tony Lindgren <tony@atomide.com>
> ---
> 
> Changes since v1:
> 
> - Add locking around clear_bit like we do elsewhere in the driver
> 
>  drivers/net/wireless/ti/wlcore/main.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/wireless/ti/wlcore/main.c b/drivers/net/wireless/ti/wlcore/main.c
> --- a/drivers/net/wireless/ti/wlcore/main.c
> +++ b/drivers/net/wireless/ti/wlcore/main.c
> @@ -544,11 +544,6 @@ static int wlcore_irq_locked(struct wl1271 *wl)
>  	}
>  
>  	while (!done && loopcount--) {
> -		/*
> -		 * In order to avoid a race with the hardirq, clear the flag
> -		 * before acknowledging the chip.
> -		 */
> -		clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
>  		smp_mb__after_atomic();
>  
>  		ret = wlcore_fw_status(wl, wl->fw_status);
> @@ -668,7 +663,7 @@ static irqreturn_t wlcore_irq(int irq, void *cookie)
>  		disable_irq_nosync(wl->irq);
>  		pm_wakeup_event(wl->dev, 0);
>  		spin_unlock_irqrestore(&wl->wl_lock, flags);
> -		return IRQ_HANDLED;
> +		goto out_handled;
>  	}
>  	spin_unlock_irqrestore(&wl->wl_lock, flags);
>  
> @@ -692,6 +687,11 @@ static irqreturn_t wlcore_irq(int irq, void *cookie)
>  
>  	mutex_unlock(&wl->mutex);
>  
> +out_handled:
> +	spin_lock_irqsave(&wl->wl_lock, flags);
> +	clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
> +	spin_unlock_irqrestore(&wl->wl_lock, flags);
> +
>  	return IRQ_HANDLED;
>  }
>  
> -- 
> 2.23.0

Kalle Valo Oct. 8, 2019, 2:16 p.m. UTC | #2
Tony Lindgren <tony@atomide.com> writes:

> * Tony Lindgren <tony@atomide.com> [191007 17:29]:
>> We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
>> for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
>> gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
>> calling it. And this will race against wlcore_runtime_resume() testing it.
>> 
>> Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
>> can rely on it. And let's remove old comments about hardirq, that's no
>> longer the case as we're using request_threaded_irq().
>> 
>> This fixes occasional annoying wlcore firmware reboots that start with
>> "wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
>> when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
>> interrupt that won't be coming.
>> 
>> Note that I also suspect some form of this issue was the root cause why
>> the wlcore GPIO interrupt has been often configured as a level interrupt
>> instead of edge as an attempt to work around the ELP wake timeout errors.
>
> So this fixed a reproducible test case where loading some webpages
> often produced ELP timeout errors. But looks like I'm still seeing ELP
> timeouts elsewhere. So best to wait on this one. Something is still
> wrong with the ELP timeout handling.

Ok, I'll drop this then. Please send v3 once you think the patch is
ready to be applied.

Tony Lindgren Oct. 9, 2019, 4:42 p.m. UTC | #3
* Kalle Valo <kvalo@codeaurora.org> [191008 14:17]:
> Tony Lindgren <tony@atomide.com> writes:
> 
> > * Tony Lindgren <tony@atomide.com> [191007 17:29]:
> >> We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
> >> for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
> >> gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
> >> calling it. And this will race against wlcore_runtime_resume() testing it.
> >> 
> >> Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
> >> can rely on it. And let's remove old comments about hardirq, that's no
> >> longer the case as we're using request_threaded_irq().
> >> 
> >> This fixes occasional annoying wlcore firmware reboots that start with
> >> "wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
> >> when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
> >> interrupt that won't be coming.
> >> 
> >> Note that I also suspect some form of this issue was the root cause why
> >> the wlcore GPIO interrupt has been often configured as a level interrupt
> >> instead of edge as an attempt to work around the ELP wake timeout errors.
> >
> > So this fixed a reproducible test case where loading some webpages
> > often produced ELP timeout errors. But looks like I'm still seeing ELP
> > timeouts elsewhere. So best to wait on this one. Something is still
> > wrong with the ELP timeout handling.
> 
> Ok, I'll drop this then. Please send v3 once you think the patch is
> ready to be applied.

Looks like the real fix is to use a level interrupt instead of an edge
interrupt for omap4 and 5 to avoid the check for untriggered interrupts
in omap_gpio_unidle(). This should not be needed for other SoCs as their
l4per can't idle independently of the CPUs.

I'll send a separate patch for that. And I'll send an updated clean-up
patch for the $subject patch, as the race described above should never
happen.

The clearing of the WL1271_FLAG_IRQ_RUNNING bit already happens within the
pm_runtime_get_sync() section of wlcore_irq_locked(). So this patch just
happened to slightly change the timings for my reproducible test case.
We should not be able to hit the race described above even with super
short autosuspend timeouts between wlcore_irq_locked() and the end of
wlcore_irq() :)

Regards,

Tony


> -- 
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
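
The ordering argument in comment #3 can be pictured with a small userspace
model. This is only a sketch under stated assumptions: struct pm_model and
every model_* function are hypothetical names invented here, not the kernel's
runtime PM API, and the model is single-threaded, so it shows ordering only
and not real concurrency. It assumes the resume callback samples the flag,
that a synchronous "get" returns only after resume has finished, and that
the device cannot suspend again until the reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; names are illustrative, not the runtime PM API. */
struct pm_model {
	int usage;		/* models the runtime PM usage count */
	bool suspended;		/* models the device power state */
	bool irq_running;	/* models WL1271_FLAG_IRQ_RUNNING */
	int resume_calls;	/* how often the resume callback ran */
	int resume_saw_flag;	/* resume calls that observed the flag set */
};

/* Models the resume callback sampling the flag. */
static void model_runtime_resume(struct pm_model *m)
{
	m->resume_calls++;
	if (m->irq_running)
		m->resume_saw_flag++;
	m->suspended = false;
}

/* Models a synchronous get: resume has completed before this returns. */
static void model_get_sync(struct pm_model *m)
{
	m->usage++;
	if (m->suspended)
		model_runtime_resume(m);
}

/* Models dropping the reference; suspending at once stands in for the
 * shortest possible autosuspend timeout. */
static void model_put(struct pm_model *m)
{
	assert(m->usage > 0);
	if (--m->usage == 0)
		m->suspended = true;
}

/* Models the unpatched wlcore_irq_locked(): the clear sits between the
 * get and the put, so no resume can run while the flag is cleared. */
static void model_irq_locked(struct pm_model *m)
{
	model_get_sync(m);
	assert(!m->suspended);	/* held active until model_put() */
	m->irq_running = false;
	model_put(m);
}

/* Models wlcore_irq(): set the flag, then run the locked section. */
static void model_irq(struct pm_model *m)
{
	m->irq_running = true;
	model_irq_locked(m);
}
```

In this model every resume triggered from the handler runs before the clear,
so it always observes the flag set, which matches the claim that the clear
inside the get/put section cannot race with the resume-time test.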

Patch

diff --git a/drivers/net/wireless/ti/wlcore/main.c b/drivers/net/wireless/ti/wlcore/main.c
--- a/drivers/net/wireless/ti/wlcore/main.c
+++ b/drivers/net/wireless/ti/wlcore/main.c
@@ -544,11 +544,6 @@  static int wlcore_irq_locked(struct wl1271 *wl)
 	}
 
 	while (!done && loopcount--) {
-		/*
-		 * In order to avoid a race with the hardirq, clear the flag
-		 * before acknowledging the chip.
-		 */
-		clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
 		smp_mb__after_atomic();
 
 		ret = wlcore_fw_status(wl, wl->fw_status);
@@ -668,7 +663,7 @@  static irqreturn_t wlcore_irq(int irq, void *cookie)
 		disable_irq_nosync(wl->irq);
 		pm_wakeup_event(wl->dev, 0);
 		spin_unlock_irqrestore(&wl->wl_lock, flags);
-		return IRQ_HANDLED;
+		goto out_handled;
 	}
 	spin_unlock_irqrestore(&wl->wl_lock, flags);
 
@@ -692,6 +687,11 @@  static irqreturn_t wlcore_irq(int irq, void *cookie)
 
 	mutex_unlock(&wl->mutex);
 
+out_handled:
+	spin_lock_irqsave(&wl->wl_lock, flags);
+	clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
+	spin_unlock_irqrestore(&wl->wl_lock, flags);
+
 	return IRQ_HANDLED;
 }
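
For readers who want to see the control flow the patch aims for outside the
driver, here is a minimal userspace sketch. It assumes, as the commit message
states, that the flag is set at the start of the handler; struct model_dev and
the model_* helpers are hypothetical stand-ins invented for illustration, and
the "lock" is just a boolean that checks balanced lock/unlock calls, not a
real spinlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct wl1271; not the driver's real layout. */
struct model_dev {
	bool locked;		/* plays the role of wl->wl_lock */
	bool irq_running;	/* plays the role of WL1271_FLAG_IRQ_RUNNING */
	bool suspended;		/* selects the early-exit path */
	int work_runs;		/* how many times the work section ran */
};

static void model_init(struct model_dev *d, bool suspended)
{
	d->locked = false;
	d->irq_running = false;
	d->suspended = suspended;
	d->work_runs = 0;
}

/* Toy lock helpers: they only verify that lock/unlock calls are paired. */
static void model_lock(struct model_dev *d)
{
	assert(!d->locked);
	d->locked = true;
}

static void model_unlock(struct model_dev *d)
{
	assert(d->locked);
	d->locked = false;
}

/* Stand-in for wlcore_irq_locked(): with the patch applied it no longer
 * clears the flag itself. */
static void model_irq_work(struct model_dev *d)
{
	assert(d->irq_running);	/* flag is still set while work runs */
	d->work_runs++;
}

/* Stand-in for wlcore_irq(): the flag covers the whole handler and is
 * cleared at a single exit label, including on the early-exit path. */
static int model_irq(struct model_dev *d)
{
	d->irq_running = true;

	model_lock(d);
	if (d->suspended) {
		model_unlock(d);
		goto out_handled;
	}
	model_unlock(d);

	model_irq_work(d);

out_handled:
	model_lock(d);
	d->irq_running = false;
	model_unlock(d);

	return 1;		/* stands in for IRQ_HANDLED */
}
```

Calling model_irq() on a device initialized with suspended set to true skips
the work section but still leaves the flag cleared, which is the property the
goto out_handled change is meant to provide.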