Message ID | 20191007172800.64249-1-tony@atomide.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Kalle Valo |
Series | [PATCHv2] wlcore: fix race for WL1271_FLAG_IRQ_RUNNING |
* Tony Lindgren <tony@atomide.com> [191007 17:29]:
> We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
> for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
> gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
> calling it. And this will race against wlcore_runtime_resume() testing it.
>
> Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
> can rely on it. And let's remove old comments about hardirq, that's no
> longer the case as we're using request_threaded_irq().
>
> This fixes occasional annoying wlcore firmware reboots that start with
> "wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
> when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
> interrupt that won't be coming.
>
> Note that I also suspect some form of this issue was the root cause why
> the wlcore GPIO interrupt has been often configured as a level interrupt
> instead of edge as an attempt to work around the ELP wake timeout errors.

So this fixed a reproducible test case where loading some webpages
often produced ELP timeout errors. But looks like I'm still seeing ELP
timeouts elsewhere. So best to wait on this one. Something is still
wrong with the ELP timeout handling.

Regards,

Tony

> Fixes: fa2648a34e73 ("wlcore: Add support for runtime PM")
> Cc: Anders Roxell <anders.roxell@linaro.org>
> Cc: Eyal Reizer <eyalr@ti.com>
> Cc: Guy Mishol <guym@ti.com>
> Cc: John Stultz <john.stultz@linaro.org>
> Cc: Ulf Hansson <ulf.hansson@linaro.org>
> Signed-off-by: Tony Lindgren <tony@atomide.com>
> ---
>
> Changes since v1:
>
> - Add locking around clear_bit like we do elsewhere in the driver
>
>  drivers/net/wireless/ti/wlcore/main.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/wireless/ti/wlcore/main.c b/drivers/net/wireless/ti/wlcore/main.c
> --- a/drivers/net/wireless/ti/wlcore/main.c
> +++ b/drivers/net/wireless/ti/wlcore/main.c
> @@ -544,11 +544,6 @@ static int wlcore_irq_locked(struct wl1271 *wl)
>  	}
>
>  	while (!done && loopcount--) {
> -		/*
> -		 * In order to avoid a race with the hardirq, clear the flag
> -		 * before acknowledging the chip.
> -		 */
> -		clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
>  		smp_mb__after_atomic();
>
>  		ret = wlcore_fw_status(wl, wl->fw_status);
> @@ -668,7 +663,7 @@ static irqreturn_t wlcore_irq(int irq, void *cookie)
>  		disable_irq_nosync(wl->irq);
>  		pm_wakeup_event(wl->dev, 0);
>  		spin_unlock_irqrestore(&wl->wl_lock, flags);
> -		return IRQ_HANDLED;
> +		goto out_handled;
>  	}
>  	spin_unlock_irqrestore(&wl->wl_lock, flags);
>
> @@ -692,6 +687,11 @@ static irqreturn_t wlcore_irq(int irq, void *cookie)
>
>  	mutex_unlock(&wl->mutex);
>
> +out_handled:
> +	spin_lock_irqsave(&wl->wl_lock, flags);
> +	clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
> +	spin_unlock_irqrestore(&wl->wl_lock, flags);
> +
>  	return IRQ_HANDLED;
>  }
>
> --
> 2.23.0
Tony Lindgren <tony@atomide.com> writes:

> * Tony Lindgren <tony@atomide.com> [191007 17:29]:
>> We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
>> for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
>> gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
>> calling it. And this will race against wlcore_runtime_resume() testing it.
>>
>> Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
>> can rely on it. And let's remove old comments about hardirq, that's no
>> longer the case as we're using request_threaded_irq().
>>
>> This fixes occasional annoying wlcore firmware reboots that start with
>> "wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
>> when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
>> interrupt that won't be coming.
>>
>> Note that I also suspect some form of this issue was the root cause why
>> the wlcore GPIO interrupt has been often configured as a level interrupt
>> instead of edge as an attempt to work around the ELP wake timeout errors.
>
> So this fixed a reproducible test case where loading some webpages
> often produced ELP timeout errors. But looks like I'm still seeing ELP
> timeouts elsewhere. So best to wait on this one. Something is still
> wrong with the ELP timeout handling.

Ok, I'll drop this then. Please send v3 once you think the patch is
ready to be applied.
* Kalle Valo <kvalo@codeaurora.org> [191008 14:17]:
> Tony Lindgren <tony@atomide.com> writes:
>
> > * Tony Lindgren <tony@atomide.com> [191007 17:29]:
> >> We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
> >> for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
> >> gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
> >> calling it. And this will race against wlcore_runtime_resume() testing it.
> >>
> >> Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
> >> can rely on it. And let's remove old comments about hardirq, that's no
> >> longer the case as we're using request_threaded_irq().
> >>
> >> This fixes occasional annoying wlcore firmware reboots that start with
> >> "wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
> >> when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
> >> interrupt that won't be coming.
> >>
> >> Note that I also suspect some form of this issue was the root cause why
> >> the wlcore GPIO interrupt has been often configured as a level interrupt
> >> instead of edge as an attempt to work around the ELP wake timeout errors.
> >
> > So this fixed a reproducible test case where loading some webpages
> > often produced ELP timeout errors. But looks like I'm still seeing ELP
> > timeouts elsewhere. So best to wait on this one. Something is still
> > wrong with the ELP timeout handling.
>
> Ok, I'll drop this then. Please send v3 once you think the patch is
> ready to be applied.

Looks like the real fix is to use level instead of edge interrupt
for omap4 and 5 to avoid the check for untriggered interrupts in
omap_gpio_unidle(). Should not be needed for other SoCs as their
l4per can't idle independent of the CPUs. I'll send a separate
patch for that.

And I'll send an updated clean-up patch for $subject patch as
the race described above should never happen.

The clearing of WL1271_FLAG_IRQ_RUNNING bit happens already within
the pm_runtime_get_sync() section of wlcore_irq_locked(). So this patch
just happened to slightly change the timings for my reproducible
test case. We should not be able to hit the race described above even
with super short autosuspend timeouts between wlcore_irq_locked() and
the end of wlcore_irq() :)

Regards,

Tony

> --
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
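
Editor's sketch of the argument in the reply above: a rough user-space model, assuming the usual runtime PM rule that a device with an outstanding usage reference is not runtime suspended, so its resume callback cannot run again until the reference is dropped. All names here (model_pm, model_get_sync(), model_put(), model_try_suspend()) are hypothetical stand-ins and not kernel API; autosuspend timers and asynchronous requests are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of runtime PM state; not the kernel's own bookkeeping. */
struct model_pm {
	int usage;        /* outstanding references taken by get_sync */
	bool suspended;   /* device is runtime suspended */
	int resume_calls; /* how often the resume callback would have run */
};

/* Models pm_runtime_get_sync(): take a reference, resume only if suspended. */
static void model_get_sync(struct model_pm *pm)
{
	assert(pm != NULL);
	pm->usage++;
	if (pm->suspended) {
		pm->suspended = false;
		pm->resume_calls++; /* the runtime resume callback would run here */
	}
}

/* Models dropping the reference; the device only becomes eligible to suspend. */
static void model_put(struct model_pm *pm)
{
	assert(pm != NULL && pm->usage > 0);
	pm->usage--;
}

/* Models a suspend attempt; refused while any reference is still held. */
static bool model_try_suspend(struct model_pm *pm)
{
	assert(pm != NULL);
	if (pm->usage > 0 || pm->suspended)
		return false;
	pm->suspended = true;
	return true;
}
```

In this model, anything done between model_get_sync() and model_put() runs with the device held active, so no second resume can be triggered at that point. That is the property the reply relies on for the flag clearing inside wlcore_irq_locked().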
diff --git a/drivers/net/wireless/ti/wlcore/main.c b/drivers/net/wireless/ti/wlcore/main.c
--- a/drivers/net/wireless/ti/wlcore/main.c
+++ b/drivers/net/wireless/ti/wlcore/main.c
@@ -544,11 +544,6 @@ static int wlcore_irq_locked(struct wl1271 *wl)
 	}

 	while (!done && loopcount--) {
-		/*
-		 * In order to avoid a race with the hardirq, clear the flag
-		 * before acknowledging the chip.
-		 */
-		clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
 		smp_mb__after_atomic();

 		ret = wlcore_fw_status(wl, wl->fw_status);
@@ -668,7 +663,7 @@ static irqreturn_t wlcore_irq(int irq, void *cookie)
 		disable_irq_nosync(wl->irq);
 		pm_wakeup_event(wl->dev, 0);
 		spin_unlock_irqrestore(&wl->wl_lock, flags);
-		return IRQ_HANDLED;
+		goto out_handled;
 	}
 	spin_unlock_irqrestore(&wl->wl_lock, flags);

@@ -692,6 +687,11 @@ static irqreturn_t wlcore_irq(int irq, void *cookie)

 	mutex_unlock(&wl->mutex);

+out_handled:
+	spin_lock_irqsave(&wl->wl_lock, flags);
+	clear_bit(WL1271_FLAG_IRQ_RUNNING, &wl->flags);
+	spin_unlock_irqrestore(&wl->wl_lock, flags);
+
 	return IRQ_HANDLED;
 }
We set WL1271_FLAG_IRQ_RUNNING in the beginning of wlcore_irq(), and test
for it in wlcore_runtime_resume(). But WL1271_FLAG_IRQ_RUNNING currently
gets cleared too early by wlcore_irq_locked() before wlcore_irq() is done
calling it. And this will race against wlcore_runtime_resume() testing it.

Let's set and clear IRQ_RUNNING in wlcore_irq() so wlcore_runtime_resume()
can rely on it. And let's remove old comments about hardirq, that's no
longer the case as we're using request_threaded_irq().

This fixes occasional annoying wlcore firmware reboots that start with
"wlcore: WARNING ELP wakeup timeout!" followed by a multisecond latency
when the wlcore firmware gets wrongly rebooted waiting for an ELP wake
interrupt that won't be coming.

Note that I also suspect some form of this issue was the root cause why
the wlcore GPIO interrupt has been often configured as a level interrupt
instead of edge as an attempt to work around the ELP wake timeout errors.

Fixes: fa2648a34e73 ("wlcore: Add support for runtime PM")
Cc: Anders Roxell <anders.roxell@linaro.org>
Cc: Eyal Reizer <eyalr@ti.com>
Cc: Guy Mishol <guym@ti.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
---

Changes since v1:

- Add locking around clear_bit like we do elsewhere in the driver

 drivers/net/wireless/ti/wlcore/main.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
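
Editor's sketch of the flag lifetime this commit message describes: a simplified, single-threaded user-space model, not driver code. The struct and function names (model_dev, model_irq_locked(), model_irq()) and the clear_early switch are hypothetical. The seen_by_probe field stands in for what a concurrent reader such as wlcore_runtime_resume() would observe if it sampled the flag after the locked section but before the handler returns. Note that the thread above concludes this window should not be reachable in practice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the driver state; not the real struct wl1271. */
struct model_dev {
	bool irq_running;   /* stands in for WL1271_FLAG_IRQ_RUNNING */
	bool suspended;     /* stands in for WL1271_FLAG_SUSPENDED */
	bool clear_early;   /* true: pre-patch behaviour, false: patched */
	bool seen_by_probe; /* flag value sampled late in the handler */
};

/* Stands in for wlcore_irq_locked(). */
static void model_irq_locked(struct model_dev *dev)
{
	if (dev->clear_early)
		dev->irq_running = false; /* pre-patch: cleared inside the loop */
	/* firmware status handling would happen here */
}

/* Stands in for wlcore_irq(). */
static int model_irq(struct model_dev *dev)
{
	assert(dev != NULL);

	dev->irq_running = true; /* set on entry, as in the driver */
	if (dev->suspended)
		goto out_handled;

	model_irq_locked(dev);

	/* Sample the flag while the handler is still running. */
	dev->seen_by_probe = dev->irq_running;

out_handled:
	if (!dev->clear_early)
		dev->irq_running = false; /* patched: cleared on every exit path */
	return 1;
}
```

With clear_early set, the probe sees the flag already cleared while the handler is still running; with it unset, the flag stays set until the single exit label, which is also reached from the suspended early-return path.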