Message ID | 20240930122924.21865-1-ville.syrjala@linux.intel.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Kalle Valo |
Headers | show |
Series | iwlegacy: Clear stale interrupts before enabling interrupts | expand |
Ville Syrjala <ville.syrjala@linux.intel.com> writes: > iwl4965 fails upon resume from hibernation on my laptop. The reason > seems to be a stale interrupt which isn't being cleared out before > interrupts are enabled. Is this a regression? Do you know what version still worked?
On Tue, Oct 01, 2024 at 10:03:36AM +0300, Kalle Valo wrote: > Ville Syrjala <ville.syrjala@linux.intel.com> writes: > > > iwl4965 fails upon resume from hibernation on my laptop. The reason > > seems to be a stale interrupt which isn't being cleared out before > > interrupts are enabled. > > Is this a regression? Do you know what version still worked? Looks like the oldest kernel I have around on that machine is 5.0, and a quick test says it's broken exactly in the same way. So if it's a regression then it's an old one.
Hi On Mon, Sep 30, 2024 at 03:29:24PM +0300, Ville Syrjala wrote: > From: Ville Syrjälä <ville.syrjala@linux.intel.com> > > iwl4965 fails upon resume from hibernation on my laptop. The reason > seems to be a stale interrupt which isn't being cleared out before > interrupts are enabled. We end up with a race beween the resume > trying to bring things back up, and the restart work (queued form > the interrupt handler) trying to bring things down. Eventually > the whole thing blows up. > > Fix the problem by clearing out any stale interrupts before > interrupts get enabled. > > Here's a debug log of the indicent: > [ 12.042589] ieee80211 phy0: il_isr ISR inta 0x00000080, enabled 0xaa00008b, fh 0x00000000 > [ 12.042625] ieee80211 phy0: il4965_irq_tasklet inta 0x00000080, enabled 0x00000000, fh 0x00000000 > [ 12.042651] iwl4965 0000:10:00.0: RF_KILL bit toggled to enable radio. > [ 12.042653] iwl4965 0000:10:00.0: On demand firmware reload <snip> > [ 12.052207] ieee80211 phy0: il4965_mac_start enter > [ 12.052212] ieee80211 phy0: il_prep_station Add STA to driver ID 31: ff:ff:ff:ff:ff:ff > [ 12.052244] ieee80211 phy0: il4965_set_hw_ready hardware ready > [ 12.052324] ieee80211 phy0: il_apm_init Init card's basic functions > [ 12.052348] ieee80211 phy0: il_apm_init L1 Enabled; Disabling L0S > [ 12.055727] ieee80211 phy0: il4965_load_bsm Begin load bsm > [ 12.056140] ieee80211 phy0: il4965_verify_bsm Begin verify bsm > [ 12.058642] ieee80211 phy0: il4965_verify_bsm BSM bootstrap uCode image OK > [ 12.058721] ieee80211 phy0: il4965_load_bsm BSM write complete, poll 1 iterations > [ 12.058734] ieee80211 phy0: __il4965_up iwl4965 is coming up > [ 12.058737] ieee80211 phy0: il4965_mac_start Start UP work done. > [ 12.058757] ieee80211 phy0: __il4965_down iwl4965 is going down > [ 12.058761] ieee80211 phy0: il_scan_cancel_timeout Scan cancel timeout > [ 12.058762] ieee80211 phy0: il_do_scan_abort Not performing scan to abort > [ 12.058765] ieee80211 phy0: il_clear_ucode_stations Clearing ucode stations in driver > [ 12.058767] ieee80211 phy0: il_clear_ucode_stations No active stations found to be cleared > [ 12.058819] ieee80211 phy0: _il_apm_stop Stop card, put in low power state > [ 12.058827] ieee80211 phy0: _il_apm_stop_master stop master > [ 12.058864] ieee80211 phy0: il4965_clear_free_frames 0 frames on pre-allocated heap on clear. > [ 12.058869] ieee80211 phy0: Hardware restart was requested > [ 16.132299] iwl4965 0000:10:00.0: START_ALIVE timeout after 4000ms. > [ 16.132303] ------------[ cut here ]------------ > [ 16.132304] Hardware became unavailable upon resume. This could be a software issue prior to suspend or a hardware issue. <snip> > drivers/net/wireless/intel/iwlegacy/common.h | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/net/wireless/intel/iwlegacy/common.h b/drivers/net/wireless/intel/iwlegacy/common.h > index 2147781b5fff..758984d527bf 100644 > --- a/drivers/net/wireless/intel/iwlegacy/common.h > +++ b/drivers/net/wireless/intel/iwlegacy/common.h > @@ -2342,6 +2342,8 @@ static inline void > il_enable_interrupts(struct il_priv *il) > { > set_bit(S_INT_ENABLED, &il->status); > + _il_wr(il, CSR_INT, 0xffffffff); > + _il_wr(il, CSR_FH_INT_STATUS, 0xffffffff); > _il_wr(il, CSR_INT_MASK, il->inta_mask); > } RF_KILL is CSR_INT and we already acknowledged CSR_INT before il_enable_interrupts() in il4965_mac_start() -> __il4965_up()): /* make sure rfkill handshake bits are cleared */ _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_DRV_GP1_BIT_CMD_BLOCKED); /* clear (again), then enable host interrupts */ _il_wr(il, CSR_INT, 0xFFFFFFFF); il_enable_interrupts(il); /* really make sure rfkill handshake bits are cleared */ _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); So the only explanation I can see the patch help with the problem is additional delay between first and second rfkill handshake bits clearing (which BTW looks fishy, since is done 2 times, and seems in the second line in the second clearing bit CSR_UCODE_DRV_GP1_BIT_CMD_BLOCKED should be used). I suspect the real problem is that we do enable rfkill interrupt by il_enable_rfkill_int() on il4965_mac_stop(). Since we want to know RF KILL state regardless interface is up or down. But then the interrupt is enabled during suspend period as well. Probably better fix would be add il_disable_interrupts() to il_pci_suspend(). Would you like to check that? If not, patch is ok for me, if it fixes the issue in practice. Acked-by: Stanislaw Gruszka <stf_xl@wp.pl>
On Tue, Oct 01, 2024 at 08:18:16PM +0200, Stanislaw Gruszka wrote: > Hi > > On Mon, Sep 30, 2024 at 03:29:24PM +0300, Ville Syrjala wrote: > > From: Ville Syrjälä <ville.syrjala@linux.intel.com> > > > > iwl4965 fails upon resume from hibernation on my laptop. The reason > > seems to be a stale interrupt which isn't being cleared out before > > interrupts are enabled. We end up with a race beween the resume > > trying to bring things back up, and the restart work (queued form > > the interrupt handler) trying to bring things down. Eventually > > the whole thing blows up. > > > > Fix the problem by clearing out any stale interrupts before > > interrupts get enabled. > > > > Here's a debug log of the indicent: > > [ 12.042589] ieee80211 phy0: il_isr ISR inta 0x00000080, enabled 0xaa00008b, fh 0x00000000 > > [ 12.042625] ieee80211 phy0: il4965_irq_tasklet inta 0x00000080, enabled 0x00000000, fh 0x00000000 > > [ 12.042651] iwl4965 0000:10:00.0: RF_KILL bit toggled to enable radio. > > [ 12.042653] iwl4965 0000:10:00.0: On demand firmware reload > <snip> > > [ 12.052207] ieee80211 phy0: il4965_mac_start enter > > [ 12.052212] ieee80211 phy0: il_prep_station Add STA to driver ID 31: ff:ff:ff:ff:ff:ff > > [ 12.052244] ieee80211 phy0: il4965_set_hw_ready hardware ready > > [ 12.052324] ieee80211 phy0: il_apm_init Init card's basic functions > > [ 12.052348] ieee80211 phy0: il_apm_init L1 Enabled; Disabling L0S > > [ 12.055727] ieee80211 phy0: il4965_load_bsm Begin load bsm > > [ 12.056140] ieee80211 phy0: il4965_verify_bsm Begin verify bsm > > [ 12.058642] ieee80211 phy0: il4965_verify_bsm BSM bootstrap uCode image OK > > [ 12.058721] ieee80211 phy0: il4965_load_bsm BSM write complete, poll 1 iterations > > [ 12.058734] ieee80211 phy0: __il4965_up iwl4965 is coming up > > [ 12.058737] ieee80211 phy0: il4965_mac_start Start UP work done. > > [ 12.058757] ieee80211 phy0: __il4965_down iwl4965 is going down > > [ 12.058761] ieee80211 phy0: il_scan_cancel_timeout Scan cancel timeout > > [ 12.058762] ieee80211 phy0: il_do_scan_abort Not performing scan to abort > > [ 12.058765] ieee80211 phy0: il_clear_ucode_stations Clearing ucode stations in driver > > [ 12.058767] ieee80211 phy0: il_clear_ucode_stations No active stations found to be cleared > > [ 12.058819] ieee80211 phy0: _il_apm_stop Stop card, put in low power state > > [ 12.058827] ieee80211 phy0: _il_apm_stop_master stop master > > [ 12.058864] ieee80211 phy0: il4965_clear_free_frames 0 frames on pre-allocated heap on clear. > > [ 12.058869] ieee80211 phy0: Hardware restart was requested > > [ 16.132299] iwl4965 0000:10:00.0: START_ALIVE timeout after 4000ms. > > [ 16.132303] ------------[ cut here ]------------ > > [ 16.132304] Hardware became unavailable upon resume. This could be a software issue prior to suspend or a hardware issue. > <snip> > > drivers/net/wireless/intel/iwlegacy/common.h | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/net/wireless/intel/iwlegacy/common.h b/drivers/net/wireless/intel/iwlegacy/common.h > > index 2147781b5fff..758984d527bf 100644 > > --- a/drivers/net/wireless/intel/iwlegacy/common.h > > +++ b/drivers/net/wireless/intel/iwlegacy/common.h > > @@ -2342,6 +2342,8 @@ static inline void > > il_enable_interrupts(struct il_priv *il) > > { > > set_bit(S_INT_ENABLED, &il->status); > > + _il_wr(il, CSR_INT, 0xffffffff); > > + _il_wr(il, CSR_FH_INT_STATUS, 0xffffffff); > > _il_wr(il, CSR_INT_MASK, il->inta_mask); > > } > > RF_KILL is CSR_INT and we already acknowledged CSR_INT before > il_enable_interrupts() in il4965_mac_start() -> __il4965_up()): > > /* make sure rfkill handshake bits are cleared */ > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_DRV_GP1_BIT_CMD_BLOCKED); > > /* clear (again), then enable host interrupts */ > _il_wr(il, CSR_INT, 0xFFFFFFFF); > il_enable_interrupts(il); > > /* really make sure rfkill handshake bits are cleared */ > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); > > So the only explanation I can see the patch help with the problem > is additional delay between first and second rfkill handshake bits > clearing (which BTW looks fishy, since is done 2 times, > and seems in the second line in the second clearing bit > CSR_UCODE_DRV_GP1_BIT_CMD_BLOCKED should be used). > > I suspect the real problem is that we do enable rfkill > interrupt by il_enable_rfkill_int() on il4965_mac_stop(). > Since we want to know RF KILL state regardless interface > is up or down. But then the interrupt is enabled during > suspend period as well. > > Probably better fix would be add il_disable_interrupts() > to il_pci_suspend(). Would you like to check that? That doesn't work, which doesn't surprise me since the state of the device in .suspend() has no bearing on the state of the device in .resume() when we're talking about hibernation. Hmm. I suppose we might want to minimize the potential changes to any runtime behaviour in which case we could do this instead: diff --git a/drivers/net/wireless/intel/iwlegacy/common.c b/drivers/net/wireless/intel/iwlegacy/common.c index 9d33a66a49b5..7f58e31d23fe 100644 --- a/drivers/net/wireless/intel/iwlegacy/common.c +++ b/drivers/net/wireless/intel/iwlegacy/common.c @@ -4962,6 +4962,8 @@ il_pci_resume(struct device *device) */ pci_write_config_byte(pdev, PCI_CFG_RETRY_TIMEOUT, 0x00); + _il_wr(il, CSR_INT, 0xffffffff); + _il_wr(il, CSR_FH_INT_STATUS, 0xffffffff); il_enable_interrupts(il); if (!(_il_rd(il, CSR_GP_CNTRL) & CSR_GP_CNTRL_REG_FLAG_HW_RF_KILL_SW)) which does work. > > If not, patch is ok for me, if it fixes the issue in practice. > > Acked-by: Stanislaw Gruszka <stf_xl@wp.pl>
On Tue, Oct 01, 2024 at 10:22:32PM +0300, Ville Syrjälä wrote: > On Tue, Oct 01, 2024 at 08:18:16PM +0200, Stanislaw Gruszka wrote: > > Hi > > > > On Mon, Sep 30, 2024 at 03:29:24PM +0300, Ville Syrjala wrote: > > > From: Ville Syrjälä <ville.syrjala@linux.intel.com> > > > > > > iwl4965 fails upon resume from hibernation on my laptop. The reason > > > seems to be a stale interrupt which isn't being cleared out before > > > interrupts are enabled. We end up with a race beween the resume > > > trying to bring things back up, and the restart work (queued form > > > the interrupt handler) trying to bring things down. Eventually > > > the whole thing blows up. > > > > > > Fix the problem by clearing out any stale interrupts before > > > interrupts get enabled. > > > > > > Here's a debug log of the indicent: > > > [ 12.042589] ieee80211 phy0: il_isr ISR inta 0x00000080, enabled 0xaa00008b, fh 0x00000000 > > > [ 12.042625] ieee80211 phy0: il4965_irq_tasklet inta 0x00000080, enabled 0x00000000, fh 0x00000000 > > > [ 12.042651] iwl4965 0000:10:00.0: RF_KILL bit toggled to enable radio. > > > [ 12.042653] iwl4965 0000:10:00.0: On demand firmware reload > > <snip> > > > [ 12.052207] ieee80211 phy0: il4965_mac_start enter > > > [ 12.052212] ieee80211 phy0: il_prep_station Add STA to driver ID 31: ff:ff:ff:ff:ff:ff > > > [ 12.052244] ieee80211 phy0: il4965_set_hw_ready hardware ready > > > [ 12.052324] ieee80211 phy0: il_apm_init Init card's basic functions > > > [ 12.052348] ieee80211 phy0: il_apm_init L1 Enabled; Disabling L0S > > > [ 12.055727] ieee80211 phy0: il4965_load_bsm Begin load bsm > > > [ 12.056140] ieee80211 phy0: il4965_verify_bsm Begin verify bsm > > > [ 12.058642] ieee80211 phy0: il4965_verify_bsm BSM bootstrap uCode image OK > > > [ 12.058721] ieee80211 phy0: il4965_load_bsm BSM write complete, poll 1 iterations > > > [ 12.058734] ieee80211 phy0: __il4965_up iwl4965 is coming up > > > [ 12.058737] ieee80211 phy0: il4965_mac_start Start UP work done. > > > [ 12.058757] ieee80211 phy0: __il4965_down iwl4965 is going down > > > [ 12.058761] ieee80211 phy0: il_scan_cancel_timeout Scan cancel timeout > > > [ 12.058762] ieee80211 phy0: il_do_scan_abort Not performing scan to abort > > > [ 12.058765] ieee80211 phy0: il_clear_ucode_stations Clearing ucode stations in driver > > > [ 12.058767] ieee80211 phy0: il_clear_ucode_stations No active stations found to be cleared > > > [ 12.058819] ieee80211 phy0: _il_apm_stop Stop card, put in low power state > > > [ 12.058827] ieee80211 phy0: _il_apm_stop_master stop master > > > [ 12.058864] ieee80211 phy0: il4965_clear_free_frames 0 frames on pre-allocated heap on clear. > > > [ 12.058869] ieee80211 phy0: Hardware restart was requested > > > [ 16.132299] iwl4965 0000:10:00.0: START_ALIVE timeout after 4000ms. > > > [ 16.132303] ------------[ cut here ]------------ > > > [ 16.132304] Hardware became unavailable upon resume. This could be a software issue prior to suspend or a hardware issue. > > <snip> > > > drivers/net/wireless/intel/iwlegacy/common.h | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/drivers/net/wireless/intel/iwlegacy/common.h b/drivers/net/wireless/intel/iwlegacy/common.h > > > index 2147781b5fff..758984d527bf 100644 > > > --- a/drivers/net/wireless/intel/iwlegacy/common.h > > > +++ b/drivers/net/wireless/intel/iwlegacy/common.h > > > @@ -2342,6 +2342,8 @@ static inline void > > > il_enable_interrupts(struct il_priv *il) > > > { > > > set_bit(S_INT_ENABLED, &il->status); > > > + _il_wr(il, CSR_INT, 0xffffffff); > > > + _il_wr(il, CSR_FH_INT_STATUS, 0xffffffff); > > > _il_wr(il, CSR_INT_MASK, il->inta_mask); > > > } > > > > RF_KILL is CSR_INT and we already acknowledged CSR_INT before > > il_enable_interrupts() in il4965_mac_start() -> __il4965_up()): > > > > /* make sure rfkill handshake bits are cleared */ > > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); > > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_DRV_GP1_BIT_CMD_BLOCKED); > > > > /* clear (again), then enable host interrupts */ > > _il_wr(il, CSR_INT, 0xFFFFFFFF); > > il_enable_interrupts(il); > > > > /* really make sure rfkill handshake bits are cleared */ > > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); > > _il_wr(il, CSR_UCODE_DRV_GP1_CLR, CSR_UCODE_SW_BIT_RFKILL); > > > > So the only explanation I can see the patch help with the problem > > is additional delay between first and second rfkill handshake bits > > clearing (which BTW looks fishy, since is done 2 times, > > and seems in the second line in the second clearing bit > > CSR_UCODE_DRV_GP1_BIT_CMD_BLOCKED should be used). > > > > I suspect the real problem is that we do enable rfkill > > interrupt by il_enable_rfkill_int() on il4965_mac_stop(). > > Since we want to know RF KILL state regardless interface > > is up or down. But then the interrupt is enabled during > > suspend period as well. > > > > Probably better fix would be add il_disable_interrupts() > > to il_pci_suspend(). Would you like to check that? > > That doesn't work, which doesn't surprise me since the state > of the device in .suspend() has no bearing on the state of the > device in .resume() when we're talking about hibernation. Oh, ok. > Hmm. I suppose we might want to minimize the potential > changes to any runtime behaviour in which case we could > do this instead: > > diff --git a/drivers/net/wireless/intel/iwlegacy/common.c b/drivers/net/wireless/intel/iwlegacy/common.c > index 9d33a66a49b5..7f58e31d23fe 100644 > --- a/drivers/net/wireless/intel/iwlegacy/common.c > +++ b/drivers/net/wireless/intel/iwlegacy/common.c > @@ -4962,6 +4962,8 @@ il_pci_resume(struct device *device) > */ > pci_write_config_byte(pdev, PCI_CFG_RETRY_TIMEOUT, 0x00); > > + _il_wr(il, CSR_INT, 0xffffffff); > + _il_wr(il, CSR_FH_INT_STATUS, 0xffffffff); > il_enable_interrupts(il); > > if (!(_il_rd(il, CSR_GP_CNTRL) & CSR_GP_CNTRL_REG_FLAG_HW_RF_KILL_SW)) > > which does work. I see, we have clear CSR_INT, not in il4965_mac_start() but in il_pci_resume() where we actually enable interrupts on resume. Now the solution makes sense to me. Feel free to post the second one if you think it's better, I'm also fine with the original patch. Thanks Stanislaw
diff --git a/drivers/net/wireless/intel/iwlegacy/common.h b/drivers/net/wireless/intel/iwlegacy/common.h index 2147781b5fff..758984d527bf 100644 --- a/drivers/net/wireless/intel/iwlegacy/common.h +++ b/drivers/net/wireless/intel/iwlegacy/common.h @@ -2342,6 +2342,8 @@ static inline void il_enable_interrupts(struct il_priv *il) { set_bit(S_INT_ENABLED, &il->status); + _il_wr(il, CSR_INT, 0xffffffff); + _il_wr(il, CSR_FH_INT_STATUS, 0xffffffff); _il_wr(il, CSR_INT_MASK, il->inta_mask); }