ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

Message ID 1542163824-795-1-git-send-email-wgong@codeaurora.org (mailing list archive)
State Changes Requested
Delegated to: Kalle Valo

Commit Message

Wen Gong Nov. 14, 2018, 2:50 a.m. UTC
When testing a simulated firmware crash, it is easy to trigger an error.
Command:
echo soft > /sys/kernel/debug/ieee80211/phyxx/ath10k/simulate_fw_crash

If the command is issued more than two times in quick succession, it fails.
Error messages:
ath10k_pci 0000:02:00.0: failed to set vdev 1 RX wake policy: -108
ath10k_pci 0000:02:00.0: device is wedged, will not restart

This happens because the state does not change back to ATH10K_STATE_ON
immediately, so more than two simulated crash processes run at the same
time and complete/wake up some fields twice, which destroys the normal
recovery process.

Tested with QCA6174 PCI with firmware
WLAN.RM.4.4.1-00109-QCARMSWPZ-1, but this will also affect QCA9377 PCI.
It's not a regression with new firmware releases.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
---
 drivers/net/wireless/ath/ath10k/debug.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Michal Kazior Nov. 14, 2018, 7:48 a.m. UTC | #1
On Wed, 14 Nov 2018 at 03:51, Wen Gong <wgong@codeaurora.org> wrote:
>
> When test simulate firmware crash, it is easy to trigger error.
> command:
> echo soft > /sys/kernel/debug/ieee80211/phyxx/ath10k/simulate_fw_crash.
>
> If input more than two times continuously, then it will have error.
> Error message:
> ath10k_pci 0000:02:00.0: failed to set vdev 1 RX wake policy: -108
> ath10k_pci 0000:02:00.0: device is wedged, will not restart
>
> It is because the state has not changed to ATH10K_STATE_ON immediately,
> then it will have more than two simulate crash process running meanwhile,
> and complete/wakeup some field twice, it destroy the normal recovery
> process.

This was intended to allow testing not only firmware crash path (and
recovery) but also firmware crash while recovering from a firmware
crash.


Michał
Wen Gong Jan. 7, 2019, 7:16 a.m. UTC | #2
> > It is because the state has not changed to ATH10K_STATE_ON
> > immediately, then it will have more than two simulate crash process
> > running meanwhile, and complete/wakeup some field twice, it destroy
> > the normal recovery process.
> 
> This was intended to allow testing not only firmware crash path (and
> recovery) but also firmware crash while recovering from a firmware crash.
> 
If the firmware is recovering from a crash, then simulating a new crash will trigger an error.
So remove it.
> 
> Michał
> 
Michal Kazior Jan. 7, 2019, 8:35 a.m. UTC | #3
On Mon, 7 Jan 2019 at 08:16, Wen Gong <wgong@qti.qualcomm.com> wrote:
>
> > > It is because the state has not changed to ATH10K_STATE_ON
> > > immediately, then it will have more than two simulate crash process
> > > running meanwhile, and complete/wakeup some field twice, it destroy
> > > the normal recovery process.
> >
> > This was intended to allow testing not only firmware crash path (and
> > recovery) but also firmware crash while recovering from a firmware crash.
> >
> If firmware is recovering from crash, then simulate a new crash will trigger error.
> So remove it.

That's actually a feature, not a bug. If firmware crashes while driver
is restarting after a crash then it's likely going to fail again and
again causing a crash-restart loop which can affect system performance
and responsiveness. It's better to give up and let the system admin
take over.

If it's still bothering you then please consider a crash counter
threshold so that, e.g. after 5 crash-while-restarting it's going to
give up. However I doubt it's worth the effort. My experience tells me
firmware crashes during recovery are rarely, if at all, transient.
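
A minimal sketch of the crash counter threshold suggested above, written as
standalone C rather than driver code. Every name here (crash_dev,
crash_on_fw_crash, CRASH_MAX_NESTED, the CRASH_ST_* states) is hypothetical
and only stands in for the real driver state. The limit of 5 is the "e.g."
figure from the text, not a tested value.

```c
#include <stdbool.h>

/* Hypothetical limit, taken from the "e.g. after 5" figure in the text. */
#define CRASH_MAX_NESTED 5

/* Stand-in for the driver state machine; not the real ath10k states. */
enum crash_state { CRASH_ST_ON, CRASH_ST_RESTARTING, CRASH_ST_WEDGED };

struct crash_dev {
	enum crash_state state;
	unsigned int nested;	/* crashes seen while already restarting */
};

/* Called on a firmware crash. Returns true if a restart should be tried. */
static bool crash_on_fw_crash(struct crash_dev *dev)
{
	switch (dev->state) {
	case CRASH_ST_ON:
		/* First crash from a healthy state: start recovery. */
		dev->nested = 0;
		dev->state = CRASH_ST_RESTARTING;
		return true;
	case CRASH_ST_RESTARTING:
		/* Crash while restarting: tolerate a few, then give up. */
		if (++dev->nested >= CRASH_MAX_NESTED) {
			dev->state = CRASH_ST_WEDGED;
			return false;
		}
		return true;
	default:
		/* Already wedged: leave it to the system admin. */
		return false;
	}
}

/* Called when recovery finishes; clears the nested-crash count. */
static void crash_on_restart_done(struct crash_dev *dev)
{
	if (dev->state == CRASH_ST_RESTARTING) {
		dev->state = CRASH_ST_ON;
		dev->nested = 0;
	}
}
```

With this shape, a single crash followed by a clean recovery never touches the
limit; only repeated crashes inside one recovery attempt count toward it.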

The simulated fw crash is not representative here. It's a mere tool to
test driver code.


Michał
Wen Gong Jan. 8, 2019, 8:45 a.m. UTC | #4
> >
> > > > It is because the state has not changed to ATH10K_STATE_ON
> > > > immediately, then it will have more than two simulate crash
> > > > process running meanwhile, and complete/wakeup some field twice,
> > > > it destroy the normal recovery process.
> > >
> > > This was intended to allow testing not only firmware crash path (and
> > > recovery) but also firmware crash while recovering from a firmware crash.
> > >
> > If firmware is recovering from crash, then simulate a new crash will trigger
> error.
> > So remove it.
> 
> That's actually a feature, not a bug. If firmware crashes while driver is
> restarting after a crash then its likely going to fail again and again causing a
> crash-restart loop which can affect system performance and responsiveness.
> It's better to give up and let the system admin take over.
> 
> If it's still bothering you then please consider a crash counter threshold so
> that, e.g. after 5 crash-while-restarting it's going to give up. However I doubt
> it's worth the effort. My experience tells me firmware crashes during
> recovery are rarely, if at all, transient.
> 
> The simulated fw crash is not representative here. It's a mere tool to test
> driver code.

The simulated fw crash is only a tool that lets the user trigger a fw crash with a command.
The purpose of this change is to prevent the user from triggering a fw crash while the fw
is not in a normal state.

If the fw is in a recovering state, whether triggered by the user's command or by the fw
itself, the user cannot run the command to trigger a fw crash again until the fw returns
to a normal state.

> 
> 
> Michał
Kalle Valo Feb. 8, 2019, 1:32 p.m. UTC | #5
Wen Gong <wgong@qti.qualcomm.com> writes:

>> > > > It is because the state has not changed to ATH10K_STATE_ON
>> > > > immediately, then it will have more than two simulate crash
>> > > > process running meanwhile, and complete/wakeup some field twice,
>> > > > it destroy the normal recovery process.
>> > >
>> > > This was intended to allow testing not only firmware crash path (and
>> > > recovery) but also firmware crash while recovering from a firmware crash.
>> > >
>> > If firmware is recovering from crash, then simulate a new crash will trigger
>> error.
>> > So remove it.
>> 
>> That's actually a feature, not a bug. If firmware crashes while driver is
>> restarting after a crash then its likely going to fail again and again causing a
>> crash-restart loop which can affect system performance and responsiveness.
>> It's better to give up and let the system admin take over.
>> 
>> If it's still bothering you then please consider a crash counter threshold so
>> that, e.g. after 5 crash-while-restarting it's going to give up. However I doubt
>> it's worth the effort. My experience tells me firmware crashes during
>> recovery are rarely, if at all, transient.
>> 
>> The simulated fw crash is not representative here. It's a mere tool to test
>> driver code.
>
> The simulated fw crash is only a tool for user to trigger fw crash
> with command

I think Michal knows what simulate_fw_crash is, as he is the one who
implemented it in commit 278c4a85e626 :)

> This change's purpose is to disallow user to trigger fw crash if the fw is not in a
> Normal state.
>
> If the fw is in recovering state triggered by user's command or by fw, then it will
> disallow user to run command to trigger fw crash again until fw become to a normal
> State.

I agree with Michal here and his proposal about having a crash counter
sounds like a good to me. So I'm dropping this patch.
Kalle Valo Feb. 8, 2019, 1:34 p.m. UTC | #6
Kalle Valo <kvalo@codeaurora.org> writes:

>> This change's purpose is to disallow user to trigger fw crash if the fw is not in a
>> Normal state.
>>
>> If the fw is in recovering state triggered by user's command or by fw, then it will
>> disallow user to run command to trigger fw crash again until fw become to a normal
>> State.
>
> I agree with Michal here and his proposal about having a crash counter
> sounds like a good to me. So I'm dropping this patch.

Bah, missed a word again. I meant "sounds like a good idea to me".
Wen Gong April 1, 2019, 6:11 a.m. UTC | #7
> -----Original Message-----
> From: Michał Kazior <kazikcz@gmail.com>
> Sent: Monday, January 7, 2019 4:36 PM
> To: Wen Gong <wgong@qti.qualcomm.com>
> Cc: Wen Gong <wgong@codeaurora.org>; linux-wireless <linux-
> wireless@vger.kernel.org>; ath10k@lists.infradead.org
> Subject: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
> That's actually a feature, not a bug. If firmware crashes while driver
> is restarting after a crash then its likely going to fail again and
> again causing a crash-restart loop which can affect system performance
> and responsiveness. It's better to give up and let the system admin
> take over.
> 
> If it's still bothering you then please consider a crash counter
> threshold so that, e.g. after 5 crash-while-restarting it's going to
> give up. However I doubt it's worth the effort. My experience tells me
> firmware crashes during recovery are rarely, if at all, transient.
> 
> The simulated fw crash is not representative here. It's a mere tool to
> test driver code.
> 
Hi Michal,
There is a stress test case for the simulated fw crash. It simulates a fw crash
at very short intervals, which causes the stress test to fail.
The simulated fw crash process should not run in parallel. With this patch, the
stress test case passes.
> 
> Michał
Wen Gong April 8, 2019, 10:19 a.m. UTC | #8
> -----Original Message-----
> From: Wen Gong
> Sent: Monday, April 1, 2019 2:11 PM
> To: 'Michał Kazior' <kazikcz@gmail.com>
> Cc: Wen Gong <wgong@codeaurora.org>; linux-wireless <linux-
> wireless@vger.kernel.org>; ath10k@lists.infradead.org
> Subject: RE: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
> 
> >
> > If it's still bothering you then please consider a crash counter
> > threshold so that, e.g. after 5 crash-while-restarting it's going to
> > give up. However I doubt it's worth the effort. My experience tells me
> > firmware crashes during recovery are rarely, if at all, transient.
> >
> > The simulated fw crash is not representative here. It's a mere tool to
> > test driver code.
> >
> Hi Michal,
> There have a stress test case for the simulate fw crash, it will simulate fw
> crash
> in a very short time for each test, this will trigger the stress test fail.
> The simulate fw crash process should not be run parallel, after this patch, the
> Stress test case will pass.
> >

Hi Michał,
Do you have any new comments?

> > Michał
Michal Kazior April 8, 2019, 5:27 p.m. UTC | #9
On Mon, 8 Apr 2019 at 12:20, Wen Gong <wgong@qti.qualcomm.com> wrote:
> > -----Original Message-----
> > From: Wen Gong
> > Sent: Monday, April 1, 2019 2:11 PM
> > To: 'Michał Kazior' <kazikcz@gmail.com>
> > Cc: Wen Gong <wgong@codeaurora.org>; linux-wireless <linux-
> > wireless@vger.kernel.org>; ath10k@lists.infradead.org
> > Subject: RE: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> > simulate fw crash
> >
> > >
> > > If it's still bothering you then please consider a crash counter
> > > threshold so that, e.g. after 5 crash-while-restarting it's going to
> > > give up. However I doubt it's worth the effort. My experience tells me
> > > firmware crashes during recovery are rarely, if at all, transient.
> > >
> > > The simulated fw crash is not representative here. It's a mere tool to
> > > test driver code.
> > >
> > Hi Michal,
> > There have a stress test case for the simulate fw crash, it will simulate fw
> > crash
> > in a very short time for each test, this will trigger the stress test fail.
> > The simulate fw crash process should not be run parallel, after this patch, the
> > Stress test case will pass.
> > >
>
> Hi Michał,
> Do you have some new comments?

My original use case was to be able to exercise the driver's
robustness in handling nested fw crashes, IOW crash-within-a-crash.

Your test case, as far as I understand, intends to perform
consecutive, non-nested fw crash simulation stress test.

Both of these are mutually exclusive and your patch fixes your test
case at the expense of breaking my original case.

To satisfy both I would suggest you either expose ar->state via
debugfs and make your test procedure wait for that to get back into ON
state before simulating a crash again, or to extend the set of current
simulate_fw_crash commands (currently just: soft, hard, assert,
hw-restart) to something that allows expressing the intent whether
crash-in-crash prevention is intended (your case) or not (my original
case).

This could be for example something like this:
  echo soft wait-ready > simulate_fw_crash

The "wait-ready" extra keyword would imply crash-in-crash prevention.
This would keep existing tools working (both behavior and syntax) and
would allow your test case to be implemented.
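
A rough sketch of how the optional keyword could gate the request, as
standalone C and not as a patch against debug.c. The "wait-ready" keyword is
only the proposal from this mail, and cmd_check_crash and the CMD_ST_* states
are hypothetical stand-ins. The sketch assumes the trailing newline has
already been stripped and does not validate the crash mode (soft, hard, ...).

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Stand-in for ar->state; not the real ath10k enum. */
enum cmd_state { CMD_ST_ON, CMD_ST_RESTARTED, CMD_ST_OTHER };

/*
 * Decide whether a simulate_fw_crash write may proceed.
 * Returns 0 if allowed, -ENETDOWN if the state forbids it,
 * -EINVAL if the extra keyword is not recognised.
 */
static int cmd_check_crash(const char *cmd, enum cmd_state state)
{
	const char *opt = strchr(cmd, ' ');
	bool wait_ready = false;

	if (opt) {
		opt += strspn(opt, " ");	/* skip separator spaces */
		if (strcmp(opt, "wait-ready") == 0)
			wait_ready = true;	/* hypothetical keyword */
		else if (*opt != '\0')
			return -EINVAL;
	}

	/* A fully started device always accepts the request. */
	if (state == CMD_ST_ON)
		return 0;

	/* Without the keyword, crash-in-crash stays possible as before. */
	if (state == CMD_ST_RESTARTED && !wait_ready)
		return 0;

	return -ENETDOWN;
}
```

A plain "soft" keeps today's behavior, while "soft wait-ready" is refused
until the state is back to ON, so a stress test can simply retry on error.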


Michał
Wen Gong April 9, 2019, 5:09 a.m. UTC | #10
> From: Michał Kazior <kazikcz@gmail.com>
> Sent: Tuesday, April 9, 2019 1:27 AM
> To: Wen Gong <wgong@qti.qualcomm.com>
> Cc: Wen Gong <wgong@codeaurora.org>; linux-wireless <linux-
> wireless@vger.kernel.org>; ath10k@lists.infradead.org
> Subject: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
> 
> > > Hi Michal,
> > > There have a stress test case for the simulate fw crash, it will simulate fw
> > > crash
> > > in a very short time for each test, this will trigger the stress test fail.
> > > The simulate fw crash process should not be run parallel, after this patch,
> the
> > > Stress test case will pass.
> > > >
> >
> > Hi Michał,
> > Do you have some new comments?
> 
> My original use case was to be able to exercise the driver's
> robustness in handling nested fw crashes, IOW crash-within-a-crash.
> 
> Your test case, as far as I understand, intends to perform
> consecutive, non-nested fw crash simulation stress test.
> 
> Both of these are mutually exclusive and your patch fixes your test
> case at the expense of breaking my original case.
> 
> To satisfy both I would suggest you either expose ar->state via
> debugfs and make your test procedure wait for that to get back into ON
> state before simulating a crash again, or to extend the set of current
> simulate_fw_crash commands (currently just: soft, hard, assert,
> hw-restart) to something that allows expressing the intent whether
> crash-in-crash prevention is intended (your case) or not (my original
> case).
> 
> This could be for example something like this:
>   echo soft wait-ready > simulate_fw_crash
> 
> The "wait-ready" extra keyword would imply crash-in-crash prevention.
> This would keep existing tools working (both behavior and syntax) and
> would allow your test case to be implemented.
> 
Is it easy to change your existing tools?
I would prefer to change it to: echo soft skip-ready > simulate_fw_crash
The "skip-ready" extra keyword would imply crash-in-crash, *not* prevention.
My test tools are hard to change.

> 
> Michał
Brian Norris April 9, 2019, 11:25 p.m. UTC | #11
On Mon, Apr 8, 2019 at 10:09 PM Wen Gong <wgong@qti.qualcomm.com> wrote:
> > From: Michał Kazior <kazikcz@gmail.com>
> > To satisfy both I would suggest you either expose ar->state via
> > debugfs and make your test procedure wait for that to get back into ON
> > state before simulating a crash again, or to extend the set of current
> > simulate_fw_crash commands (currently just: soft, hard, assert,
> > hw-restart) to something that allows expressing the intent whether
> > crash-in-crash prevention is intended (your case) or not (my original
> > case).
> >
> > This could be for example something like this:
> >   echo soft wait-ready > simulate_fw_crash
> >
> > The "wait-ready" extra keyword would imply crash-in-crash prevention.
> > This would keep existing tools working (both behavior and syntax) and
> > would allow your test case to be implemented.
> >
> Is it easy to change your existing tools?
> I want to change it to: echo soft skip-ready > simulate_fw_crash
> The "skip-ready" extra keyword would imply crash-in-crash, *not* prevention.
> My test tools is hard to change.

In case you're talking about the test framework we run for ChromeOS
validation, no, it's not hard at all to change. As long as there's a
good reason.

I haven't closely followed this, but judging by the above summary,
it's probably more reasonable for our test framework to only simulate
FW crashes after the driver returns to "ready" (or at least, if we do
crash-in-crash, don't expect the driver to recover?). I expect we can
work with whatever mechanism you implement for that (exposing the
"state", or providing a new simulate_fw_crash mode).

Brian
Wen Gong April 10, 2019, 2:45 a.m. UTC | #12
> -----Original Message-----
> From: Brian Norris <briannorris@chromium.org>
> Sent: Wednesday, April 10, 2019 7:25 AM
> To: Wen Gong <wgong@qti.qualcomm.com>
> Cc: Michał Kazior <kazikcz@gmail.com>; Wen Gong
> <wgong@codeaurora.org>; linux-wireless <linux-wireless@vger.kernel.org>;
> ath10k@lists.infradead.org
> Subject: [EXT] Re: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
> 
> On Mon, Apr 8, 2019 at 10:09 PM Wen Gong <wgong@qti.qualcomm.com>
> wrote:
> > > From: Michał Kazior <kazikcz@gmail.com>
> > > To satisfy both I would suggest you either expose ar->state via
> > > debugfs and make your test procedure wait for that to get back into ON
> > > state before simulating a crash again, or to extend the set of current
> > > simulate_fw_crash commands (currently just: soft, hard, assert,
> > > hw-restart) to something that allows expressing the intent whether
> > > crash-in-crash prevention is intended (your case) or not (my original
> > > case).
> > >
> > > This could be for example something like this:
> > >   echo soft wait-ready > simulate_fw_crash
> > >
> > > The "wait-ready" extra keyword would imply crash-in-crash prevention.
> > > This would keep existing tools working (both behavior and syntax) and
> > > would allow your test case to be implemented.
> > >
> > Is it easy to change your existing tools?
> > I want to change it to: echo soft skip-ready > simulate_fw_crash
> > The "skip-ready" extra keyword would imply crash-in-crash, *not*
> prevention.
> > My test tools is hard to change.
> 
> In case you're talking about the test framework we run for ChromeOS
> validation, no, it's not hard at all to change. As long as there's a
> good reason.
> 
> I haven't closely followed this, but judging by the above summary,
> it's probably more reasonable for our test framework to only simulate
> FW crashes after the driver returns to "ready" (or at least, if we do
> crash-in-crash, don't expect the driver to recover?). I expect we can
> work with whatever mechanism you implement for that (exposing the
> "state", or providing a new simulate_fw_crash mode).
> 

If the ChromeOS tool is easy to change,
I think I will change the mechanism of simulate_fw_crash.
Then all tools will work normally.

> Brian
Wen Gong May 28, 2019, 2:49 a.m. UTC | #13
> -----Original Message-----
> From: ath10k <ath10k-bounces@lists.infradead.org> On Behalf Of Wen Gong
> Sent: Wednesday, April 10, 2019 10:45 AM
> To: Brian Norris <briannorris@chromium.org>
> Cc: Michał Kazior <kazikcz@gmail.com>; linux-wireless <linux-
> wireless@vger.kernel.org>; ath10k@lists.infradead.org; Wen Gong
> <wgong@codeaurora.org>
> Subject: [EXT] RE: [PATCH] ath10k: Remove ATH10K_STATE_RESTARTED in
> simulate fw crash
> 
> If ChromeOS is easy to change tool,
> I think I will change the mechanism of the simulate_fw_crash.
> Then all tools will work normally.
> 
New patch uploaded
https://patchwork.kernel.org/patch/10897587/
[v2] ath10k: Remove ATH10K_STATE_RESTARTED in simulate fw crash

Patch

diff --git a/drivers/net/wireless/ath/ath10k/debug.c b/drivers/net/wireless/ath/ath10k/debug.c
index ada29a4..dc8700b 100644
--- a/drivers/net/wireless/ath/ath10k/debug.c
+++ b/drivers/net/wireless/ath/ath10k/debug.c
@@ -569,8 +569,7 @@  static ssize_t ath10k_write_simulate_fw_crash(struct file *file,
 
 	mutex_lock(&ar->conf_mutex);
 
-	if (ar->state != ATH10K_STATE_ON &&
-	    ar->state != ATH10K_STATE_RESTARTED) {
+	if (ar->state != ATH10K_STATE_ON) {
 		ret = -ENETDOWN;
 		goto exit;
 	}