
wireless: ath10k: Return early in ath10k_qmi_event_server_exit() to avoid hard crash on reboot

Message ID 20200602052533.15048-1-john.stultz@linaro.org (mailing list archive)
State New, archived

Commit Message

John Stultz June 2, 2020, 5:25 a.m. UTC
Ever since 5.7-rc1, if we call
ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
reboot, resulting in the device getting stuck in the usb crash
debug mode and not coming back up without a hard power off.

This hack avoids the issue by returning early in
ath10k_qmi_event_server_exit().

A better solution is very much desired!

Feedback and suggestions welcome!

Cc: Rakesh Pillai <pillair@qti.qualcomm.com>
Cc: Govind Singh <govinds@codeaurora.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Niklas Cassel <niklas.cassel@linaro.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: ath10k@lists.infradead.org
Reported-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 drivers/net/wireless/ath/ath10k/qmi.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Brian Norris June 2, 2020, 7:16 p.m. UTC | #1
+ Sibi

On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>
> Ever since 5.7-rc1, if we call
> ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> reboot, resulting in the device getting stuck in the usb crash
> debug mode and not coming back up without a hard power off.
>
> This hack avoids the issue by returning early in
> ath10k_qmi_event_server_exit().
>
> A better solution is very much desired!

Any chance you can bisect what caused this? There are a lot of
non-ath10k pieces involved in this stuff.

Brian
John Stultz June 2, 2020, 7:40 p.m. UTC | #2
On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>
> + Sibi
>
> On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
> >
> > Ever since 5.7-rc1, if we call
> > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> > reboot, resulting in the device getting stuck in the usb crash
> > debug mode and not coming back up without a hard power off.
> >
> > This hack avoids the issue by returning early in
> > ath10k_qmi_event_server_exit().
> >
> > A better solution is very much desired!
>
> Any chance you can bisect what caused this? There are a lot of
> non-ath10k pieces involved in this stuff.

Amit spent some time chasing it down to the in-kernel qrtr-ns
work, and reported it here:
  https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html

But that discussion seemingly stalled out, so I came up with this hack
to work around it for us.

thanks
-john
Brian Norris June 2, 2020, 8:04 p.m. UTC | #3
On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
> On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
> > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
> > >
> > > Ever since 5.7-rc1, if we call
> > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> > > reboot, resulting in the device getting stuck in the usb crash
> > > debug mode and not coming back up without a hard power off.
> > >
> > > This hack avoids the issue by returning early in
> > > ath10k_qmi_event_server_exit().
> > >
> > > A better solution is very much desired!
> >
> > Any chance you can bisect what caused this? There are a lot of
> > non-ath10k pieces involved in this stuff.
>
> Amit had spent some work on chasing it down to the in kernel qrtr-ns
> work, and reported it here:
>   https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>
> But that discussion seemingly stalled out, so I came up with this hack
> to workaround it for us.

If I'm reading it right, then that means we should revert this stuff
from v5.7-rc1:

0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace

At least, until people can resolve the tail end of that thread. New
features (ath11k, etc.) are not a reason to break existing features
(ath10k/wcn3990).

Brian
Manivannan Sadhasivam June 3, 2020, 12:27 a.m. UTC | #4
On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
> > > >
> > > > Ever since 5.7-rc1, if we call
> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> > > > reboot, resulting in the device getting stuck in the usb crash
> > > > debug mode and not coming back up without a hard power off.
> > > >
> > > > This hack avoids the issue by returning early in
> > > > ath10k_qmi_event_server_exit().
> > > >
> > > > A better solution is very much desired!
> > >
> > > Any chance you can bisect what caused this? There are a lot of
> > > non-ath10k pieces involved in this stuff.
> >
> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
> > work, and reported it here:
> >   https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
> >
> > But that discussion seemingly stalled out, so I came up with this hack
> > to workaround it for us.
> 
> If I'm reading it right, then that means we should revert this stuff
> from v5.7-rc1:
> 
> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
> 
> At least, until people can resolve the tail end of that thread. New
> features (ath11k, etc.) are not a reason to break existing features
> (ath10k/wcn3990).

I don't agree with this. If you read through the replies to the bug report,
it is clear that NS migration uncovered a corner case or even a bug. So we
should try to fix that indeed.

Govind: Did you get a chance to work on fixing this issue?

Thanks,
Mani

> 
> Brian
Govind Singh June 3, 2020, 10:07 a.m. UTC | #5
Hi Mani,

On 2020-06-03 05:57, Manivannan Sadhasivam wrote:
> On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
>> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> 
>> wrote:
>> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>> > > >
>> > > > Ever since 5.7-rc1, if we call
>> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
>> > > > reboot, resulting in the device getting stuck in the usb crash
>> > > > debug mode and not coming back up without a hard power off.
>> > > >
>> > > > This hack avoids the issue by returning early in
>> > > > ath10k_qmi_event_server_exit().
>> > > >
>> > > > A better solution is very much desired!
>> > >
>> > > Any chance you can bisect what caused this? There are a lot of
>> > > non-ath10k pieces involved in this stuff.
>> >
>> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
>> > work, and reported it here:
>> >   https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>> >
>> > But that discussion seemingly stalled out, so I came up with this hack
>> > to workaround it for us.
>> 
>> If I'm reading it right, then that means we should revert this stuff
>> from v5.7-rc1:
>> 
>> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
>> 
>> At least, until people can resolve the tail end of that thread. New
>> features (ath11k, etc.) are not a reason to break existing features
>> (ath10k/wcn3990).
> 
> I don't agree with this. If you read through the replies to the bug 
> report,
> it is clear that NS migration uncovered a corner case or even a bug. So 
> we
> should try to fix that indeed.
> 
> Govind: Did you get chance to work on fixing this issue?
> 

I have done basic testing by moving the msa map/unmap from the qmi service
callbacks to the init/de-init path.
I will send a patch for review.
The reason for the del_server needs to be investigated from the rproc side.
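
A minimal sketch of the idea described above, as a standalone C model rather than driver code. Every name in it (struct wlan_dev, msa_map(), msa_unmap(), dev_init(), dev_deinit(), on_server_arrive(), on_server_exit()) is hypothetical and chosen for illustration only; none of them is the ath10k API. It only shows the intended ordering: the region is mapped once at init and unmapped once at de-init, so a service exit event no longer touches the mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model; not the real ath10k structures. */
struct wlan_dev {
	bool msa_mapped;  /* region currently shared with the remote side */
	bool fw_running;  /* remote service is currently up */
	int map_calls;    /* how many times the region was mapped */
	int unmap_calls;  /* how many times the region was unmapped */
};

static void msa_map(struct wlan_dev *dev)
{
	assert(!dev->msa_mapped); /* mapping twice would be a bug */
	dev->msa_mapped = true;
	dev->map_calls++;
}

static void msa_unmap(struct wlan_dev *dev)
{
	assert(dev->msa_mapped); /* unmapping an unmapped region would be a bug */
	dev->msa_mapped = false;
	dev->unmap_calls++;
}

/* Init path: the mapping now happens here, exactly once. */
static void dev_init(struct wlan_dev *dev)
{
	*dev = (struct wlan_dev){0};
	msa_map(dev);
}

/* De-init path: the unmapping now happens here, exactly once. */
static void dev_deinit(struct wlan_dev *dev)
{
	msa_unmap(dev);
}

/* Service callbacks: they only track service state, not the mapping. */
static void on_server_arrive(struct wlan_dev *dev)
{
	dev->fw_running = true;
}

static void on_server_exit(struct wlan_dev *dev)
{
	dev->fw_running = false;
}
```

In this model a del_server event during shutdown only clears fw_running; the mapping is left alone until dev_deinit() runs.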

> Thanks,
> Mani
> 
>> 
>> Brian

Thanks,
Govind
Sibi Sankar June 4, 2020, 6:17 p.m. UTC | #6
On 2020-06-03 15:37, govinds@codeaurora.org wrote:
> Hi Mani,
> 
> On 2020-06-03 05:57, Manivannan Sadhasivam wrote:
>> On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
>>> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> 
>>> wrote:
>>> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>>> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>>> > > >
>>> > > > Ever since 5.7-rc1, if we call
>>> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
>>> > > > reboot, resulting in the device getting stuck in the usb crash
>>> > > > debug mode and not coming back up without a hard power off.
>>> > > >
>>> > > > This hack avoids the issue by returning early in
>>> > > > ath10k_qmi_event_server_exit().
>>> > > >
>>> > > > A better solution is very much desired!
>>> > >
>>> > > Any chance you can bisect what caused this? There are a lot of
>>> > > non-ath10k pieces involved in this stuff.
>>> >
>>> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
>>> > work, and reported it here:
>>> >   https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>>> >
>>> > But that discussion seemingly stalled out, so I came up with this hack
>>> > to workaround it for us.
>>> 
>>> If I'm reading it right, then that means we should revert this stuff
>>> from v5.7-rc1:
>>> 
>>> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
>>> 
>>> At least, until people can resolve the tail end of that thread. New
>>> features (ath11k, etc.) are not a reason to break existing features
>>> (ath10k/wcn3990).
>> 
>> I don't agree with this. If you read through the replies to the bug 
>> report,
>> it is clear that NS migration uncovered a corner case or even a bug. 
>> So we
>> should try to fix that indeed.
>> 
>> Govind: Did you get chance to work on fixing this issue?
>> 
> 
> I have done basic testing by moving msa map/unmap from qmi service
> callbacks to init/de-init path.
> I will send patch for review.
> The reason for the del_server needs to be investigated from the rproc side.

Govind,
On receiving SIGTERM, rmtfs would try
to perform a graceful shutdown of the
modem; that should be the source of
the del_server.

> 
>> Thanks,
>> Mani
>> 
>>> 
>>> Brian
> 
> Thanks,
> Govind
Kalle Valo June 8, 2020, 11:17 a.m. UTC | #7
John Stultz <john.stultz@linaro.org> writes:

> Ever since 5.7-rc1, if we call
> ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> reboot, resulting in the device getting stuck in the usb crash
> debug mode and not coming back up without a hard power off.
>
> This hack avoids the issue by returning early in
> ath10k_qmi_event_server_exit().
>
> A better solution is very much desired!
>
> Feedback and suggestions welcome!
>
> Cc: Rakesh Pillai <pillair@qti.qualcomm.com>
> Cc: Govind Singh <govinds@codeaurora.org>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: Niklas Cassel <niklas.cassel@linaro.org>
> Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> Cc: Amit Pundir <amit.pundir@linaro.org>
> Cc: Brian Norris <briannorris@chromium.org>
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Cc: ath10k@lists.infradead.org
> Reported-by: Amit Pundir <amit.pundir@linaro.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>

Just so you know: as you didn't CC linux-wireless it's not on patchwork
and hence not on my radar. But hopefully we find a better solution to
fix this.
Kalle Valo June 8, 2020, 11:37 a.m. UTC | #8
Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> writes:

> On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
>> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
>> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>> > > >
>> > > > Ever since 5.7-rc1, if we call
>> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
>> > > > reboot, resulting in the device getting stuck in the usb crash
>> > > > debug mode and not coming back up without a hard power off.
>> > > >
>> > > > This hack avoids the issue by returning early in
>> > > > ath10k_qmi_event_server_exit().
>> > > >
>> > > > A better solution is very much desired!
>> > >
>> > > Any chance you can bisect what caused this? There are a lot of
>> > > non-ath10k pieces involved in this stuff.
>> >
>> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
>> > work, and reported it here:
>> >   https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>> >
>> > But that discussion seemingly stalled out, so I came up with this hack
>> > to workaround it for us.
>> 
>> If I'm reading it right, then that means we should revert this stuff
>> from v5.7-rc1:
>> 
>> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
>> 
>> At least, until people can resolve the tail end of that thread. New
>> features (ath11k, etc.) are not a reason to break existing features
>> (ath10k/wcn3990).
>
> I don't agree with this. If you read through the replies to the bug report,
> it is clear that NS migration uncovered a corner case or even a bug. So we
> should try to fix that indeed.

I'm with Mani, we should try to fix ath10k instead. Hopefully we can
find a fix soon.

Forcing QCA6390 users to use the userspace qrtr-ns would be a bad user
experience; I would really like to avoid that.
Amit Pundir Aug. 17, 2020, 9:06 a.m. UTC | #9
On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
> > I don't agree with this. If you read through the replies to the bug report,
> > it is clear that NS migration uncovered a corner case or even a bug. So we
> > should try to fix that indeed.
>
> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
> find a fix soon.

Hi Team,

Any updates on this? I can reproduce this hard crash on v5.9-rc1 as well.

It is not a blocker for us because we switched to a userspace
workaround, where we do not wait for the modem to shut down gracefully
and SIGKILL it instead, during the shutdown/reboot process. But I'm happy
to take a swing at any intermediate/in-progress solution available.
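
A rough POSIX sketch of that kind of userspace workaround, for illustration only. The thread does not say which process is killed or how, so the stand-in child that ignores SIGTERM and the helper names spawn_stubborn_child() and force_kill() are assumptions, not the actual change.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a daemon whose graceful shutdown we do not
 * want to wait for: it ignores SIGTERM and sleeps forever.
 * Returns the child's pid in the parent, or -1 if fork() failed.
 */
static pid_t spawn_stubborn_child(void)
{
	pid_t pid = fork();

	if (pid == 0) {
		signal(SIGTERM, SIG_IGN);
		for (;;)
			pause();
	}
	return pid;
}

/*
 * Skip the graceful path: send SIGKILL, which cannot be caught or
 * ignored, then reap the child. Returns the signal that terminated it,
 * 0 if it exited on its own, or -1 on error.
 */
static int force_kill(pid_t pid)
{
	int status;

	assert(pid > 0); /* pid 0 or -1 would signal whole process groups */
	if (kill(pid, SIGKILL) == -1)
		return -1;
	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Because the process never gets to run its shutdown handler, nothing it would normally do on SIGTERM (here, per the earlier replies, the graceful modem shutdown) takes place.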

Regards,
Amit Pundir

>
> Forcing QCA6390 users to use the userspace qrtr-ns would be bad user
> experience, I really would want to avoid that.
>
> --
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches
Kalle Valo Aug. 28, 2020, 12:52 p.m. UTC | #10
Amit Pundir <amit.pundir@linaro.org> writes:

> On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
>> > I don't agree with this. If you read through the replies to the bug report,
>> > it is clear that NS migration uncovered a corner case or even a bug. So we
>> > should try to fix that indeed.
>>
>> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
>> find a fix soon.
>
> Hi Team,
>
> Any updates on this? I can reproduce this hard crash on v5.9-rc1 as well.
>
> It is not a blocker for us because we switched to a userspace
> workaround, where we do not wait for modem to shutdown gracefully and
> SIGKILL it instead, during the shutdown/reboot process. But I'm happy
> to take a swing at any intermediate/in-progress solution available.

Govind submitted this patch and later he asked to drop it, but I think
it would be a good idea to test it anyway:

ath10k: Move msa region map/unmap to init/deinit path

https://lkml.kernel.org/r/1591191231-31917-1-git-send-email-govinds@codeaurora.org

(patchwork is down so I cannot give a patchwork link)
Govind Singh Aug. 28, 2020, 1:07 p.m. UTC | #11
Hi Kalle,

On 2020-08-28 18:22, Kalle Valo wrote:
> Amit Pundir <amit.pundir@linaro.org> writes:
> 
>> On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
>>> > I don't agree with this. If you read through the replies to the bug report,
>>> > it is clear that NS migration uncovered a corner case or even a bug. So we
>>> > should try to fix that indeed.
>>> 
>>> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
>>> find a fix soon.
>> 
>> Hi Team,
>> 
>> Any updates on this? I can reproduce this hard crash on v5.9-rc1 as 
>> well.
>> 
>> It is not a blocker for us because we switched to a userspace
>> workaround, where we do not wait for modem to shutdown gracefully and
>> SIGKILL it instead, during the shutdown/reboot process. But I'm happy
>> to take a swing at any intermediate/in-progress solution available.
> 
> Govind submitted this patch and later he asked to drop it, but I think
> it would be a good idea to test it anyway:
> 
> ath10k: Move msa region map/unmap to init/deinit path
> 
> https://lkml.kernel.org/r/1591191231-31917-1-git-send-email-govinds@codeaurora.org
> 
> (patchwork is down so I cannot give a patchwork link)

This patch does not fix the issue, and changing the MSA mapping sequence
is a major design change.
This issue is only seen on DB845, which uses an SCM call; newer targets
(QCS404/SC7180/SM8150) will not have this issue, as the MSA mapping is
hard-coded in TZ.
Changes in the qmi layer to give a different indication for this
scenario, plus changes in FW, are probably required to mitigate this
issue gracefully.

BR,
Govind
Kalle Valo Sept. 7, 2020, 4:12 p.m. UTC | #12
Govind Singh <govinds@codeaurora.org> writes:

> On 2020-08-28 18:22, Kalle Valo wrote:
>> Amit Pundir <amit.pundir@linaro.org> writes:
>>
>>> On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
>>>> > I don't agree with this. If you read through the replies to the bug report,
>>>> > it is clear that NS migration uncovered a corner case or even a bug. So we
>>>> > should try to fix that indeed.
>>>>
>>>> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
>>>> find a fix soon.
>>>
>>> Hi Team,
>>>
>>> Any updates on this? I can reproduce this hard crash on v5.9-rc1 as
>>> well.
>>>
>>> It is not a blocker for us because we switched to a userspace
>>> workaround, where we do not wait for modem to shutdown gracefully and
>>> SIGKILL it instead, during the shutdown/reboot process. But I'm happy
>>> to take a swing at any intermediate/in-progress solution available.
>>
>> Govind submitted this patch and later he asked to drop it, but I think
>> it would be a good idea to test it anyway:
>>
>> ath10k: Move msa region map/unmap to init/deinit path
>>
>> https://lkml.kernel.org/r/1591191231-31917-1-git-send-email-govinds@codeaurora.org
>>
>> (patchwork is down so I cannot give a patchwork link)
>
> This patch does not fix the issue, and changing the MSA mapping
> sequence is a major design change. This issue is only seen on DB845,
> which uses an SCM call; newer targets (QCS404/SC7180/SM8150) will not
> have this issue, as the MSA mapping is hard-coded in TZ. Changes in the
> qmi layer to give a different indication for this scenario, plus changes
> in FW, are probably required to mitigate this issue gracefully.

Oh, bad news :/ Can anyone look at that in detail? Even a quick hack
patch would move this forward.

Patch

diff --git a/drivers/net/wireless/ath/ath10k/qmi.c b/drivers/net/wireless/ath/ath10k/qmi.c
index 85dce43c5439..ab38562ce1cb 100644
--- a/drivers/net/wireless/ath/ath10k/qmi.c
+++ b/drivers/net/wireless/ath/ath10k/qmi.c
@@ -854,6 +854,11 @@  static void ath10k_qmi_event_server_exit(struct ath10k_qmi *qmi)
 	struct ath10k *ar = qmi->ar;
 	struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
 
+	/*
+	 * HACK: Calling ath10k_qmi_remove_msa_permission causes
+	 * hardware to hard crash on reboot
+	 */
+	return;
 	ath10k_qmi_remove_msa_permission(qmi);
 	ath10k_core_free_board_files(ar);
 	if (!test_bit(ATH10K_SNOC_FLAG_UNREGISTERING, &ar_snoc->flags))