Message ID | 20200602052533.15048-1-john.stultz@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Series | wireless: ath10k: Return early in ath10k_qmi_event_server_exit() to avoid hard crash on reboot |
+ Sibi

On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>
> Ever since 5.7-rc1, if we call
> ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> reboot, resulting in the device getting stuck in the usb crash
> debug mode and not coming back up without a hard power off.
>
> This hack avoids the issue by returning early in
> ath10k_qmi_event_server_exit().
>
> A better solution is very much desired!

Any chance you can bisect what caused this? There are a lot of
non-ath10k pieces involved in this stuff.

Brian

On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>
> + Sibi
>
> On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
> >
> > Ever since 5.7-rc1, if we call
> > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> > reboot, resulting in the device getting stuck in the usb crash
> > debug mode and not coming back up without a hard power off.
> >
> > This hack avoids the issue by returning early in
> > ath10k_qmi_event_server_exit().
> >
> > A better solution is very much desired!
>
> Any chance you can bisect what caused this? There are a lot of
> non-ath10k pieces involved in this stuff.

Amit had spent some work on chasing it down to the in kernel qrtr-ns
work, and reported it here:
https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html

But that discussion seemingly stalled out, so I came up with this hack
to workaround it for us.

thanks
-john

On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
> On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
> > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
> > >
> > > Ever since 5.7-rc1, if we call
> > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> > > reboot, resulting in the device getting stuck in the usb crash
> > > debug mode and not coming back up without a hard power off.
> > >
> > > This hack avoids the issue by returning early in
> > > ath10k_qmi_event_server_exit().
> > >
> > > A better solution is very much desired!
> >
> > Any chance you can bisect what caused this? There are a lot of
> > non-ath10k pieces involved in this stuff.
>
> Amit had spent some work on chasing it down to the in kernel qrtr-ns
> work, and reported it here:
> https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>
> But that discussion seemingly stalled out, so I came up with this hack
> to workaround it for us.

If I'm reading it right, then that means we should revert this stuff
from v5.7-rc1:

0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace

At least, until people can resolve the tail end of that thread. New
features (ath11k, etc.) are not a reason to break existing features
(ath10k/wcn3990).

Brian

On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
> > > >
> > > > Ever since 5.7-rc1, if we call
> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> > > > reboot, resulting in the device getting stuck in the usb crash
> > > > debug mode and not coming back up without a hard power off.
> > > >
> > > > This hack avoids the issue by returning early in
> > > > ath10k_qmi_event_server_exit().
> > > >
> > > > A better solution is very much desired!
> > >
> > > Any chance you can bisect what caused this? There are a lot of
> > > non-ath10k pieces involved in this stuff.
> >
> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
> > work, and reported it here:
> > https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
> >
> > But that discussion seemingly stalled out, so I came up with this hack
> > to workaround it for us.
>
> If I'm reading it right, then that means we should revert this stuff
> from v5.7-rc1:
>
> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
>
> At least, until people can resolve the tail end of that thread. New
> features (ath11k, etc.) are not a reason to break existing features
> (ath10k/wcn3990).

I don't agree with this. If you read through the replies to the bug report,
it is clear that NS migration uncovered a corner case or even a bug. So we
should try to fix that indeed.

Govind: Did you get chance to work on fixing this issue?

Thanks,
Mani

>
> Brian

Hi Mani,

On 2020-06-03 05:57, Manivannan Sadhasivam wrote:
> On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
>> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
>> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>> > > >
>> > > > Ever since 5.7-rc1, if we call
>> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
>> > > > reboot, resulting in the device getting stuck in the usb crash
>> > > > debug mode and not coming back up without a hard power off.
>> > > >
>> > > > This hack avoids the issue by returning early in
>> > > > ath10k_qmi_event_server_exit().
>> > > >
>> > > > A better solution is very much desired!
>> > >
>> > > Any chance you can bisect what caused this? There are a lot of
>> > > non-ath10k pieces involved in this stuff.
>> >
>> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
>> > work, and reported it here:
>> > https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>> >
>> > But that discussion seemingly stalled out, so I came up with this hack
>> > to workaround it for us.
>>
>> If I'm reading it right, then that means we should revert this stuff
>> from v5.7-rc1:
>>
>> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
>>
>> At least, until people can resolve the tail end of that thread. New
>> features (ath11k, etc.) are not a reason to break existing features
>> (ath10k/wcn3990).
>
> I don't agree with this. If you read through the replies to the bug report,
> it is clear that NS migration uncovered a corner case or even a bug. So we
> should try to fix that indeed.
>
> Govind: Did you get chance to work on fixing this issue?
>

I have done basic testing by moving msa map/unmap from qmi service
callbacks to init/de-init path.
I will send patch for review.
Reason for del_server needs to be investigated from rproc side.

> Thanks,
> Mani
>
>>
>> Brian

Thanks,
Govind

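The reordering Govind describes can be pictured with a small userspace model. This is only a sketch under stated assumptions: the event names, the two layouts and the `msa_model` struct are hypothetical and do not come from the driver, and the thread later reports that the actual patch did not resolve the crash. It shows only that tying map/unmap to init/deinit means a server-exit (del_server) event no longer triggers an unmap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lifecycle events; not the driver's real event types. */
enum msa_event { EV_INIT, EV_SERVER_ARRIVE, EV_SERVER_EXIT, EV_DEINIT };

/* Where the map/unmap calls live in each variant. */
enum msa_layout { MAP_IN_QMI_CALLBACKS, MAP_IN_INIT_DEINIT };

struct msa_model {
    int mapped;                /* 1 while the MSA region is mapped */
    int unmaps_on_server_exit; /* unmaps triggered by a server-exit event */
};

/* Apply one event to the model under the chosen layout. */
static void msa_handle_event(struct msa_model *m, enum msa_layout layout,
                             enum msa_event ev)
{
    switch (ev) {
    case EV_INIT:
        if (layout == MAP_IN_INIT_DEINIT)
            m->mapped = 1;
        break;
    case EV_SERVER_ARRIVE:
        if (layout == MAP_IN_QMI_CALLBACKS)
            m->mapped = 1;
        break;
    case EV_SERVER_EXIT:
        if (layout == MAP_IN_QMI_CALLBACKS && m->mapped) {
            m->mapped = 0;
            m->unmaps_on_server_exit++;
        }
        break;
    case EV_DEINIT:
        if (layout == MAP_IN_INIT_DEINIT)
            m->mapped = 0;
        break;
    }
}

/* Replay a sequence of events and return the final model state. */
static struct msa_model msa_replay(enum msa_layout layout,
                                   const enum msa_event *evs, size_t n)
{
    struct msa_model m = { 0, 0 };

    for (size_t i = 0; i < n; i++)
        msa_handle_event(&m, layout, evs[i]);
    assert(m.mapped == 0 || m.mapped == 1);
    return m;
}
```

Replaying init, server-arrive, server-exit with `MAP_IN_QMI_CALLBACKS` unmaps on the exit event, while `MAP_IN_INIT_DEINIT` leaves the region mapped until a deinit event arrives.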
On 2020-06-03 15:37, govinds@codeaurora.org wrote:
> Hi Mani,
>
> On 2020-06-03 05:57, Manivannan Sadhasivam wrote:
>> On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
>>> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
>>> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>>> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>>> > > >
>>> > > > Ever since 5.7-rc1, if we call
>>> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
>>> > > > reboot, resulting in the device getting stuck in the usb crash
>>> > > > debug mode and not coming back up without a hard power off.
>>> > > >
>>> > > > This hack avoids the issue by returning early in
>>> > > > ath10k_qmi_event_server_exit().
>>> > > >
>>> > > > A better solution is very much desired!
>>> > >
>>> > > Any chance you can bisect what caused this? There are a lot of
>>> > > non-ath10k pieces involved in this stuff.
>>> >
>>> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
>>> > work, and reported it here:
>>> > https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>>> >
>>> > But that discussion seemingly stalled out, so I came up with this hack
>>> > to workaround it for us.
>>>
>>> If I'm reading it right, then that means we should revert this stuff
>>> from v5.7-rc1:
>>>
>>> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
>>>
>>> At least, until people can resolve the tail end of that thread. New
>>> features (ath11k, etc.) are not a reason to break existing features
>>> (ath10k/wcn3990).
>>
>> I don't agree with this. If you read through the replies to the bug report,
>> it is clear that NS migration uncovered a corner case or even a bug. So we
>> should try to fix that indeed.
>>
>> Govind: Did you get chance to work on fixing this issue?
>>
>
> I have done basic testing by moving msa map/unmap from qmi service
> callbacks to init/de-init path.
> I will send patch for review.
> Reason for del_server needs to be investigated from rproc side.

Govind,
On receiving SIGTERM, rmtfs would try to perform a graceful shutdown
of the modem, that should be the source of the del_server.

>
>> Thanks,
>> Mani
>>
>>>
>>> Brian
>
> Thanks,
> Govind

John Stultz <john.stultz@linaro.org> writes:

> Ever since 5.7-rc1, if we call
> ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
> reboot, resulting in the device getting stuck in the usb crash
> debug mode and not coming back up without a hard power off.
>
> This hack avoids the issue by returning early in
> ath10k_qmi_event_server_exit().
>
> A better solution is very much desired!
>
> Feedback and suggestions welcome!
>
> Cc: Rakesh Pillai <pillair@qti.qualcomm.com>
> Cc: Govind Singh <govinds@codeaurora.org>
> Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
> Cc: Niklas Cassel <niklas.cassel@linaro.org>
> Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
> Cc: Amit Pundir <amit.pundir@linaro.org>
> Cc: Brian Norris <briannorris@chromium.org>
> Cc: Kalle Valo <kvalo@codeaurora.org>
> Cc: ath10k@lists.infradead.org
> Reported-by: Amit Pundir <amit.pundir@linaro.org>
> Signed-off-by: John Stultz <john.stultz@linaro.org>

Just so you know: as you didn't CC linux-wireless it's not on patchwork
and hence not on my radar. But hopefully we find a better solution to
fix this.

Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> writes:

> On Tue, Jun 02, 2020 at 01:04:26PM -0700, Brian Norris wrote:
>> On Tue, Jun 2, 2020 at 12:40 PM John Stultz <john.stultz@linaro.org> wrote:
>> > On Tue, Jun 2, 2020 at 12:16 PM Brian Norris <briannorris@chromium.org> wrote:
>> > > On Mon, Jun 1, 2020 at 10:25 PM John Stultz <john.stultz@linaro.org> wrote:
>> > > >
>> > > > Ever since 5.7-rc1, if we call
>> > > > ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
>> > > > reboot, resulting in the device getting stuck in the usb crash
>> > > > debug mode and not coming back up without a hard power off.
>> > > >
>> > > > This hack avoids the issue by returning early in
>> > > > ath10k_qmi_event_server_exit().
>> > > >
>> > > > A better solution is very much desired!
>> > >
>> > > Any chance you can bisect what caused this? There are a lot of
>> > > non-ath10k pieces involved in this stuff.
>> >
>> > Amit had spent some work on chasing it down to the in kernel qrtr-ns
>> > work, and reported it here:
>> > https://lists.infradead.org/pipermail/ath10k/2020-April/014970.html
>> >
>> > But that discussion seemingly stalled out, so I came up with this hack
>> > to workaround it for us.
>>
>> If I'm reading it right, then that means we should revert this stuff
>> from v5.7-rc1:
>>
>> 0c2204a4ad71 net: qrtr: Migrate nameservice to kernel from userspace
>>
>> At least, until people can resolve the tail end of that thread. New
>> features (ath11k, etc.) are not a reason to break existing features
>> (ath10k/wcn3990).
>
> I don't agree with this. If you read through the replies to the bug report,
> it is clear that NS migration uncovered a corner case or even a bug. So we
> should try to fix that indeed.

I'm with Mani, we should try to fix ath10k instead. Hopefully we can
find a fix soon.

Forcing QCA6390 users to use the userspace qrtr-ns would be bad user
experience, I really would want to avoid that.

On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
> > I don't agree with this. If you read through the replies to the bug report,
> > it is clear that NS migration uncovered a corner case or even a bug. So we
> > should try to fix that indeed.
>
> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
> find a fix soon.

Hi Team,

Any updates on this? I can reproduce this hard crash on v5.9-rc1 as well.

It is not a blocker for us because we switched to a userspace
workaround, where we do not wait for modem to shutdown gracefully and
SIGKILL it instead, during the shutdown/reboot process. But I'm happy
to take a swing at any intermediate/in-progress solution available.

Regards,
Amit Pundir

>
> Forcing QCA6390 users to use the userspace qrtr-ns would be bad user
> experience, I really would want to avoid that.
>
> --
> https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches

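The difference between the two shutdown paths mentioned in the thread (SIGTERM triggering a graceful shutdown, versus the SIGKILL workaround) can be sketched with plain POSIX calls. The child process below is a hypothetical stand-in for a daemon such as rmtfs, and `stop_daemon_with()` and `GRACEFUL_EXIT_CODE` are names invented for this sketch; it does not reproduce rmtfs or any init configuration. It shows only that a SIGTERM handler gets a chance to run teardown code, while SIGKILL ends the process before any handler runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum stop_result { STOP_ERROR = -1, STOP_KILLED = 0, STOP_GRACEFUL = 1 };

#define GRACEFUL_EXIT_CODE 7 /* arbitrary marker chosen for this sketch */

static volatile sig_atomic_t term_requested;

static void on_sigterm(int signo)
{
    (void)signo;
    term_requested = 1;
}

/* Child body: hypothetical stand-in for the daemon; never returns. */
static void daemon_standin(int ready_fd)
{
    sigset_t block, waitmask;
    struct sigaction sa;

    /* Block SIGTERM so it is only delivered inside sigsuspend(). */
    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    sigprocmask(SIG_BLOCK, &block, &waitmask);
    sigdelset(&waitmask, SIGTERM);

    sa.sa_handler = on_sigterm;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGTERM, &sa, NULL);

    /* Tell the parent that the handler is installed. */
    if (write(ready_fd, "r", 1) != 1)
        _exit(1);
    close(ready_fd);

    while (!term_requested)
        sigsuspend(&waitmask);

    /* Graceful path: a real daemon would stop the modem here. */
    _exit(GRACEFUL_EXIT_CODE);
}

/* Fork the stand-in, send it signo, and report how it ended. */
static enum stop_result stop_daemon_with(int signo)
{
    int ready[2], status, started;
    char byte;
    pid_t pid;

    if (pipe(ready) != 0)
        return STOP_ERROR;
    pid = fork();
    if (pid < 0) {
        close(ready[0]);
        close(ready[1]);
        return STOP_ERROR;
    }
    if (pid == 0) {
        close(ready[0]);
        daemon_standin(ready[1]);
    }

    close(ready[1]);
    started = (read(ready[0], &byte, 1) == 1);
    close(ready[0]);

    kill(pid, started ? signo : SIGKILL);
    if (waitpid(pid, &status, 0) != pid || !started)
        return STOP_ERROR;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return STOP_KILLED;
    if (WIFEXITED(status) && WEXITSTATUS(status) == GRACEFUL_EXIT_CODE)
        return STOP_GRACEFUL;
    return STOP_ERROR;
}
```

Calling `stop_daemon_with(SIGTERM)` takes the graceful path, and `stop_daemon_with(SIGKILL)` skips it. The workaround relies on the second case: the daemon never gets to start its own teardown.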
Amit Pundir <amit.pundir@linaro.org> writes:

> On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
>> > I don't agree with this. If you read through the replies to the bug report,
>> > it is clear that NS migration uncovered a corner case or even a bug. So we
>> > should try to fix that indeed.
>>
>> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
>> find a fix soon.
>
> Hi Team,
>
> Any updates on this? I can reproduce this hard crash on v5.9-rc1 as well.
>
> It is not a blocker for us because we switched to a userspace
> workaround, where we do not wait for modem to shutdown gracefully and
> SIGKILL it instead, during the shutdown/reboot process. But I'm happy
> to take a swing at any intermediate/in-progress solution available.

Govind submitted this patch and later he asked to drop it, but I think
it would be a good idea to test it anyway:

ath10k: Move msa region map/unmap to init/deinit path

https://lkml.kernel.org/r/1591191231-31917-1-git-send-email-govinds@codeaurora.org

(patchwork is down so I cannot give a patchwork link)

Hi Kalle,

On 2020-08-28 18:22, Kalle Valo wrote:
> Amit Pundir <amit.pundir@linaro.org> writes:
>
>> On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
>>> > I don't agree with this. If you read through the replies to the bug report,
>>> > it is clear that NS migration uncovered a corner case or even a bug. So we
>>> > should try to fix that indeed.
>>>
>>> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
>>> find a fix soon.
>>
>> Hi Team,
>>
>> Any updates on this? I can reproduce this hard crash on v5.9-rc1 as
>> well.
>>
>> It is not a blocker for us because we switched to a userspace
>> workaround, where we do not wait for modem to shutdown gracefully and
>> SIGKILL it instead, during the shutdown/reboot process. But I'm happy
>> to take a swing at any intermediate/in-progress solution available.
>
> Govind submitted this patch and later he asked to drop it, but I think
> it would be a good idea to test it anyway:
>
> ath10k: Move msa region map/unmap to init/deinit path
>
> https://lkml.kernel.org/r/1591191231-31917-1-git-send-email-govinds@codeaurora.org
>
> (patchwork is down so I cannot give a patchwork link)

This patchwork is not fixing the issue and changing MSA mapping
sequence is major design change. This issue is only seen with DB845
which uses SCM call, newer targets QCS404/SC7180/SM8150 will not have
this issue as MSA mapping is hard-coded in TZ. Probably changes in qmi
layer to give different indication for this scenario and changes in FW
is required to mitigate this issue gracefully.

BR,
Govind

Govind Singh <govinds@codeaurora.org> writes:

> On 2020-08-28 18:22, Kalle Valo wrote:
>> Amit Pundir <amit.pundir@linaro.org> writes:
>>
>>> On Mon, 8 Jun 2020 at 17:07, Kalle Valo <kvalo@codeaurora.org> wrote:
>>>> > I don't agree with this. If you read through the replies to the bug report,
>>>> > it is clear that NS migration uncovered a corner case or even a bug. So we
>>>> > should try to fix that indeed.
>>>>
>>>> I'm with Mani, we should try to fix ath10k instead. Hopefully we can
>>>> find a fix soon.
>>>
>>> Hi Team,
>>>
>>> Any updates on this? I can reproduce this hard crash on v5.9-rc1 as
>>> well.
>>>
>>> It is not a blocker for us because we switched to a userspace
>>> workaround, where we do not wait for modem to shutdown gracefully and
>>> SIGKILL it instead, during the shutdown/reboot process. But I'm happy
>>> to take a swing at any intermediate/in-progress solution available.
>>
>> Govind submitted this patch and later he asked to drop it, but I think
>> it would be a good idea to test it anyway:
>>
>> ath10k: Move msa region map/unmap to init/deinit path
>>
>> https://lkml.kernel.org/r/1591191231-31917-1-git-send-email-govinds@codeaurora.org
>>
>> (patchwork is down so I cannot give a patchwork link)
>
> This patchwork is not fixing the issue and changing MSA mapping
> sequence is major design change. This issue is only seen with DB845
> which uses SCM call, newer targets QCS404/SC7180/SM8150 will not have
> this issue as MSA mapping is hard-coded in TZ. Probably changes in qmi
> layer to give different indication for this scenario and changes in FW
> is required to mitigate this issue gracefully.

Oh, bad news :/ Can anyone look at that in detail? Even a quick hack
patch would get this forward.

diff --git a/drivers/net/wireless/ath/ath10k/qmi.c b/drivers/net/wireless/ath/ath10k/qmi.c
index 85dce43c5439..ab38562ce1cb 100644
--- a/drivers/net/wireless/ath/ath10k/qmi.c
+++ b/drivers/net/wireless/ath/ath10k/qmi.c
@@ -854,6 +854,11 @@ static void ath10k_qmi_event_server_exit(struct ath10k_qmi *qmi)
 	struct ath10k *ar = qmi->ar;
 	struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
 
+	/*
+	 * HACK: Calling ath10k_qmi_remove_msa_permission causes
+	 * hardware to hard crash on reboot
+	 */
+	return;
 	ath10k_qmi_remove_msa_permission(qmi);
 	ath10k_core_free_board_files(ar);
 	if (!test_bit(ATH10K_SNOC_FLAG_UNREGISTERING, &ar_snoc->flags))

Ever since 5.7-rc1, if we call
ath10k_qmi_remove_msa_permission(), the db845c hard crashes on
reboot, resulting in the device getting stuck in the usb crash
debug mode and not coming back up without a hard power off.

This hack avoids the issue by returning early in
ath10k_qmi_event_server_exit().

A better solution is very much desired!

Feedback and suggestions welcome!

Cc: Rakesh Pillai <pillair@qti.qualcomm.com>
Cc: Govind Singh <govinds@codeaurora.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Niklas Cassel <niklas.cassel@linaro.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Cc: Amit Pundir <amit.pundir@linaro.org>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: ath10k@lists.infradead.org
Reported-by: Amit Pundir <amit.pundir@linaro.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
---
 drivers/net/wireless/ath/ath10k/qmi.c | 5 +++++
 1 file changed, 5 insertions(+)
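
One consequence of the hack is easy to miss: a bare `return` at that position skips every statement after it in ath10k_qmi_event_server_exit(), not only the call to ath10k_qmi_remove_msa_permission(). The userspace model below is a minimal sketch of that control flow. `teardown_log`, `model_server_exit()` and the counters are hypothetical names for illustration only, and because the hunk does not show the body of the `if`, it is represented here by a placeholder counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Counts which teardown steps ran; all names here are hypothetical. */
struct teardown_log {
    int msa_permission_removed;
    int board_files_freed;
    int guarded_step_ran;
};

/*
 * Userspace model of the patched function's control flow.
 * hack_applied models the early return added by the patch;
 * unregistering models the ATH10K_SNOC_FLAG_UNREGISTERING test.
 */
static void model_server_exit(struct teardown_log *log, bool hack_applied,
                              bool unregistering)
{
    if (hack_applied)
        return; /* mirrors the bare return: nothing below runs */

    log->msa_permission_removed++; /* ath10k_qmi_remove_msa_permission() */
    log->board_files_freed++;      /* ath10k_core_free_board_files() */
    if (!unregistering)
        log->guarded_step_ran++;   /* body of the if; not shown in the hunk */
}
```

With `hack_applied` set, all three counters stay at zero, so in this model the board-file cleanup and the guarded step are skipped along with the MSA permission removal.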