Message ID | 20230324063357.1.Ifdf3625a3c5c9467bd87bfcdf726c884ad220a35@changeid (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
Series | regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS" | expand |
On Fri, Mar 24, 2023 at 06:34:06AM -0700, Douglas Anderson wrote: > This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > prompted the us to switch to synchronous probe showed that the root > cause was a missing "rootwait" in the kernel command line > arguments. Let's reinstate asynchronous probe. > > Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS") > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Signed-off-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Andrew Halaney <ahalaney@redhat.com> Thanks!
On Fri, 24 Mar 2023 06:34:06 -0700, Douglas Anderson wrote: > This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > prompted the us to switch to synchronous probe showed that the root > cause was a missing "rootwait" in the kernel command line > arguments. Let's reinstate asynchronous probe. > > > [...] Applied to broonie/regulator.git for-next Thanks! [1/1] regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS" commit: ad44ac082fdff7ee57fe125432f7d9d7cb610a23 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > > This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > prompted the us to switch to synchronous probe showed that the root > cause was a missing "rootwait" in the kernel command line > arguments. Let's reinstate asynchronous probe. Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) running AOSP (Android Open Source Project) with v6.4-rc1 https://bugs.linaro.org/show_bug.cgi?id=5975. Can we please go back to synchronous probe. AOSP do not make use of rootwait, IIRC, but it is added by the bootloader anyway. And the device fails to boot AOSP regardless of "rootwait" bootarg being present or not. FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the same set of AOSP images. Regards, Amit Pundir > > Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS") > Cc: Marek Szyprowski <m.szyprowski@samsung.com> > Signed-off-by: Douglas Anderson <dianders@chromium.org> > --- > > drivers/regulator/qcom-rpmh-regulator.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c > index 903032b2875f..4826d60e5d95 100644 > --- a/drivers/regulator/qcom-rpmh-regulator.c > +++ b/drivers/regulator/qcom-rpmh-regulator.c > @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table); > static struct platform_driver rpmh_regulator_driver = { > .driver = { > .name = "qcom-rpmh-regulator", > - .probe_type = PROBE_FORCE_SYNCHRONOUS, > + .probe_type = PROBE_PREFER_ASYNCHRONOUS, > .of_match_table = of_match_ptr(rpmh_regulator_match_table), > }, > .probe = rpmh_regulator_probe, > -- > 2.40.0.348.gf938b09366-goog >
[CCing the regression list, as it should be in the loop for regressions: https://docs.kernel.org/admin-guide/reporting-regressions.html] [TLDR: I'm adding this report to the list of tracked Linux kernel regressions; the text you find below is based on a few templates paragraphs you might have encountered already in similar form. See link in footer if these mails annoy you.] On 13.05.23 20:08, Amit Pundir wrote: > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: >> >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that >> prompted the us to switch to synchronous probe showed that the root >> cause was a missing "rootwait" in the kernel command line >> arguments. Let's reinstate asynchronous probe. > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > running AOSP (Android Open Source Project) with v6.4-rc1 > https://bugs.linaro.org/show_bug.cgi?id=5975. > Can we please go back to synchronous probe. > > AOSP do not make use of rootwait, IIRC, but it is added by the > bootloader anyway. And the device fails to boot AOSP regardless of > "rootwait" bootarg being present or not. > > FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the > same set of AOSP images. Thanks for the report. To be sure the issue doesn't fall through the cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression tracking bot: #regzbot ^introduced ad44ac082fd #regzbot title regulator: qcom-rpmh: Dragonboard 845c broken due to asynchronous probe #regzbot ignore-activity This isn't a regression? This issue or a fix for it are already discussed somewhere else? It was fixed already? You want to clarify when the regression started to happen? Or point out I got the title or something else totally wrong? Then just reply and tell me -- ideally while also telling regzbot about it, as explained by the page listed in the footer of this mail. Developers: When fixing the issue, remember to add 'Link:' tags pointing to the report (the parent of this mail). See page linked in footer for details. Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. >> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS") >> Cc: Marek Szyprowski <m.szyprowski@samsung.com> >> Signed-off-by: Douglas Anderson <dianders@chromium.org> >> --- >> >> drivers/regulator/qcom-rpmh-regulator.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c >> index 903032b2875f..4826d60e5d95 100644 >> --- a/drivers/regulator/qcom-rpmh-regulator.c >> +++ b/drivers/regulator/qcom-rpmh-regulator.c >> @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table); >> static struct platform_driver rpmh_regulator_driver = { >> .driver = { >> .name = "qcom-rpmh-regulator", >> - .probe_type = PROBE_FORCE_SYNCHRONOUS, >> + .probe_type = PROBE_PREFER_ASYNCHRONOUS, >> .of_match_table = of_match_ptr(rpmh_regulator_match_table), >> }, >> .probe = rpmh_regulator_probe, >> -- >> 2.40.0.348.gf938b09366-goog >>
On 13/05/2023 18:08, Amit Pundir wrote: > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: >> >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that >> prompted the us to switch to synchronous probe showed that the root >> cause was a missing "rootwait" in the kernel command line >> arguments. Let's reinstate asynchronous probe. > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > running AOSP (Android Open Source Project) with v6.4-rc1 > https://bugs.linaro.org/show_bug.cgi?id=5975. > Can we please go back to synchronous probe. > > AOSP do not make use of rootwait, IIRC, but it is added by the > bootloader anyway. And the device fails to boot AOSP regardless of > "rootwait" bootarg being present or not. Could you try applying this diff to enable some log spam and let me know what you get? I'm keen to try and figure this one out. My mail client might crunch this a bit so I have pasted it here too https://p.calebs.dev/ab74b7@raw diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c index f93544f6d796..67859f1bdb28 100644 --- a/drivers/soc/qcom/rpmh-rsc.c +++ b/drivers/soc/qcom/rpmh-rsc.c @@ -653,11 +653,23 @@ int rpmh_rsc_send_data(struct rsc_drv *drv, const struct tcs_request *msg) spin_lock_irqsave(&drv->lock, flags); + dev_info(drv->dev, "%s: %p tcs->type=%d state=%d, " + "wait_for_compl=%d, num_cmds=%d\n", + __func__, msg, tcs->type, msg->state, + msg->wait_for_compl, msg->num_cmds); + for (int i = 0; i < msg->num_cmds; i++) + dev_info(drv->dev, "%s: %p cmd[%d] " + "addr=0x%x data=0x%x\n", __func__, + msg, i, msg->cmds[i].addr, msg->cmds[i].data); + /* Wait forever for a free tcs. It better be there eventually! */ wait_event_lock_irq(drv->tcs_wait, (tcs_id = claim_tcs_for_req(drv, tcs, msg)) >= 0, drv->lock); + dev_info(drv->dev, "%s: %px GOT TCS! %d\n", + __func__, msg, tcs_id); + tcs->req[tcs_id - tcs->offset] = msg; set_bit(tcs_id, drv->tcs_in_use); if (msg->state == RPMH_ACTIVE_ONLY_STATE && tcs->type != ACTIVE_TCS) { > > FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the > same set of AOSP images. > > Regards, > Amit Pundir > > > > >> >> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS") >> Cc: Marek Szyprowski <m.szyprowski@samsung.com> >> Signed-off-by: Douglas Anderson <dianders@chromium.org> >> --- >> >> drivers/regulator/qcom-rpmh-regulator.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c >> index 903032b2875f..4826d60e5d95 100644 >> --- a/drivers/regulator/qcom-rpmh-regulator.c >> +++ b/drivers/regulator/qcom-rpmh-regulator.c >> @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table); >> static struct platform_driver rpmh_regulator_driver = { >> .driver = { >> .name = "qcom-rpmh-regulator", >> - .probe_type = PROBE_FORCE_SYNCHRONOUS, >> + .probe_type = PROBE_PREFER_ASYNCHRONOUS, >> .of_match_table = of_match_ptr(rpmh_regulator_match_table), >> }, >> .probe = rpmh_regulator_probe, >> -- >> 2.40.0.348.gf938b09366-goog >>
On Sat, May 13, 2023 at 11:38:22PM +0530, Amit Pundir wrote: > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > running AOSP (Android Open Source Project) with v6.4-rc1 > https://bugs.linaro.org/show_bug.cgi?id=5975. > Can we please go back to synchronous probe. Please submit a patch...
On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: > > > > On 13/05/2023 18:08, Amit Pundir wrote: > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > >> > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > >> prompted the us to switch to synchronous probe showed that the root > >> cause was a missing "rootwait" in the kernel command line > >> arguments. Let's reinstate asynchronous probe. > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > > running AOSP (Android Open Source Project) with v6.4-rc1 > > https://bugs.linaro.org/show_bug.cgi?id=5975. > > Can we please go back to synchronous probe. > > > > AOSP do not make use of rootwait, IIRC, but it is added by the > > bootloader anyway. And the device fails to boot AOSP regardless of > > "rootwait" bootarg being present or not. > > Could you try applying this diff to enable some log spam and let me know > what you get? I'm keen to try and figure this one out. My mail client > might crunch this a bit so I have pasted it here too > https://p.calebs.dev/ab74b7@raw These prints add just enough delay for the UFS probe to succeed that I can't reproduce the failure anymore. > > diff --git a/drivers/soc/qcom/rpmh-rsc.c b/drivers/soc/qcom/rpmh-rsc.c > index f93544f6d796..67859f1bdb28 100644 > --- a/drivers/soc/qcom/rpmh-rsc.c > +++ b/drivers/soc/qcom/rpmh-rsc.c > @@ -653,11 +653,23 @@ int rpmh_rsc_send_data(struct rsc_drv *drv, const > struct tcs_request *msg) > > spin_lock_irqsave(&drv->lock, flags); > > + dev_info(drv->dev, "%s: %p tcs->type=%d state=%d, " > + "wait_for_compl=%d, num_cmds=%d\n", > + __func__, msg, tcs->type, msg->state, > + msg->wait_for_compl, msg->num_cmds); > + for (int i = 0; i < msg->num_cmds; i++) > + dev_info(drv->dev, "%s: %p cmd[%d] " > + "addr=0x%x data=0x%x\n", __func__, > + msg, i, msg->cmds[i].addr, msg->cmds[i].data); > + > /* Wait forever for a free tcs. It better be there eventually! */ > wait_event_lock_irq(drv->tcs_wait, > (tcs_id = claim_tcs_for_req(drv, tcs, msg)) > >= 0, > drv->lock); > > + dev_info(drv->dev, "%s: %px GOT TCS! %d\n", > + __func__, msg, tcs_id); > + > tcs->req[tcs_id - tcs->offset] = msg; > set_bit(tcs_id, drv->tcs_in_use); > if (msg->state == RPMH_ACTIVE_ONLY_STATE && tcs->type != > ACTIVE_TCS) { > > > > > FWIW I do not see this regression on RB5 (QRB5165/SM8250) running the > > same set of AOSP images. > > > > Regards, > > Amit Pundir > > > > > > > > > >> > >> Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS") > >> Cc: Marek Szyprowski <m.szyprowski@samsung.com> > >> Signed-off-by: Douglas Anderson <dianders@chromium.org> > >> --- > >> > >> drivers/regulator/qcom-rpmh-regulator.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c > >> index 903032b2875f..4826d60e5d95 100644 > >> --- a/drivers/regulator/qcom-rpmh-regulator.c > >> +++ b/drivers/regulator/qcom-rpmh-regulator.c > >> @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table); > >> static struct platform_driver rpmh_regulator_driver = { > >> .driver = { > >> .name = "qcom-rpmh-regulator", > >> - .probe_type = PROBE_FORCE_SYNCHRONOUS, > >> + .probe_type = PROBE_PREFER_ASYNCHRONOUS, > >> .of_match_table = of_match_ptr(rpmh_regulator_match_table), > >> }, > >> .probe = rpmh_regulator_probe, > >> -- > >> 2.40.0.348.gf938b09366-goog > >> > > -- > Kind Regards, > Caleb (they/them)
Hi, On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: > > > > > > > > On 13/05/2023 18:08, Amit Pundir wrote: > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > > >> > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > > >> prompted the us to switch to synchronous probe showed that the root > > >> cause was a missing "rootwait" in the kernel command line > > >> arguments. Let's reinstate asynchronous probe. > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > > > running AOSP (Android Open Source Project) with v6.4-rc1 > > > https://bugs.linaro.org/show_bug.cgi?id=5975. > > > Can we please go back to synchronous probe. > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the > > > bootloader anyway. And the device fails to boot AOSP regardless of > > > "rootwait" bootarg being present or not. > > > > Could you try applying this diff to enable some log spam and let me know > > what you get? I'm keen to try and figure this one out. My mail client > > might crunch this a bit so I have pasted it here too > > https://p.calebs.dev/ab74b7@raw > > These prints add just enough delay for the UFS probe to succeed that I > can't reproduce the failure anymore. I'd prefer doing at least a little debugging before jumping to a revert. From looking at your dmesg [1], it looks as if the async probe is allowing RPMH to probe at the same time as "qcom-vadc-common". That's something that talks on the SPMI bus and is (potentially) talking to the same PMICs that RPMH-regulator is, right? I'm by no means an expert on how Qualcomm's PMICs work, but it seems plausible that the "qcom-vadc-common" is somehow causing problems and screwing up RPMH. Does that seem plausible to you? If so, one interesting way to track it down would be to move around delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that _won't_ fix the problem. Now put a ~500ms sleep at the beginning of vadc_probe(). Maybe that will fix the problem? If so, you can move the delay around to narrow down the conflict. My wild guess would be that vadc_reset() could be throwing things for a loop? If the above doesn't work, maybe we could add more tracing / printouts to see what is probing at the same time as RPMH? [1] https://bugs.linaro.org/attachment.cgi?id=1135
On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: > > > > > > > > > > > > On 13/05/2023 18:08, Amit Pundir wrote: > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > > > >> > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > > > >> prompted the us to switch to synchronous probe showed that the root > > > >> cause was a missing "rootwait" in the kernel command line > > > >> arguments. Let's reinstate asynchronous probe. > > > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > > > > running AOSP (Android Open Source Project) with v6.4-rc1 > > > > https://bugs.linaro.org/show_bug.cgi?id=5975. > > > > Can we please go back to synchronous probe. > > > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the > > > > bootloader anyway. And the device fails to boot AOSP regardless of > > > > "rootwait" bootarg being present or not. > > > > > > Could you try applying this diff to enable some log spam and let me know > > > what you get? I'm keen to try and figure this one out. My mail client > > > might crunch this a bit so I have pasted it here too > > > https://p.calebs.dev/ab74b7@raw > > > > These prints add just enough delay for the UFS probe to succeed that I > > can't reproduce the failure anymore. > > I'd prefer doing at least a little debugging before jumping to a > revert. From looking at your dmesg [1], it looks as if the async probe > is allowing RPMH to probe at the same time as "qcom-vadc-common". > That's something that talks on the SPMI bus and is (potentially) > talking to the same PMICs that RPMH-regulator is, right? I'm by no > means an expert on how Qualcomm's PMICs work, but it seems plausible > that the "qcom-vadc-common" is somehow causing problems and screwing > up RPMH. Does that seem plausible to you? > > If so, one interesting way to track it down would be to move around > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of > vadc_probe(). Maybe that will fix the problem? If so, you can move the > delay around to narrow down the conflict. My wild guess would be that > vadc_reset() could be throwing things for a loop? > > If the above doesn't work, maybe we could add more tracing / printouts > to see what is probing at the same time as RPMH? Tried out a few changes today but none of them worked or were effective enough to debug this crash further, other than setting fw_devlink=permissive. Adding more tracing / prints (BOOTTIME_TRACING and FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the crash either. They added just enough delay to boot the device successfully everytime. I tried to reason with the kernel modules which gets loaded before and after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC, SPMI_PMIC* etc) as suggested, but I run into the same crash even if I disable those driver modules. So I don't think that the other driver modules which gets loaded at around the same time as qcom-rpmh-regulator by default have any impact on this failure. The only way I can boot successfully everytime is if I boot with fw_devlink=permissive bootarg. So I'll have to check if there is any new dependency which got added recently in DT or somewhere else that is causing this breakage. Regards, Amit Pundir > > > [1] https://bugs.linaro.org/attachment.cgi?id=1135
Hi, On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: > > > > > > > > > > > > > > > > On 13/05/2023 18:08, Amit Pundir wrote: > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > > > > >> > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > > > > >> prompted the us to switch to synchronous probe showed that the root > > > > >> cause was a missing "rootwait" in the kernel command line > > > > >> arguments. Let's reinstate asynchronous probe. > > > > > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > > > > > running AOSP (Android Open Source Project) with v6.4-rc1 > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975. > > > > > Can we please go back to synchronous probe. > > > > > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the > > > > > bootloader anyway. And the device fails to boot AOSP regardless of > > > > > "rootwait" bootarg being present or not. > > > > > > > > Could you try applying this diff to enable some log spam and let me know > > > > what you get? I'm keen to try and figure this one out. My mail client > > > > might crunch this a bit so I have pasted it here too > > > > https://p.calebs.dev/ab74b7@raw > > > > > > These prints add just enough delay for the UFS probe to succeed that I > > > can't reproduce the failure anymore. > > > > I'd prefer doing at least a little debugging before jumping to a > > revert. From looking at your dmesg [1], it looks as if the async probe > > is allowing RPMH to probe at the same time as "qcom-vadc-common". > > That's something that talks on the SPMI bus and is (potentially) > > talking to the same PMICs that RPMH-regulator is, right? I'm by no > > means an expert on how Qualcomm's PMICs work, but it seems plausible > > that the "qcom-vadc-common" is somehow causing problems and screwing > > up RPMH. Does that seem plausible to you? > > > > If so, one interesting way to track it down would be to move around > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of > > vadc_probe(). Maybe that will fix the problem? If so, you can move the > > delay around to narrow down the conflict. My wild guess would be that > > vadc_reset() could be throwing things for a loop? > > > > If the above doesn't work, maybe we could add more tracing / printouts > > to see what is probing at the same time as RPMH? > > Tried out a few changes today but none of them worked or were > effective enough to debug this crash further, other than setting > fw_devlink=permissive. > > Adding more tracing / prints (BOOTTIME_TRACING and > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the > crash either. They added just enough delay to boot the device > successfully everytime. > > I tried to reason with the kernel modules which gets loaded before and > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC, > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I > disable those driver modules. So I don't think that the other driver > modules which gets loaded at around the same time as > qcom-rpmh-regulator by default have any impact on this failure. Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty much anything you can do to change the timing fixes you. That does make reverting the async probe of the regulator less appealing. If, as you said, it's not just some other driver loading at the same time that's interfering then the revert "fixes" you in the same way that a "msleep" would fix you. That doesn't seem like enough of a justification for the revert to me. It still feels to me like _something_ is happening at the same time as the RPMH regulator driver is loading, though, I'm just not sure how to suggest debugging it. I guess other thoughts: * When RPMH complains, is it always with the same regulator (lvs1), or sometimes different ones? Any clue there? * How much can you control module loading order? If rpmh-regulator module loads first, does it "fix" things? If it does, maybe you could bisect to find the place where problems start cropping up. Does that give any clues? > The only way I can boot successfully everytime is if I boot with > fw_devlink=permissive bootarg. So I'll have to check if there is any > new dependency which got added recently in DT or somewhere else that > is causing this breakage. I guess I'll assume that `fw_devlink=permissive` only fixes you because it changes the timing...
On Tue, May 16, 2023 at 02:24:06PM -0700, Doug Anderson wrote: > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > Tried out a few changes today but none of them worked or were > > effective enough to debug this crash further, other than setting > > fw_devlink=permissive. > It still feels to me like _something_ is happening at the same time as > the RPMH regulator driver is loading, though, I'm just not sure how to > suggest debugging it. I guess other thoughts: This discussion seems to have ground to a halt with no resolution so it looks like the best option here is to apply the revert unless there's some progress happened off list?
On Wed, 31 May 2023 at 18:00, Mark Brown <broonie@kernel.org> wrote: > > On Tue, May 16, 2023 at 02:24:06PM -0700, Doug Anderson wrote: > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > Tried out a few changes today but none of them worked or were > > > effective enough to debug this crash further, other than setting > > > fw_devlink=permissive. > > > It still feels to me like _something_ is happening at the same time as > > the RPMH regulator driver is loading, though, I'm just not sure how to > > suggest debugging it. I guess other thoughts: > > This discussion seems to have ground to a halt with no resolution so it > looks like the best option here is to apply the revert unless there's > some progress happened off list? Sorry about that. I got stuck at other things. I'll get back to it this week. I'll try to change the module loading order and test run it to check if that helps. Regards, Amit Pundir
On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote: > > > > > > Hi, > > > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: > > > > > > > > > > > > > > > > > > > > On 13/05/2023 18:08, Amit Pundir wrote: > > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > > > > > >> > > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > > > > > >> prompted the us to switch to synchronous probe showed that the root > > > > > >> cause was a missing "rootwait" in the kernel command line > > > > > >> arguments. Let's reinstate asynchronous probe. > > > > > > > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > > > > > > running AOSP (Android Open Source Project) with v6.4-rc1 > > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975. > > > > > > Can we please go back to synchronous probe. > > > > > > > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the > > > > > > bootloader anyway. And the device fails to boot AOSP regardless of > > > > > > "rootwait" bootarg being present or not. > > > > > > > > > > Could you try applying this diff to enable some log spam and let me know > > > > > what you get? I'm keen to try and figure this one out. My mail client > > > > > might crunch this a bit so I have pasted it here too > > > > > https://p.calebs.dev/ab74b7@raw > > > > > > > > These prints add just enough delay for the UFS probe to succeed that I > > > > can't reproduce the failure anymore. > > > > > > I'd prefer doing at least a little debugging before jumping to a > > > revert. From looking at your dmesg [1], it looks as if the async probe > > > is allowing RPMH to probe at the same time as "qcom-vadc-common". > > > That's something that talks on the SPMI bus and is (potentially) > > > talking to the same PMICs that RPMH-regulator is, right? I'm by no > > > means an expert on how Qualcomm's PMICs work, but it seems plausible > > > that the "qcom-vadc-common" is somehow causing problems and screwing > > > up RPMH. Does that seem plausible to you? > > > > > > If so, one interesting way to track it down would be to move around > > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that > > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of > > > vadc_probe(). Maybe that will fix the problem? If so, you can move the > > > delay around to narrow down the conflict. My wild guess would be that > > > vadc_reset() could be throwing things for a loop? > > > > > > If the above doesn't work, maybe we could add more tracing / printouts > > > to see what is probing at the same time as RPMH? > > > > Tried out a few changes today but none of them worked or were > > effective enough to debug this crash further, other than setting > > fw_devlink=permissive. > > > > Adding more tracing / prints (BOOTTIME_TRACING and > > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the > > crash either. They added just enough delay to boot the device > > successfully everytime. > > > > I tried to reason with the kernel modules which gets loaded before and > > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC, > > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I > > disable those driver modules. So I don't think that the other driver > > modules which gets loaded at around the same time as > > qcom-rpmh-regulator by default have any impact on this failure. > > Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty > much anything you can do to change the timing fixes you. That does > make reverting the async probe of the regulator less appealing. If, as > you said, it's not just some other driver loading at the same time > that's interfering then the revert "fixes" you in the same way that a > "msleep" would fix you. That doesn't seem like enough of a > justification for the revert to me. > > It still feels to me like _something_ is happening at the same time as > the RPMH regulator driver is loading, though, I'm just not sure how to > suggest debugging it. I guess other thoughts: > > * When RPMH complains, is it always with the same regulator (lvs1), or > sometimes different ones? Any clue there? It is always either lvs1 or lvs2. > > * How much can you control module loading order? If rpmh-regulator > module loads first, does it "fix" things? If it does, maybe you could > bisect to find the place where problems start cropping up. Does that > give any clues? Loading qcom-rpmh-regulator first does make it difficult to reproduce the crash. I tried a few combinations to narrow down the issue further and in one case, I managed to reproduce the crash (1 in 20+ reboots) while loading the qcom-rpmh-regulator (and the dependent cmd-db, qcom_rpmh) module alone https://bugs.linaro.org/attachment.cgi?id=1140. Regards, Amit Pundir > > > > The only way I can boot successfully everytime is if I boot with > > fw_devlink=permissive bootarg. So I'll have to check if there is any > > new dependency which got added recently in DT or somewhere else that > > is causing this breakage. > > I guess I'll assume that `fw_devlink=permissive` only fixes you > because it changes the timing...
Hi, On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@linaro.org> wrote: > > On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote: > > > > > > > > Hi, > > > > > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 13/05/2023 18:08, Amit Pundir wrote: > > > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > > > > > > >> > > > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > > > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > > > > > > >> prompted the us to switch to synchronous probe showed that the root > > > > > > >> cause was a missing "rootwait" in the kernel command line > > > > > > >> arguments. Let's reinstate asynchronous probe. > > > > > > > > > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > > > > > > > running AOSP (Android Open Source Project) with v6.4-rc1 > > > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975. > > > > > > > Can we please go back to synchronous probe. > > > > > > > > > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the > > > > > > > bootloader anyway. And the device fails to boot AOSP regardless of > > > > > > > "rootwait" bootarg being present or not. > > > > > > > > > > > > Could you try applying this diff to enable some log spam and let me know > > > > > > what you get? I'm keen to try and figure this one out. My mail client > > > > > > might crunch this a bit so I have pasted it here too > > > > > > https://p.calebs.dev/ab74b7@raw > > > > > > > > > > These prints add just enough delay for the UFS probe to succeed that I > > > > > can't reproduce the failure anymore. > > > > > > > > I'd prefer doing at least a little debugging before jumping to a > > > > revert. From looking at your dmesg [1], it looks as if the async probe > > > > is allowing RPMH to probe at the same time as "qcom-vadc-common". > > > > That's something that talks on the SPMI bus and is (potentially) > > > > talking to the same PMICs that RPMH-regulator is, right? I'm by no > > > > means an expert on how Qualcomm's PMICs work, but it seems plausible > > > > that the "qcom-vadc-common" is somehow causing problems and screwing > > > > up RPMH. Does that seem plausible to you? > > > > > > > > If so, one interesting way to track it down would be to move around > > > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that > > > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of > > > > vadc_probe(). Maybe that will fix the problem? If so, you can move the > > > > delay around to narrow down the conflict. My wild guess would be that > > > > vadc_reset() could be throwing things for a loop? > > > > > > > > If the above doesn't work, maybe we could add more tracing / printouts > > > > to see what is probing at the same time as RPMH? > > > > > > Tried out a few changes today but none of them worked or were > > > effective enough to debug this crash further, other than setting > > > fw_devlink=permissive. > > > > > > Adding more tracing / prints (BOOTTIME_TRACING and > > > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the > > > crash either. They added just enough delay to boot the device > > > successfully everytime. > > > > > > I tried to reason with the kernel modules which gets loaded before and > > > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC, > > > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I > > > disable those driver modules. So I don't think that the other driver > > > modules which gets loaded at around the same time as > > > qcom-rpmh-regulator by default have any impact on this failure. > > > > Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty > > much anything you can do to change the timing fixes you. That does > > make reverting the async probe of the regulator less appealing. If, as > > you said, it's not just some other driver loading at the same time > > that's interfering then the revert "fixes" you in the same way that a > > "msleep" would fix you. That doesn't seem like enough of a > > justification for the revert to me. > > > > It still feels to me like _something_ is happening at the same time as > > the RPMH regulator driver is loading, though, I'm just not sure how to > > suggest debugging it. I guess other thoughts: > > > > * When RPMH complains, is it always with the same regulator (lvs1), or > > sometimes different ones? Any clue there? > > It is always either lvs1 or lvs2. If you reorder the nodes in the device tree, I think it'll change the probe order. Does that affect anything? I'm wondering if there's some sort of delayed reaction from a previous regulator. > > * How much can you control module loading order? If rpmh-regulator > > module loads first, does it "fix" things? If it does, maybe you could > > bisect to find the place where problems start cropping up. Does that > > give any clues? > > Loading qcom-rpmh-regulator first does make it difficult to reproduce > the crash. I tried a few combinations to narrow down the issue further > and in one case, I managed to reproduce the crash (1 in 20+ reboots) > while loading the qcom-rpmh-regulator (and the dependent cmd-db, > qcom_rpmh) module alone > https://bugs.linaro.org/attachment.cgi?id=1140. > > Regards, > Amit Pundir OK, now that's even weirder. If loading the module alone reproduces the problem, I can't imagine why this would be different without "async" probe. In other words, If it reproduces 5% of the time when loading the module alone with async probe, I'd expect it to reproduce 5% of the time when loading the module alone _without_ async probe too. Note: make sure you're careful to collect more than one reproduction before asserting odds. If it reproduced once in 20 reboots, it might be 5%, it might be 20%, or it might be .01%. Ideally you could script this and let it run for a few hours to get a good reproduction rate? -Doug
On Thu, 1 Jun 2023 at 19:35, Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote: > > > > > > Hi, > > > > > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > > > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote: > > > > > > > > > > Hi, > > > > > > > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > > > > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 13/05/2023 18:08, Amit Pundir wrote: > > > > > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@chromium.org> wrote: > > > > > > > >> > > > > > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use > > > > > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that > > > > > > > >> prompted the us to switch to synchronous probe showed that the root > > > > > > > >> cause was a missing "rootwait" in the kernel command line > > > > > > > >> arguments. Let's reinstate asynchronous probe. > > > > > > > > > > > > > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845) > > > > > > > > running AOSP (Android Open Source Project) with v6.4-rc1 > > > > > > > > https://bugs.linaro.org/show_bug.cgi?id=5975. > > > > > > > > Can we please go back to synchronous probe. > > > > > > > > > > > > > > > > AOSP do not make use of rootwait, IIRC, but it is added by the > > > > > > > > bootloader anyway. And the device fails to boot AOSP regardless of > > > > > > > > "rootwait" bootarg being present or not. > > > > > > > > > > > > > > Could you try applying this diff to enable some log spam and let me know > > > > > > > what you get? I'm keen to try and figure this one out. My mail client > > > > > > > might crunch this a bit so I have pasted it here too > > > > > > > https://p.calebs.dev/ab74b7@raw > > > > > > > > > > > > These prints add just enough delay for the UFS probe to succeed that I > > > > > > can't reproduce the failure anymore. > > > > > > > > > > I'd prefer doing at least a little debugging before jumping to a > > > > > revert. From looking at your dmesg [1], it looks as if the async probe > > > > > is allowing RPMH to probe at the same time as "qcom-vadc-common". > > > > > That's something that talks on the SPMI bus and is (potentially) > > > > > talking to the same PMICs that RPMH-regulator is, right? I'm by no > > > > > means an expert on how Qualcomm's PMICs work, but it seems plausible > > > > > that the "qcom-vadc-common" is somehow causing problems and screwing > > > > > up RPMH. Does that seem plausible to you? > > > > > > > > > > If so, one interesting way to track it down would be to move around > > > > > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that > > > > > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of > > > > > vadc_probe(). Maybe that will fix the problem? If so, you can move the > > > > > delay around to narrow down the conflict. My wild guess would be that > > > > > vadc_reset() could be throwing things for a loop? > > > > > > > > > > If the above doesn't work, maybe we could add more tracing / printouts > > > > > to see what is probing at the same time as RPMH? > > > > > > > > Tried out a few changes today but none of them worked or were > > > > effective enough to debug this crash further, other than setting > > > > fw_devlink=permissive. > > > > > > > > Adding more tracing / prints (BOOTTIME_TRACING and > > > > FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the > > > > crash either. They added just enough delay to boot the device > > > > successfully everytime. > > > > > > > > I tried to reason with the kernel modules which gets loaded before and > > > > after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC, > > > > SPMI_PMIC* etc) as suggested, but I run into the same crash even if I > > > > disable those driver modules. So I don't think that the other driver > > > > modules which gets loaded at around the same time as > > > > qcom-rpmh-regulator by default have any impact on this failure. > > > > > > Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty > > > much anything you can do to change the timing fixes you. That does > > > make reverting the async probe of the regulator less appealing. If, as > > > you said, it's not just some other driver loading at the same time > > > that's interfering then the revert "fixes" you in the same way that a > > > "msleep" would fix you. That doesn't seem like enough of a > > > justification for the revert to me. > > > > > > It still feels to me like _something_ is happening at the same time as > > > the RPMH regulator driver is loading, though, I'm just not sure how to > > > suggest debugging it. I guess other thoughts: > > > > > > * When RPMH complains, is it always with the same regulator (lvs1), or > > > sometimes different ones? Any clue there? > > > > It is always either lvs1 or lvs2. > > If you reorder the nodes in the device tree, I think it'll change the > probe order. Does that affect anything? I'm wondering if there's some > sort of delayed reaction from a previous regulator. Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work. I can't reproduce the crash in 125 reboots so far, while I'm still testing with only qcom-rpmh-regulator kernel module. I'll do some more testing with full system running and send this re-ordering fix I can't reproduce the crash further. > > > > > * How much can you control module loading order? If rpmh-regulator > > > module loads first, does it "fix" things? If it does, maybe you could > > > bisect to find the place where problems start cropping up. Does that > > > give any clues? > > > > Loading qcom-rpmh-regulator first does make it difficult to reproduce > > the crash. I tried a few combinations to narrow down the issue further > > and in one case, I managed to reproduce the crash (1 in 20+ reboots) > > while loading the qcom-rpmh-regulator (and the dependent cmd-db, > > qcom_rpmh) module alone > > https://bugs.linaro.org/attachment.cgi?id=1140. > > > > Regards, > > Amit Pundir > > OK, now that's even weirder. If loading the module alone reproduces > the problem, I can't imagine why this would be different without > "async" probe. In other words, If it reproduces 5% of the time when > loading the module alone with async probe, I'd expect it to reproduce > 5% of the time when loading the module alone _without_ async probe > too. I don't know the internal workings of forcing an asynchronous or synchronous probe, but loading this module alone crashed 60 times in 350 reboots with this async patch, while it didn't crash at all in about 250 reboots if I do PROBE_FORCE_SYNCHRONOUS. Regards, Amit Pundir > > Note: make sure you're careful to collect more than one reproduction > before asserting odds. If it reproduced once in 20 reboots, it might > be 5%, it might be 20%, or it might be .01%. Ideally you could script > this and let it run for a few hours to get a good reproduction rate? > > -Doug
On Fri, Jun 02, 2023 at 01:00:52PM +0530, Amit Pundir wrote: > On Thu, 1 Jun 2023 at 19:35, Doug Anderson <dianders@chromium.org> wrote: > > On Wed, May 31, 2023 at 11:11 PM Amit Pundir <amit.pundir@linaro.org> wrote: > > > On Wed, 17 May 2023 at 02:54, Doug Anderson <dianders@chromium.org> wrote: > > > > On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@chromium.org> wrote: > > > > > > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > > > > > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@linaro.org> wrote: Please delete unneeded context from mails when replying. Doing this makes it much easier to find your reply in the message, helping ensure it won't be missed by people scrolling through the irrelevant quoted material. > > If you reorder the nodes in the device tree, I think it'll change the > > probe order. Does that affect anything? I'm wondering if there's some > > sort of delayed reaction from a previous regulator. > Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the > DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work. > I can't reproduce the crash in 125 reboots so far, while I'm still > testing with only qcom-rpmh-regulator kernel module. I'll do some more > testing with full system running and send this re-ordering fix I can't > reproduce the crash further. So whatever the issue is here it's a timing/race condition - this seems like a workaround which works just now but it's not getting to whatever the actual issue is and that could come back.
On Fri, 2 Jun 2023 at 13:00, Amit Pundir <amit.pundir@linaro.org> wrote: > > On Thu, 1 Jun 2023 at 19:35, Doug Anderson <dianders@chromium.org> wrote: > > > > If you reorder the nodes in the device tree, I think it'll change the > > probe order. Does that affect anything? I'm wondering if there's some > > sort of delayed reaction from a previous regulator. > > Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the > DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work. > I can't reproduce the crash in 125 reboots so far, while I'm still > testing with only qcom-rpmh-regulator kernel module. I'll do some more > testing with full system running and send this re-ordering fix I can't > reproduce the crash further. Hi, successfully rebooted AOSP with v6.4-rc4 on DB845c about 100+ times with this above mentioned lvs nodes reordering in the device tree. I don't see any obvious functionality breakage in my limited smoke testing so far either. I'll post this workaround/fix for review on the lkml. Regards, Amit Pundir
On Fri, 2 Jun 2023 at 18:07, Mark Brown <broonie@kernel.org> wrote: > > > > If you reorder the nodes in the device tree, I think it'll change the > > > probe order. Does that affect anything? I'm wondering if there's some > > > sort of delayed reaction from a previous regulator. > > > Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the > > DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work. > > I can't reproduce the crash in 125 reboots so far, while I'm still > > testing with only qcom-rpmh-regulator kernel module. I'll do some more > > testing with full system running and send this re-ordering fix I can't > > reproduce the crash further. > > So whatever the issue is here it's a timing/race condition - this seems > like a workaround which works just now but it's not getting to whatever > the actual issue is and that could come back. Hi, I'm happy to debug this issue further or test run any patches/ideas if that helps. Regards, Amit Pundir
Hi, On Fri, Jun 2, 2023 at 6:13 AM Amit Pundir <amit.pundir@linaro.org> wrote: > > On Fri, 2 Jun 2023 at 18:07, Mark Brown <broonie@kernel.org> wrote: > > > > > > If you reorder the nodes in the device tree, I think it'll change the > > > > probe order. Does that affect anything? I'm wondering if there's some > > > > sort of delayed reaction from a previous regulator. > > > > > Hi, Bumping lvs1 and lvs2 regulators up to the top of the list in the > > > DTS https://bugs.linaro.org/show_bug.cgi?id=5975#c4 does seem to work. > > > I can't reproduce the crash in 125 reboots so far, while I'm still > > > testing with only qcom-rpmh-regulator kernel module. I'll do some more > > > testing with full system running and send this re-ordering fix I can't > > > reproduce the crash further. > > > > So whatever the issue is here it's a timing/race condition - this seems > > like a workaround which works just now but it's not getting to whatever > > the actual issue is and that could come back. > > Hi, I'm happy to debug this issue further or test run any > patches/ideas if that helps. I guess it's a better workaround than reverting the async patch. ...at least it has a chance of having a real effect. That being said, it's still a bit of a hack. Ideally you'd be able to somehow get information about RPMH's state when things fail. Maybe this might be possible with ramdumps or maybe with JTAG. Unfortunately, I'm not super familiar with either of them. I assume you aren't either or you would have already tried them... From a black box perspective, I guess the things I could think of would be to keep poking around with things that you control. Best ideas I have: 1. Use "bisect" style techniques to figure out how much you really need to move the "lvs" regulators. Try moving it halfway up the list. If that works, move it closer to the bottom. If that doesn't work, move it closer to the top. Eventually you'd find out which regulator it's important to be before. 2. Try adding some delays to some of the regulators with "regulator-enable-ramp-delay" and/or "regulator-settling-time-us". Without a scope, it'll be tricky to figure out exactly which regulators might need delays, but you could at least confirm if the "overkill" approach of having all the regulators have some delay helps... I guess you could also try putting a big delay for "ldo26". If that works, you could try moving it up (again using a bisect style approach) to see where the delay matters? Currently, I guess my mental model of what might be going wrong is that regulators are all turning on / adjusting really quickly. Maybe they aren't switching into "high power mode" quickly enough, maybe they are busy ramping up or down, or maybe there's simply some other issue, but I suppose that something happening could be causing the voltage to droop down (or be too high) and then that's making RPMH upset. Changing the order could be helping avoid this droop, but the more proper fix would be to actually account for it with regulator constraints. My mental model still doesn't really explain what async probe has to do with anything, at least if you're loading _just_ rpmh-regulator all on its own. Assuming you're loading rpmh-regulator all on its own then async probe should do almost nothing. Unless I'm mistaken, async probe won't cause all the RPMH regulators to initialize in parallel. They still initialize synchronously in the rpmh-regulator probe routine. Async probe would _just_ allow some other driver to run its probe at the same time. ...but if no other drivers being loaded at the same time then I'm perplexed about how it could make any difference. -Doug
On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote: > 2. Try adding some delays to some of the regulators with > "regulator-enable-ramp-delay" and/or "regulator-settling-time-us". > Without a scope, it'll be tricky to figure out exactly which > regulators might need delays, but you could at least confirm if the > "overkill" approach of having all the regulators have some delay > helps... I guess you could also try putting a big delay for "ldo26". > If that works, you could try moving it up (again using a bisect style > approach) to see where the delay matters? This is information which should be in the datasheets for the part. > Currently, I guess my mental model of what might be going wrong is > that regulators are all turning on / adjusting really quickly. Maybe > they aren't switching into "high power mode" quickly enough, maybe > they are busy ramping up or down, or maybe there's simply some other > issue, but I suppose that something happening could be causing the > voltage to droop down (or be too high) and then that's making RPMH > upset. Changing the order could be helping avoid this droop, but the > more proper fix would be to actually account for it with regulator > constraints. There could potentially be inrush issues, though I'd not expect reordering to help much there.
Hi, On Wed, Jun 7, 2023 at 6:18 AM Mark Brown <broonie@kernel.org> wrote: > > On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote: > > > 2. Try adding some delays to some of the regulators with > > "regulator-enable-ramp-delay" and/or "regulator-settling-time-us". > > Without a scope, it'll be tricky to figure out exactly which > > regulators might need delays, but you could at least confirm if the > > "overkill" approach of having all the regulators have some delay > > helps... I guess you could also try putting a big delay for "ldo26". > > If that works, you could try moving it up (again using a bisect style > > approach) to see where the delay matters? > > This is information which should be in the datasheets for the part. I was thinking more of something board-specific, not part specific. In theory with RPMH this is also all supposed to be abstracted out into the firmware code that sets up RPMH which magically takes care of things like this, but it certainly could be wrong.
Hi, Thorsten here, the Linux kernel's regression tracker. On 07.06.23 15:47, Doug Anderson wrote: > > On Wed, Jun 7, 2023 at 6:18 AM Mark Brown <broonie@kernel.org> wrote: >> >> On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote: >> >>> 2. Try adding some delays to some of the regulators with >>> "regulator-enable-ramp-delay" and/or "regulator-settling-time-us". >>> Without a scope, it'll be tricky to figure out exactly which >>> regulators might need delays, but you could at least confirm if the >>> "overkill" approach of having all the regulators have some delay >>> helps... I guess you could also try putting a big delay for "ldo26". >>> If that works, you could try moving it up (again using a bisect style >>> approach) to see where the delay matters? >> >> This is information which should be in the datasheets for the part. > > I was thinking more of something board-specific, not part specific. In > theory with RPMH this is also all supposed to be abstracted out into > the firmware code that sets up RPMH which magically takes care of > things like this, but it certainly could be wrong. /me waves friendly That afaics was the last mail in this thread about a regression caused by ad44ac082fd ("regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"") from Doug; Amit's attempt to patch it ( https://lore.kernel.org/lkml/20230602161246.1855448-1-amit.pundir@linaro.org/ ) also wasn't welcomed. Just like his earlier revert attempt (https://lore.kernel.org/lkml/20230515145323.1693044-1-amit.pundir@linaro.org/ ). Does this mean this regression won't be addressed before 6.4 is released? Or was there some progress and I just missed it? What should I tell Linus in my next report? Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr If I did something stupid, please tell me, as explained on that page. #regzbot poke
Hi, On Wed, Jun 14, 2023 at 8:37 AM Linux regression tracking (Thorsten Leemhuis) <regressions@leemhuis.info> wrote: > > Hi, Thorsten here, the Linux kernel's regression tracker. > > On 07.06.23 15:47, Doug Anderson wrote: > > > > On Wed, Jun 7, 2023 at 6:18 AM Mark Brown <broonie@kernel.org> wrote: > >> > >> On Tue, Jun 06, 2023 at 04:29:29PM -0700, Doug Anderson wrote: > >> > >>> 2. Try adding some delays to some of the regulators with > >>> "regulator-enable-ramp-delay" and/or "regulator-settling-time-us". > >>> Without a scope, it'll be tricky to figure out exactly which > >>> regulators might need delays, but you could at least confirm if the > >>> "overkill" approach of having all the regulators have some delay > >>> helps... I guess you could also try putting a big delay for "ldo26". > >>> If that works, you could try moving it up (again using a bisect style > >>> approach) to see where the delay matters? > >> > >> This is information which should be in the datasheets for the part. > > > > I was thinking more of something board-specific, not part specific. In > > theory with RPMH this is also all supposed to be abstracted out into > > the firmware code that sets up RPMH which magically takes care of > > things like this, but it certainly could be wrong. > > /me waves friendly > > That afaics was the last mail in this thread about a regression caused > by ad44ac082fd ("regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use > PROBE_FORCE_SYNCHRONOUS"") from Doug; Amit's attempt to patch it ( > https://lore.kernel.org/lkml/20230602161246.1855448-1-amit.pundir@linaro.org/ > ) also wasn't welcomed. Just like his earlier revert attempt > (https://lore.kernel.org/lkml/20230515145323.1693044-1-amit.pundir@linaro.org/ > ). > > Does this mean this regression won't be addressed before 6.4 is > released? Or was there some progress and I just missed it? What should I > tell Linus in my next report? Of the two proposals made (the revert vs. the reordering of the dts), the reordering of the dts seems better. It only affects the one buggy board (rather than preventing us to move to async probe for everyone) and it also has a chance of actually fixing something (changing the order that regulators probe in rpmh-regulator might legitimately work around the problem). That being said, just like the revert the dts reordering is still just papering over the problem and is fragile / not guaranteed to work forever. -Doug
On Wed, 7 Jun 2023 at 04:59, Doug Anderson <dianders@chromium.org> wrote: > > From a black box perspective, I guess the things I could think of > would be to keep poking around with things that you control. Best > ideas I have: > > 1. Use "bisect" style techniques to figure out how much you really > need to move the "lvs" regulators. Try moving it halfway up the list. > If that works, move it closer to the bottom. If that doesn't work, > move it closer to the top. Eventually you'd find out which regulator > it's important to be before. Hi, I tried this bisect style technique to move lvs regulators up in the list gradually and I found that they need to be enabled atleast before ldo12 and the ldo regulators which follow the ldo12 in the list. > > 2. Try adding some delays to some of the regulators with > "regulator-enable-ramp-delay" and/or "regulator-settling-time-us". > Without a scope, it'll be tricky to figure out exactly which > regulators might need delays, but you could at least confirm if the > "overkill" approach of having all the regulators have some delay > helps... I guess you could also try putting a big delay for "ldo26". > If that works, you could try moving it up (again using a bisect style > approach) to see where the delay matters? I tried this approach as well earlier today but I don't know how big "the big" delay can be. The device fails to boot if I add a settling time of as much as 2sec per each ldo and lvs regulator too. I didn't try increasing the delay further. Regards, Amit Pundir
Hi, On Wed, Jun 14, 2023 at 12:03 PM Amit Pundir <amit.pundir@linaro.org> wrote: > > On Wed, 7 Jun 2023 at 04:59, Doug Anderson <dianders@chromium.org> wrote: > > > > From a black box perspective, I guess the things I could think of > > would be to keep poking around with things that you control. Best > > ideas I have: > > > > 1. Use "bisect" style techniques to figure out how much you really > > need to move the "lvs" regulators. Try moving it halfway up the list. > > If that works, move it closer to the bottom. If that doesn't work, > > move it closer to the top. Eventually you'd find out which regulator > > it's important to be before. > > Hi, I tried this bisect style technique to move lvs regulators up in > the list gradually and I found that they need to be enabled atleast > before ldo12 and the ldo regulators which follow the ldo12 in the > list. Super weird. I was hoping that something would jump out, but nothing does. :( I don't understand how lvs1 / lvs2 could have any impact on ldo12. :( > > 2. Try adding some delays to some of the regulators with > > "regulator-enable-ramp-delay" and/or "regulator-settling-time-us". > > Without a scope, it'll be tricky to figure out exactly which > > regulators might need delays, but you could at least confirm if the > > "overkill" approach of having all the regulators have some delay > > helps... I guess you could also try putting a big delay for "ldo26". > > If that works, you could try moving it up (again using a bisect style > > approach) to see where the delay matters? > > I tried this approach as well earlier today but I don't know how big > "the big" delay can be. The device fails to boot if I add a settling > time of as much as 2sec per each ldo and lvs regulator too. I didn't > try increasing the delay further. Yeah, 2 seconds is plenty big. If that doesn't fix it then it's not a timing issue. I guess with the above results, I'm still super confused about why the async probe has any impact at all on this. It sounds like the _ordering_ of the rpmh-regulators init matters but not the timing, and I'd expect the ordering to be the same between normal probe and async probe. Specifically, I think: a) There is exactly one rpmh-regulator driver instance in your system, right? b) Regulator initialization happens in rpmh_regulator_probe(). c) The rpmh_regulator_probe() function is itself synchronous. That is, it sets up one regulator at a time and, I believe, nothing about the behavior of rpmh_regulator_probe() changes for async vs. sync probe. ...so I'm left baffled...
On Wed, Jun 14, 2023 at 12:35:08PM -0700, Doug Anderson wrote: > Super weird. I was hoping that something would jump out, but nothing > does. :( I don't understand how lvs1 / lvs2 could have any impact on > ldo12. :( Does a device connected to some combination of these regulators have power sequencing issues?
diff --git a/drivers/regulator/qcom-rpmh-regulator.c b/drivers/regulator/qcom-rpmh-regulator.c index 903032b2875f..4826d60e5d95 100644 --- a/drivers/regulator/qcom-rpmh-regulator.c +++ b/drivers/regulator/qcom-rpmh-regulator.c @@ -1462,7 +1462,7 @@ MODULE_DEVICE_TABLE(of, rpmh_regulator_match_table); static struct platform_driver rpmh_regulator_driver = { .driver = { .name = "qcom-rpmh-regulator", - .probe_type = PROBE_FORCE_SYNCHRONOUS, + .probe_type = PROBE_PREFER_ASYNCHRONOUS, .of_match_table = of_match_ptr(rpmh_regulator_match_table), }, .probe = rpmh_regulator_probe,
This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that prompted the us to switch to synchronous probe showed that the root cause was a missing "rootwait" in the kernel command line arguments. Let's reinstate asynchronous probe. Fixes: 58973046c1bf ("regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS") Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> --- drivers/regulator/qcom-rpmh-regulator.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)