mbox series

[0/3] soc: qcom: pmic_glink: v6.11-rc bug fixes

Message ID 20240818-pmic-glink-v6-11-races-v1-0-f87c577e0bc9@quicinc.com (mailing list archive)
Headers show
Series soc: qcom: pmic_glink: v6.11-rc bug fixes | expand

Message

Bjorn Andersson Aug. 18, 2024, 11:17 p.m. UTC
Amit and Johan both reported a NULL pointer dereference in the
pmic_glink client code during initialization, and Stephen Boyd pointed
out the problem (race condition).

While investigating, and writing the fix, I noticed that
ucsi_unregister() is called in atomic context but tries to sleep, and I
also noticed that the condition for when to inform the pmic_glink client
drivers when the remote has gone down is just wrong.

So, let's fix all three.

As mentioned in the commit message for the UCSI fix, I have a series in
the works that makes the GLINK callback happen in a sleepable context,
which would remove the need for the clients list to be protected by a
spinlock, and removing the work scheduling. This is however not -rc
material...

In addition to the NULL pointer dereference, there is the -ECANCELED
issue reported here:
https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/
I have not yet been able to either reproduce this or convince myself
that this is the same issue.

Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
---
Bjorn Andersson (3):
      soc: qcom: pmic_glink: Fix race during initialization
      usb: typec: ucsi: Move unregister out of atomic section
      soc: qcom: pmic_glink: Actually communicate with remote goes down

 drivers/power/supply/qcom_battmgr.c   | 16 ++++++++-----
 drivers/soc/qcom/pmic_glink.c         | 40 +++++++++++++++++++++----------
 drivers/soc/qcom/pmic_glink_altmode.c | 17 +++++++++-----
 drivers/usb/typec/ucsi/ucsi_glink.c   | 44 ++++++++++++++++++++++++++---------
 include/linux/soc/qcom/pmic_glink.h   | 11 +++++----
 5 files changed, 88 insertions(+), 40 deletions(-)
---
base-commit: 296c871d2904cff2b4742702ef94512ab467a8e3
change-id: 20240818-pmic-glink-v6-11-races-363f5964c339

Best regards,

Comments

Amit Pundir Aug. 19, 2024, 10:12 a.m. UTC | #1
On Mon, 19 Aug 2024 at 04:47, Bjorn Andersson <quic_bjorande@quicinc.com> wrote:
>
> Amit and Johan both reported a NULL pointer dereference in the
> pmic_glink client code during initialization, and Stephen Boyd pointed
> out the problem (race condition).
>
> While investigating, and writing the fix, I noticed that
> ucsi_unregister() is called in atomic context but tries to sleep, and I
> also noticed that the condition for when to inform the pmic_glink client
> drivers when the remote has gone down is just wrong.
>
> So, let's fix all three.
>
> As mentioned in the commit message for the UCSI fix, I have a series in
> the works that makes the GLINK callback happen in a sleepable context,
> which would remove the need for the clients list to be protected by a
> spinlock, and removing the work scheduling. This is however not -rc
> material...
>
> In addition to the NULL pointer dereference, there is the -ECANCELED
> issue reported here:
> https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> I have not yet been able to either reproduce this or convince myself
> that this is the same issue.
>

Thank you for the fixes Bjorn. I'm not able to reproduce that
pmic_glink kernel panic on SM8550-HDK anymore.

Tested-by: Amit Pundir <amit.pundir@linaro.org>

> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
> ---
> Bjorn Andersson (3):
>       soc: qcom: pmic_glink: Fix race during initialization
>       usb: typec: ucsi: Move unregister out of atomic section
>       soc: qcom: pmic_glink: Actually communicate with remote goes down
>
>  drivers/power/supply/qcom_battmgr.c   | 16 ++++++++-----
>  drivers/soc/qcom/pmic_glink.c         | 40 +++++++++++++++++++++----------
>  drivers/soc/qcom/pmic_glink_altmode.c | 17 +++++++++-----
>  drivers/usb/typec/ucsi/ucsi_glink.c   | 44 ++++++++++++++++++++++++++---------
>  include/linux/soc/qcom/pmic_glink.h   | 11 +++++----
>  5 files changed, 88 insertions(+), 40 deletions(-)
> ---
> base-commit: 296c871d2904cff2b4742702ef94512ab467a8e3
> change-id: 20240818-pmic-glink-v6-11-races-363f5964c339
>
> Best regards,
> --
> Bjorn Andersson <quic_bjorande@quicinc.com>
>
Greg KH Aug. 19, 2024, 2:07 p.m. UTC | #2
On Sun, Aug 18, 2024 at 04:17:36PM -0700, Bjorn Andersson wrote:
> Amit and Johan both reported a NULL pointer dereference in the
> pmic_glink client code during initialization, and Stephen Boyd pointed
> out the problem (race condition).
> 
> While investigating, and writing the fix, I noticed that
> ucsi_unregister() is called in atomic context but tries to sleep, and I
> also noticed that the condition for when to inform the pmic_glink client
> drivers when the remote has gone down is just wrong.
> 
> So, let's fix all three.
> 
> As mentioned in the commit message for the UCSI fix, I have a series in
> the works that makes the GLINK callback happen in a sleepable context,
> which would remove the need for the clients list to be protected by a
> spinlock, and removing the work scheduling. This is however not -rc
> material...
> 
> In addition to the NULL pointer dereference, there is the -ECANCELED
> issue reported here:
> https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> I have not yet been able to either reproduce this or convince myself
> that this is the same issue.
> 
> Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>

What tree are these to go through?  I can take them through mine, but if
someone else wants to, feel free to route them some other way.

thanks,

greg k-h
Bjorn Andersson Aug. 19, 2024, 2:56 p.m. UTC | #3
On Mon, Aug 19, 2024 at 04:07:41PM +0200, Greg Kroah-Hartman wrote:
> On Sun, Aug 18, 2024 at 04:17:36PM -0700, Bjorn Andersson wrote:
> > Amit and Johan both reported a NULL pointer dereference in the
> > pmic_glink client code during initialization, and Stephen Boyd pointed
> > out the problem (race condition).
> > 
> > While investigating, and writing the fix, I noticed that
> > ucsi_unregister() is called in atomic context but tries to sleep, and I
> > also noticed that the condition for when to inform the pmic_glink client
> > drivers when the remote has gone down is just wrong.
> > 
> > So, let's fix all three.
> > 
> > As mentioned in the commit message for the UCSI fix, I have a series in
> > the works that makes the GLINK callback happen in a sleepable context,
> > which would remove the need for the clients list to be protected by a
> > spinlock, and removing the work scheduling. This is however not -rc
> > material...
> > 
> > In addition to the NULL pointer dereference, there is the -ECANCELED
> > issue reported here:
> > https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> > I have not yet been able to either reproduce this or convince myself
> > that this is the same issue.
> > 
> > Signed-off-by: Bjorn Andersson <quic_bjorande@quicinc.com>
> 
> What tree are these to go through?  I can take them through mine, but if
> someone else wants to, feel free to route them some other way.
> 

It's primarily soc/qcom content, so I can pick them through the qcom soc
tree.

Regards,
Bjorn

> thanks,
> 
> greg k-h
>
Johan Hovold Aug. 19, 2024, 3:48 p.m. UTC | #4
On Sun, Aug 18, 2024 at 04:17:36PM -0700, Bjorn Andersson wrote:
> Amit and Johan both reported a NULL pointer dereference in the
> pmic_glink client code during initialization, and Stephen Boyd pointed
> out the problem (race condition).

> In addition to the NULL pointer dereference, there is the -ECANCELED
> issue reported here:
> https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> I have not yet been able to either reproduce this or convince myself
> that this is the same issue.

I can confirm that I still see the -ECANCELED issue with this series
applied:

[    8.979329] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
[    9.004735] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125

Johan
Bjorn Andersson Aug. 19, 2024, 4:53 p.m. UTC | #5
On Mon, Aug 19, 2024 at 05:48:26PM +0200, Johan Hovold wrote:
> On Sun, Aug 18, 2024 at 04:17:36PM -0700, Bjorn Andersson wrote:
> > Amit and Johan both reported a NULL pointer dereference in the
> > pmic_glink client code during initialization, and Stephen Boyd pointed
> > out the problem (race condition).
> 
> > In addition to the NULL pointer dereference, there is the -ECANCELED
> > issue reported here:
> > https://lore.kernel.org/all/Zqet8iInnDhnxkT9@hovoldconsulting.com/
> > I have not yet been able to either reproduce this or convince myself
> > that this is the same issue.
> 
> I can confirm that I still see the -ECANCELED issue with this series
> applied:
> 
> [    8.979329] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
> [    9.004735] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
> 

Could you confirm that you're seeing a call to
qcom_glink_handle_intent_req_ack() with granted == 0, leading to the
transfer failing.

It would also be nice, just for completeness sake to rule out that you
do not get a call to qcom_glink_intent_req_abort() here.

Regards,
Bjorn

> Johan
>
Johan Hovold Aug. 20, 2024, 7:31 a.m. UTC | #6
On Mon, Aug 19, 2024 at 09:53:09AM -0700, Bjorn Andersson wrote:
> On Mon, Aug 19, 2024 at 05:48:26PM +0200, Johan Hovold wrote:

> > I can confirm that I still see the -ECANCELED issue with this series
> > applied:
> > 
> > [    8.979329] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
> > [    9.004735] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125
> 
> Could you confirm that you're seeing a call to
> qcom_glink_handle_intent_req_ack() with granted == 0, leading to the
> transfer failing.

It appears so:

[    9.539415]  30000000.remoteproc:glink-edge: qcom_glink_handle_intent_req_ack - cid = 9, granted = 0
[    9.561750] qcom_battmgr.pmic_glink_power_supply pmic_glink.power-supply.0: failed to request power notifications

[    9.448945]  30000000.remoteproc:glink-edge: qcom_glink_handle_intent_req_ack - cid = 9, granted = 0
[    9.461267] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to send altmode request: 0x10 (-125)
[    9.469241] qcom,apr 30000000.remoteproc:glink-edge.adsp_apps.-1.-1: Adding APR/GPR dev: gprsvc:service:2:1
[    9.478968] pmic_glink_altmode.pmic_glink_altmode pmic_glink.altmode.0: failed to request altmode notifications: -125

> It would also be nice, just for completeness sake to rule out that you
> do not get a call to qcom_glink_intent_req_abort() here.

And I'm not seeing this function being called.

Johan