diff mbox series

[RFC,v2] soc: qcom: pmic_glink: Fix device access from worker during suspend

Message ID 20250129-soc-qcom-pmic-glink-fix-device-access-on-worker-while-suspended-v2-1-de2a3eca514e@linaro.org (mailing list archive)
State New
Headers show
Series [RFC,v2] soc: qcom: pmic_glink: Fix device access from worker during suspend | expand

Commit Message

Abel Vesa Jan. 29, 2025, 11:46 a.m. UTC
For historical reasons, the GLINK smem interrupt is registered with
IRRQF_NO_SUSPEND flag set, which is the underlying problem here, since the
incoming messages can be delivered during late suspend and early
resume.

In this specific case, the pmic_glink_altmode_worker() currently gets
scheduled on the system_wq which can be scheduled to run while devices
are still suspended. This proves to be a problem when a Type-C retimer,
switch or mux that is controlled over a bus like I2C, because the I2C
controller is suspended.

This has been proven to be the case on the X Elite boards where such
retimers (ParadeTech PS8830) are used in order to handle Type-C
orientation and altmode configuration. The following warning is thrown:

[   35.134876] i2c i2c-4: Transfer while suspended
[   35.143865] WARNING: CPU: 0 PID: 99 at drivers/i2c/i2c-core.h:56 __i2c_transfer+0xb4/0x57c [i2c_core]
[   35.352879] Workqueue: events pmic_glink_altmode_worker [pmic_glink_altmode]
[   35.360179] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[   35.455242] Call trace:
[   35.457826]  __i2c_transfer+0xb4/0x57c [i2c_core] (P)
[   35.463086]  i2c_transfer+0x98/0xf0 [i2c_core]
[   35.467713]  i2c_transfer_buffer_flags+0x54/0x88 [i2c_core]
[   35.473502]  regmap_i2c_write+0x20/0x48 [regmap_i2c]
[   35.478659]  _regmap_raw_write_impl+0x780/0x944
[   35.483401]  _regmap_bus_raw_write+0x60/0x7c
[   35.487848]  _regmap_write+0x134/0x184
[   35.491773]  regmap_write+0x54/0x78
[   35.495418]  ps883x_set+0x58/0xec [ps883x]
[   35.499688]  ps883x_sw_set+0x60/0x84 [ps883x]
[   35.504223]  typec_switch_set+0x48/0x74 [typec]
[   35.508952]  pmic_glink_altmode_worker+0x44/0x1fc [pmic_glink_altmode]
[   35.515712]  process_scheduled_works+0x1a0/0x2d0
[   35.520525]  worker_thread+0x2a8/0x3c8
[   35.524449]  kthread+0xfc/0x184
[   35.527749]  ret_from_fork+0x10/0x20

The proper solution here should be to not deliver these kind of messages
during system suspend at all, or at least make it configurable per glink
client. But simply dropping the IRQF_NO_SUSPEND flag entirely will break
other clients. The final shape of the rework of the pmic glink driver in
order to fulfill both the filtering of the messages that need to be able
to wake-up the system and the queueing of these messages until the system
has properly resumed is still being discussed and it is planned as a
future effort.

Meanwhile, the stop-gap fix here is to schedule the pmic glink altmode
worker on the system_freezable_wq instead of the system_wq. This will
result in the altmode worker not being scheduled to run until the
devices are resumed first, which will give the controllers like I2C a
chance to resume before the transfer is requested.

Reported-by: Johan Hovold <johan+linaro@kernel.org>
Closes: https://lore.kernel.org/lkml/Z1CCVjEZMQ6hJ-wK@hovoldconsulting.com/
Fixes: 080b4e24852b ("soc: qcom: pmic_glink: Introduce altmode support")
Cc: stable@vger.kernel.org    # 6.3
Reviewed-by: Caleb Connolly <caleb.connolly@linaro.org>
Signed-off-by: Abel Vesa <abel.vesa@linaro.org>
---
Changes in v2:
- Re-worded the commit to explain the underlying problem and how
  this fix is just a stop-gap for the pmic glink client for now.
- Added Johan's Reported-by tag and link
- Added Caleb's Reviewed-by tag
- Link to v1: https://lore.kernel.org/r/20250110-soc-qcom-pmic-glink-fix-device-access-on-worker-while-suspended-v1-1-e32fd6bf322e@linaro.org
---
 drivers/soc/qcom/pmic_glink_altmode.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


---
base-commit: da7e6047a6264af16d2cb82bed9b6caa33eaf56a
change-id: 20250110-soc-qcom-pmic-glink-fix-device-access-on-worker-while-suspended-af54c5e43ed6

Best regards,

Comments

Bjorn Andersson Jan. 29, 2025, 8:57 p.m. UTC | #1
On Wed, Jan 29, 2025 at 01:46:15PM +0200, Abel Vesa wrote:
> For historical reasons, the GLINK smem interrupt is registered with
> IRRQF_NO_SUSPEND flag set, which is the underlying problem here, since the
> incoming messages can be delivered during late suspend and early
> resume.
> 
> In this specific case, the pmic_glink_altmode_worker() currently gets
> scheduled on the system_wq which can be scheduled to run while devices
> are still suspended. This proves to be a problem when a Type-C retimer,
> switch or mux that is controlled over a bus like I2C, because the I2C
> controller is suspended.
> 
> This has been proven to be the case on the X Elite boards where such
> retimers (ParadeTech PS8830) are used in order to handle Type-C
> orientation and altmode configuration. The following warning is thrown:
> 
> [   35.134876] i2c i2c-4: Transfer while suspended
> [   35.143865] WARNING: CPU: 0 PID: 99 at drivers/i2c/i2c-core.h:56 __i2c_transfer+0xb4/0x57c [i2c_core]
> [   35.352879] Workqueue: events pmic_glink_altmode_worker [pmic_glink_altmode]
> [   35.360179] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [   35.455242] Call trace:
> [   35.457826]  __i2c_transfer+0xb4/0x57c [i2c_core] (P)
> [   35.463086]  i2c_transfer+0x98/0xf0 [i2c_core]
> [   35.467713]  i2c_transfer_buffer_flags+0x54/0x88 [i2c_core]
> [   35.473502]  regmap_i2c_write+0x20/0x48 [regmap_i2c]
> [   35.478659]  _regmap_raw_write_impl+0x780/0x944
> [   35.483401]  _regmap_bus_raw_write+0x60/0x7c
> [   35.487848]  _regmap_write+0x134/0x184
> [   35.491773]  regmap_write+0x54/0x78
> [   35.495418]  ps883x_set+0x58/0xec [ps883x]
> [   35.499688]  ps883x_sw_set+0x60/0x84 [ps883x]
> [   35.504223]  typec_switch_set+0x48/0x74 [typec]
> [   35.508952]  pmic_glink_altmode_worker+0x44/0x1fc [pmic_glink_altmode]
> [   35.515712]  process_scheduled_works+0x1a0/0x2d0
> [   35.520525]  worker_thread+0x2a8/0x3c8
> [   35.524449]  kthread+0xfc/0x184
> [   35.527749]  ret_from_fork+0x10/0x20
> 
> The proper solution here should be to not deliver these kind of messages
> during system suspend at all, or at least make it configurable per glink
> client. But simply dropping the IRQF_NO_SUSPEND flag entirely will break
> other clients. The final shape of the rework of the pmic glink driver in
> order to fulfill both the filtering of the messages that need to be able
> to wake-up the system and the queueing of these messages until the system
> has properly resumed is still being discussed and it is planned as a
> future effort.
> 
> Meanwhile, the stop-gap fix here is to schedule the pmic glink altmode
> worker on the system_freezable_wq instead of the system_wq. This will
> result in the altmode worker not being scheduled to run until the
> devices are resumed first, which will give the controllers like I2C a
> chance to resume before the transfer is requested.
> 
> Reported-by: Johan Hovold <johan+linaro@kernel.org>
> Closes: https://lore.kernel.org/lkml/Z1CCVjEZMQ6hJ-wK@hovoldconsulting.com/
> Fixes: 080b4e24852b ("soc: qcom: pmic_glink: Introduce altmode support")
> Cc: stable@vger.kernel.org    # 6.3
> Reviewed-by: Caleb Connolly <caleb.connolly@linaro.org>
> Signed-off-by: Abel Vesa <abel.vesa@linaro.org>

Reviewed-by: Bjorn Andersson <andersson@kernel.org>

Regards,
Bjorn

> ---
> Changes in v2:
> - Re-worded the commit to explain the underlying problem and how
>   this fix is just a stop-gap for the pmic glink client for now.
> - Added Johan's Reported-by tag and link
> - Added Caleb's Reviewed-by tag
> - Link to v1: https://lore.kernel.org/r/20250110-soc-qcom-pmic-glink-fix-device-access-on-worker-while-suspended-v1-1-e32fd6bf322e@linaro.org
> ---
>  drivers/soc/qcom/pmic_glink_altmode.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/soc/qcom/pmic_glink_altmode.c b/drivers/soc/qcom/pmic_glink_altmode.c
> index bd06ce16180411059e9efb14d9aeccda27744280..bde129aa7d90a39becaa720376c0539bcaa492fb 100644
> --- a/drivers/soc/qcom/pmic_glink_altmode.c
> +++ b/drivers/soc/qcom/pmic_glink_altmode.c
> @@ -295,7 +295,7 @@ static void pmic_glink_altmode_sc8180xp_notify(struct pmic_glink_altmode *altmod
>  	alt_port->mode = mode;
>  	alt_port->hpd_state = hpd_state;
>  	alt_port->hpd_irq = hpd_irq;
> -	schedule_work(&alt_port->work);
> +	queue_work(system_freezable_wq, &alt_port->work);
>  }
>  
>  #define SC8280XP_DPAM_MASK	0x3f
> @@ -338,7 +338,7 @@ static void pmic_glink_altmode_sc8280xp_notify(struct pmic_glink_altmode *altmod
>  	alt_port->mode = mode;
>  	alt_port->hpd_state = hpd_state;
>  	alt_port->hpd_irq = hpd_irq;
> -	schedule_work(&alt_port->work);
> +	queue_work(system_freezable_wq, &alt_port->work);
>  }
>  
>  static void pmic_glink_altmode_callback(const void *data, size_t len, void *priv)
> 
> ---
> base-commit: da7e6047a6264af16d2cb82bed9b6caa33eaf56a
> change-id: 20250110-soc-qcom-pmic-glink-fix-device-access-on-worker-while-suspended-af54c5e43ed6
> 
> Best regards,
> -- 
> Abel Vesa <abel.vesa@linaro.org>
>
Johan Hovold Jan. 30, 2025, 9:43 a.m. UTC | #2
On Wed, Jan 29, 2025 at 01:46:15PM +0200, Abel Vesa wrote:
> For historical reasons, the GLINK smem interrupt is registered with

Please spell out these reasons so that we can determine if they are
still valid or not.

> IRRQF_NO_SUSPEND flag set, which is the underlying problem here, since the
> incoming messages can be delivered during late suspend and early
> resume.
> 
> In this specific case, the pmic_glink_altmode_worker() currently gets
> scheduled on the system_wq which can be scheduled to run while devices
> are still suspended. This proves to be a problem when a Type-C retimer,
> switch or mux that is controlled over a bus like I2C, because the I2C
> controller is suspended.

This is not just an issue with an i2c retimer, this is a plain generic
bug in that the glink code which is calling drivers while their devices
are suspended.

> This has been proven to be the case on the X Elite boards where such
> retimers (ParadeTech PS8830) are used in order to handle Type-C
> orientation and altmode configuration. The following warning is thrown:

> The proper solution here should be to not deliver these kind of messages
> during system suspend at all, or at least make it configurable per glink
> client. But simply dropping the IRQF_NO_SUSPEND flag entirely will break
> other clients.

Which clients? Do we care enough about them to be papering over this
very real bug which can potentially crash the system (e.g. due to
unclocked register accesses)?

> The final shape of the rework of the pmic glink driver in
> order to fulfill both the filtering of the messages that need to be able
> to wake-up the system and the queueing of these messages until the system
> has properly resumed is still being discussed and it is planned as a
> future effort.
> 
> Meanwhile, the stop-gap fix here is to schedule the pmic glink altmode
> worker on the system_freezable_wq instead of the system_wq. This will
> result in the altmode worker not being scheduled to run until the
> devices are resumed first, which will give the controllers like I2C a
> chance to resume before the transfer is requested.

This is incomplete at best, even as a stop gap. I just threw a
dump_stack() in qmp_combo_typec_switch_set() and see that it is being
called four times (don't ask me why) from two different workers on
hotplug.

Even if you use the freezable workqueue for altmode notifications, you
can still end up here via the ucsi notifications, and that now also
seems to generate a warning in the PM code during resume.

[   25.684039] Call trace:
[   25.686633]  show_stack+0x18/0x24 (C)
[   25.690492]  dump_stack_lvl+0xc0/0xd0
[   25.694350]  dump_stack+0x18/0x24
[   25.697851]  qmp_combo_typec_switch_set+0x28/0x210 [phy_qcom_qmp_combo]
[   25.704764]  typec_switch_set+0x58/0x90 [typec]
[   25.709525]  ps883x_sw_set+0x2c/0x98 [ps883x]
[   25.714092]  typec_switch_set+0x58/0x90 [typec]
[   25.718861]  typec_set_orientation+0x24/0x6c [typec]
[   25.724068]  pmic_glink_ucsi_connector_status+0x5c/0x88 [ucsi_glink]
[   25.730726]  ucsi_handle_connector_change+0xc4/0x448 [typec_ucsi]
[   25.737099]  process_one_work+0x20c/0x610
[   25.741322]  worker_thread+0x23c/0x378
[   25.745274]  kthread+0x124/0x128
[   25.748680]  ret_from_fork+0x10/0x20
[   25.753631] typec port4-partner: PM: parent port4 should not be sleeping

When you first brought up the possible workaround of using a freezable
workqueue, I mistakenly thought this would apply to all glink messages
(I realise that would defeat the purpose of allowing some early
notifications).

You've also found that the glink interrupts are waking up the system
unconditionally on battery notifications, which would be yet another
reason for disabling the interrupts (or implementing some kind of
masking).

I'm leaning towards just disabling the glink interrupts during suspend,
but that depends a bit on what (historical?) functionality actually need
them.

Johan
Abel Vesa Jan. 30, 2025, 10:53 a.m. UTC | #3
On 25-01-30 10:43:26, Johan Hovold wrote:
> On Wed, Jan 29, 2025 at 01:46:15PM +0200, Abel Vesa wrote:
> > For historical reasons, the GLINK smem interrupt is registered with
> 
> Please spell out these reasons so that we can determine if they are
> still valid or not.

Sure.

Bjorn, Caleb, please correct me if I'm wrong, but IIRC the historical
reasons here have to do with notifications from mdsp and sdsp being
needed on mobile platforms. Was there something else?

> 
> > IRRQF_NO_SUSPEND flag set, which is the underlying problem here, since the
> > incoming messages can be delivered during late suspend and early
> > resume.
> > 
> > In this specific case, the pmic_glink_altmode_worker() currently gets
> > scheduled on the system_wq which can be scheduled to run while devices
> > are still suspended. This proves to be a problem when a Type-C retimer,
> > switch or mux that is controlled over a bus like I2C, because the I2C
> > controller is suspended.
> 
> This is not just an issue with an i2c retimer, this is a plain generic
> bug in that the glink code which is calling drivers while their devices
> are suspended.

Yes, that's true. I merely used the i2c scenario since this is the first real
usecase where this proves to be a problem. I do agree that this is a
problem in the glink code design though and needs to be reworked
properly.

> 
> > This has been proven to be the case on the X Elite boards where such
> > retimers (ParadeTech PS8830) are used in order to handle Type-C
> > orientation and altmode configuration. The following warning is thrown:
> 
> > The proper solution here should be to not deliver these kind of messages
> > during system suspend at all, or at least make it configurable per glink
> > client. But simply dropping the IRQF_NO_SUSPEND flag entirely will break
> > other clients.
> 
> Which clients? Do we care enough about them to be papering over this
> very real bug which can potentially crash the system (e.g. due to
> unclocked register accesses)?

Well, you just pointed out the ucsi below, which I totally missed.

I guess this fix would have to be replicated to ucsi as well in this
case. But then it is not just a one-off anymore, like Bjorn suggested.

Since this hasn't proved to be a problem until now, I guess all the
other clients are fine with scheduling work while suspended. But this
might be a wrong assumption as well.

> 
> > The final shape of the rework of the pmic glink driver in
> > order to fulfill both the filtering of the messages that need to be able
> > to wake-up the system and the queueing of these messages until the system
> > has properly resumed is still being discussed and it is planned as a
> > future effort.
> > 
> > Meanwhile, the stop-gap fix here is to schedule the pmic glink altmode
> > worker on the system_freezable_wq instead of the system_wq. This will
> > result in the altmode worker not being scheduled to run until the
> > devices are resumed first, which will give the controllers like I2C a
> > chance to resume before the transfer is requested.
> 
> This is incomplete at best, even as a stop gap. I just threw a
> dump_stack() in qmp_combo_typec_switch_set() and see that it is being
> called four times (don't ask me why) from two different workers on
> hotplug.

Yes, that is expected. From the pmic glink altmode worker and then the
ucsi worker (for the gpio based orientation).

> 
> Even if you use the freezable workqueue for altmode notifications, you
> can still end up here via the ucsi notifications, and that now also
> seems to generate a warning in the PM code during resume.
> 
> [   25.684039] Call trace:
> [   25.686633]  show_stack+0x18/0x24 (C)
> [   25.690492]  dump_stack_lvl+0xc0/0xd0
> [   25.694350]  dump_stack+0x18/0x24
> [   25.697851]  qmp_combo_typec_switch_set+0x28/0x210 [phy_qcom_qmp_combo]
> [   25.704764]  typec_switch_set+0x58/0x90 [typec]
> [   25.709525]  ps883x_sw_set+0x2c/0x98 [ps883x]
> [   25.714092]  typec_switch_set+0x58/0x90 [typec]
> [   25.718861]  typec_set_orientation+0x24/0x6c [typec]
> [   25.724068]  pmic_glink_ucsi_connector_status+0x5c/0x88 [ucsi_glink]
> [   25.730726]  ucsi_handle_connector_change+0xc4/0x448 [typec_ucsi]
> [   25.737099]  process_one_work+0x20c/0x610
> [   25.741322]  worker_thread+0x23c/0x378
> [   25.745274]  kthread+0x124/0x128
> [   25.748680]  ret_from_fork+0x10/0x20
> [   25.753631] typec port4-partner: PM: parent port4 should not be sleeping

Yes, totally missed this one. I guess in the end we could apply the same
fix, considering the alternative of reworking the glink code now.

Or I can just start reworking the glink code now but that is a broader
effort since we will be trying to fix the unconditionally wake-up glink
interrupts as well in one go. And will impact all platforms ...

> 
> When you first brought up the possible workaround of using a freezable
> workqueue, I mistakenly thought this would apply to all glink messages
> (I realise that would defeat the purpose of allowing some early
> notifications).

No, only the altmode worker. In fact, everything stays the same with
respect to the other glink messages as we don't want to break other
clients (hopefully Caleb and Bjorn can describe better the ones that
they know about, IIRC, mobile platforms related).

That was the whole point of doing the altmode only (for now).

But now I'm not sure this is a good idea anymore, since we are starting
to play wack-a-mole if we add ucsi to the list.

> 
> You've also found that the glink interrupts are waking up the system
> unconditionally on battery notifications, which would be yet another
> reason for disabling the interrupts (or implementing some kind of
> masking).

Yes, that is something that needs to be addressed. But IIUC it involves
telling adsp to keep quiet about some notifications, which I don't know
if it's possible and how should be done.

For example, with some chargers, I do get "unknown notification: 0x283"
from the pmic_glink.power-supply and it wakes up from suspend even.

> 
> I'm leaning towards just disabling the glink interrupts during suspend,
> but that depends a bit on what (historical?) functionality actually need
> them.

So the disabling needs to be at least per platform if not per client (or
even notification type). This is what the proper solution discussed with
Bjorn off-list was, IIRC.

> 
> Johan

Thanks for reviewing and reporting this new ucsi case.

Abel
diff mbox series

Patch

diff --git a/drivers/soc/qcom/pmic_glink_altmode.c b/drivers/soc/qcom/pmic_glink_altmode.c
index bd06ce16180411059e9efb14d9aeccda27744280..bde129aa7d90a39becaa720376c0539bcaa492fb 100644
--- a/drivers/soc/qcom/pmic_glink_altmode.c
+++ b/drivers/soc/qcom/pmic_glink_altmode.c
@@ -295,7 +295,7 @@  static void pmic_glink_altmode_sc8180xp_notify(struct pmic_glink_altmode *altmod
 	alt_port->mode = mode;
 	alt_port->hpd_state = hpd_state;
 	alt_port->hpd_irq = hpd_irq;
-	schedule_work(&alt_port->work);
+	queue_work(system_freezable_wq, &alt_port->work);
 }
 
 #define SC8280XP_DPAM_MASK	0x3f
@@ -338,7 +338,7 @@  static void pmic_glink_altmode_sc8280xp_notify(struct pmic_glink_altmode *altmod
 	alt_port->mode = mode;
 	alt_port->hpd_state = hpd_state;
 	alt_port->hpd_irq = hpd_irq;
-	schedule_work(&alt_port->work);
+	queue_work(system_freezable_wq, &alt_port->work);
 }
 
 static void pmic_glink_altmode_callback(const void *data, size_t len, void *priv)