diff mbox series

[v2] usb: dwc3: core: Avoid resume dwc3 if already suspended in pm resume

Message ID 20230911033112.3321-1-william.wu@rock-chips.com (mailing list archive)
State New, archived
Headers show
Series [v2] usb: dwc3: core: Avoid resume dwc3 if already suspended in pm resume | expand

Commit Message

William Wu Sept. 11, 2023, 3:31 a.m. UTC
If we enable PM runtime auto suspend for dwc3 on rockchip
platforms (e.g. RK3562), it allows the dwc3 controller to
enter runtime suspend if usb cable detached and power off
the power domain of the controller. When system resume, if
the dwc3 already in runtime suspended, it Shouldn't access
the dwc3 registers in dwc3_resume() because its power domain
maybe power off.

Test on RK3562 tablet, this patch can help to avoid kernel
panic when accessing the dwc3 registers in dwc3_resume() if
the dwc3 is in runtime suspended and it's power domain is
power off.

Kernel panic - not syncing: Asynchronous SError Interrupt
Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board (DT)
Call trace:
dump_backtrace.cfi_jt+0x0/0x8
  dump_stack_lvl+0xc0/0x13c
  panic+0x174/0x468
  arm64_serror_panic+0x1b0/0x200
  do_serror+0x184/0x1e4
  el1_error+0x94/0x118
  el1_abort+0x40/0x68
  el1_sync_handler+0x58/0x88
  el1_sync+0x8c/0x140
  dwc3_readl+0x30/0x1a0
  dwc3_phy_setup+0x38/0x510
  dwc3_core_init+0x68/0xcd4
  dwc3_core_init_for_resume+0x10c/0x25c
  dwc3_resume_common+0x44/0x3d0
  dwc3_resume+0x5c/0xb8
  dpm_run_callback+0x70/0x488
  device_resume+0x250/0x2f8
  dpm_resume+0x258/0x9dc

Signed-off-by: William Wu <william.wu@rock-chips.com>
---
Changes in v2:
- Remove Change-Id.

 drivers/usb/dwc3/core.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Thinh Nguyen Sept. 12, 2023, 12:08 a.m. UTC | #1
Hi,

On Mon, Sep 11, 2023, William Wu wrote:
> If we enable PM runtime auto suspend for dwc3 on rockchip
> platforms (e.g. RK3562), it allows the dwc3 controller to
> enter runtime suspend if usb cable detached and power off
> the power domain of the controller. When system resume, if
> the dwc3 already in runtime suspended, it Shouldn't access
> the dwc3 registers in dwc3_resume() because its power domain
> maybe power off.
> 
> Test on RK3562 tablet, this patch can help to avoid kernel
> panic when accessing the dwc3 registers in dwc3_resume() if
> the dwc3 is in runtime suspended and it's power domain is
> power off.

The controller should be woken up before this step. Can you provide more
detail on what led to this?

e.g. some questions:
Who handles the waking up of the controller? Is it the phy driver? Is
the phy driver not detecting a resume? Or did the resume fail? Does this
occur consistently?

Thanks,
Thinh

> 
> Kernel panic - not syncing: Asynchronous SError Interrupt
> Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board (DT)
> Call trace:
> dump_backtrace.cfi_jt+0x0/0x8
>   dump_stack_lvl+0xc0/0x13c
>   panic+0x174/0x468
>   arm64_serror_panic+0x1b0/0x200
>   do_serror+0x184/0x1e4
>   el1_error+0x94/0x118
>   el1_abort+0x40/0x68
>   el1_sync_handler+0x58/0x88
>   el1_sync+0x8c/0x140
>   dwc3_readl+0x30/0x1a0
>   dwc3_phy_setup+0x38/0x510
>   dwc3_core_init+0x68/0xcd4
>   dwc3_core_init_for_resume+0x10c/0x25c
>   dwc3_resume_common+0x44/0x3d0
>   dwc3_resume+0x5c/0xb8
>   dpm_run_callback+0x70/0x488
>   device_resume+0x250/0x2f8
>   dpm_resume+0x258/0x9dc
> 
> Signed-off-by: William Wu <william.wu@rock-chips.com>
> ---
> Changes in v2:
> - Remove Change-Id.
> 
>  drivers/usb/dwc3/core.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index 9c6bf054f15d..8274a44f2d6a 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -2185,9 +2185,11 @@ static int dwc3_resume(struct device *dev)
>  
>  	pinctrl_pm_select_default_state(dev);
>  
> -	ret = dwc3_resume_common(dwc, PMSG_RESUME);
> -	if (ret)
> -		return ret;
> +	if (!pm_runtime_suspended(dwc->dev)) {
> +		ret = dwc3_resume_common(dwc, PMSG_RESUME);
> +		if (ret)
> +			return ret;
> +	}
>  
>  	pm_runtime_disable(dev);
>  	pm_runtime_set_active(dev);
> -- 
> 2.17.1
>
William Wu Sept. 15, 2023, 3:53 p.m. UTC | #2
Hi Thinh,

On 2023/9/12 8:08, Thinh Nguyen wrote:
> Hi,
>
> On Mon, Sep 11, 2023, William Wu wrote:
>> If we enable PM runtime auto suspend for dwc3 on rockchip
>> platforms (e.g. RK3562), it allows the dwc3 controller to
>> enter runtime suspend if usb cable detached and power off
>> the power domain of the controller. When system resume, if
>> the dwc3 already in runtime suspended, it Shouldn't access
>> the dwc3 registers in dwc3_resume() because its power domain
>> maybe power off.
>>
>> Test on RK3562 tablet, this patch can help to avoid kernel
>> panic when accessing the dwc3 registers in dwc3_resume() if
>> the dwc3 is in runtime suspended and it's power domain is
>> power off.
> The controller should be woken up before this step. Can you provide more
> detail on what led to this?

Yes, the power domain of the usb controller will be enabled by the 
framework of  the pm generic domain before dwc3 resume if the system 
enter suspend and exit suspend normally. However, in my test case,if the 
system fail to enter suspend because of some devices's problem, and then 
goto recovery process, the power domain of the usb controller will not 
be enable before dwc3 resume.

> e.g. some questions:
> Who handles the waking up of the controller? Is it the phy driver? Is
> the phy driver not detecting a resume? Or did the resume fail? Does this
> occur consistently?
>
> Thanks,
> Thinh

This issue occurs occasionally on RK3562 EVB with Type-C USB, and enable 
autosuspend for dwc3 controller.

Here is the test steps:

1. Power on the RK3562 EVB and the Type-C USB interface is in 
unconnected state.

2. Makesure the dwc3 controller enter runtime suspend, and its power 
domain is disabled.

3. Do system suspend/resume stress test.

4. The issue occurs occasionally  with the following log:

[  251.681091][ T4331] PM: suspend entry (deep)
[  251.778975][ T4331] Filesystems sync: 0.097 seconds
[  251.779025][ T4331] Freezing user space processes ... (elapsed 0.005 
seconds) done.
[  251.784819][ T4331] OOM killer disabled.
[  251.784851][ T4331] Freezing remaining freezable tasks ... (elapsed 
0.004 seconds) done.
[  251.792719][  T503] [SKWIFI DBG] skw_suspend: WoW: enabled, skw 
flags: 0x302
[  251.803701][ T4331] PM: dpm_run_callback(): 
platform_pm_suspend.cfi_jt+0x0/0x8 returns -16
[  251.803779][   T75] PM: PM: Pending Wakeup Sources: alarmtimer.0.auto
[  251.803789][ T4331] PM: Device alarmtimer.0.a
[  251.803928][ T4331] PM: Some devices failed to suspend, or early wake 
event detected
[  251.804141][   T75] [SKWIFI DBG] skw_resume: skw flags: 0x300
[  251.804715][    C2] SError Interrupt on CPU2, code 0xbf000000 -- SError
[  251.804725][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: 
G        WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
[  251.804732][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 
Board (DT)
[  251.804738][    C2] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
[  251.804743][    C2] pc : el1_abort+0x40/0x68
[  251.804748][    C2] lr : el1_abort+0x28/0x68

......

[  251.804965][    C2] Kernel panic - not syncing: Asynchronous SError 
Interrupt
[  251.804974][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: 
G        WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
[  251.804980][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 
Board (DT)
[  251.804984][    C2] Call trace:
[  251.804990][    C2]  dump_backtrace.cfi_jt+0x0/0x8
[  251.804995][    C2]  dump_stack_lvl+0xc0/0x13c
[  251.805000][    C2]  panic+0x174/0x468
[  251.805006][    C2]  arm64_serror_panic+0x1b0/0x200
[  251.805010][    C2]  do_serror+0x184/0x1e4
[  251.805016][    C2]  el1_error+0x94/0x118
[  251.805020][    C2]  el1_abort+0x40/0x68
[  251.805026][    C2]  el1_sync_handler+0x58/0x88
[  251.805031][    C2]  el1_sync+0x8c/0x140
[  251.805035][    C2]  dwc3_readl+0x30/0x1a0
[  251.805040][    C2]  dwc3_phy_setup+0x38/0x510
[  251.805045][    C2]  dwc3_core_init+0x68/0xcd4
[  251.805051][    C2]  dwc3_core_init_for_resume+0x10c/0x25c
[  251.805056][    C2]  dwc3_resume_common+0x44/0x3d0
[  251.805061][    C2]  dwc3_resume+0x5c/0xb8
[  251.805067][    C2]  dpm_run_callback+0x70/0x488
[  251.805071][    C2]  device_resume+0x250/0x2f8
[  251.805077][    C2]  dpm_resume+0x258/0x9dc
[  251.805082][    C2]  suspend_devices_and_enter+0x850/0xcac

In this case, during suspend process, because the device alarmtimer 
failed to suspend, it break the system suspend in the funciton 
suspend_devices_and_enter(), and goto platform_recover() directly 
without enable the power domain of the controller, then trigger the 
Kernel panic in dwc3_resume().


For a comparison, in the normal case, if the system enter suspend 
normally, and after the system wakeup, the power domain of the 
controller will be enable by the framework of  the pm generic domain 
before dwc3 resume.

The function call stack like this:

suspend_devices_and_enter -->

     suspend_enter -->

          dpm_resume_noirq --> dpm_noirq_resume_devices --> 
device_resume_noirq --> genpd_resume_noirq --> rockchip_pd_power (enable 
the power domain of the controller)

     dpm_resume_end -->

          dpm_resume --> device_resume --> dpm_run_callback --> 
dwc3_resume (access the controller safely)

          dpm_complete --> genpd_complete --> genpd_queue_power_off_work

suspend_finish --> suspend_thaw_processes --> genpd_power_off_work_fn 
--> (diable the power domain of the controller to maintain the original 
runtime  suspend state)


Thanks,

William

>
>> Kernel panic - not syncing: Asynchronous SError Interrupt
>> Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board (DT)
>> Call trace:
>> dump_backtrace.cfi_jt+0x0/0x8
>>    dump_stack_lvl+0xc0/0x13c
>>    panic+0x174/0x468
>>    arm64_serror_panic+0x1b0/0x200
>>    do_serror+0x184/0x1e4
>>    el1_error+0x94/0x118
>>    el1_abort+0x40/0x68
>>    el1_sync_handler+0x58/0x88
>>    el1_sync+0x8c/0x140
>>    dwc3_readl+0x30/0x1a0
>>    dwc3_phy_setup+0x38/0x510
>>    dwc3_core_init+0x68/0xcd4
>>    dwc3_core_init_for_resume+0x10c/0x25c
>>    dwc3_resume_common+0x44/0x3d0
>>    dwc3_resume+0x5c/0xb8
>>    dpm_run_callback+0x70/0x488
>>    device_resume+0x250/0x2f8
>>    dpm_resume+0x258/0x9dc
>>
>> Signed-off-by: William Wu <william.wu@rock-chips.com>
>> ---
>> Changes in v2:
>> - Remove Change-Id.
>>
>>   drivers/usb/dwc3/core.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>> index 9c6bf054f15d..8274a44f2d6a 100644
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -2185,9 +2185,11 @@ static int dwc3_resume(struct device *dev)
>>   
>>   	pinctrl_pm_select_default_state(dev);
>>   
>> -	ret = dwc3_resume_common(dwc, PMSG_RESUME);
>> -	if (ret)
>> -		return ret;
>> +	if (!pm_runtime_suspended(dwc->dev)) {
>> +		ret = dwc3_resume_common(dwc, PMSG_RESUME);
>> +		if (ret)
>> +			return ret;
>> +	}
>>   
>>   	pm_runtime_disable(dev);
>>   	pm_runtime_set_active(dev);
>> -- 
>> 2.17.1
Wesley Cheng Sept. 15, 2023, 8:17 p.m. UTC | #3
Hi William,

On 9/10/2023 8:31 PM, William Wu wrote:
> If we enable PM runtime auto suspend for dwc3 on rockchip
> platforms (e.g. RK3562), it allows the dwc3 controller to
> enter runtime suspend if usb cable detached and power off
> the power domain of the controller. When system resume, if
> the dwc3 already in runtime suspended, it Shouldn't access
> the dwc3 registers in dwc3_resume() because its power domain
> maybe power off.
> 
> Test on RK3562 tablet, this patch can help to avoid kernel
> panic when accessing the dwc3 registers in dwc3_resume() if
> the dwc3 is in runtime suspended and it's power domain is
> power off.
> 
> Kernel panic - not syncing: Asynchronous SError Interrupt
> Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board (DT)
> Call trace:
> dump_backtrace.cfi_jt+0x0/0x8
>    dump_stack_lvl+0xc0/0x13c
>    panic+0x174/0x468
>    arm64_serror_panic+0x1b0/0x200
>    do_serror+0x184/0x1e4
>    el1_error+0x94/0x118
>    el1_abort+0x40/0x68
>    el1_sync_handler+0x58/0x88
>    el1_sync+0x8c/0x140
>    dwc3_readl+0x30/0x1a0
>    dwc3_phy_setup+0x38/0x510
>    dwc3_core_init+0x68/0xcd4
>    dwc3_core_init_for_resume+0x10c/0x25c
>    dwc3_resume_common+0x44/0x3d0
>    dwc3_resume+0x5c/0xb8
>    dpm_run_callback+0x70/0x488
>    device_resume+0x250/0x2f8
>    dpm_resume+0x258/0x9dc
> 
> Signed-off-by: William Wu <william.wu@rock-chips.com>
> ---
> Changes in v2:
> - Remove Change-Id.
> 
>   drivers/usb/dwc3/core.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> index 9c6bf054f15d..8274a44f2d6a 100644
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -2185,9 +2185,11 @@ static int dwc3_resume(struct device *dev)
>   
>   	pinctrl_pm_select_default_state(dev);
>   
> -	ret = dwc3_resume_common(dwc, PMSG_RESUME);
> -	if (ret)
> -		return ret;
> +	if (!pm_runtime_suspended(dwc->dev)) {
> +		ret = dwc3_resume_common(dwc, PMSG_RESUME);
> +		if (ret)
> +			return ret;
> +	}
>   
>   	pm_runtime_disable(dev);
>   	pm_runtime_set_active(dev);

In case DWC3 is runtime suspended, then we will avoid the 
dwc3_resume_common() call, but the current flow would also set the RPM 
state to RPM_ACTIVE. (from pm_runtime_set_active())  In this case, what 
happens if there is a pm_runtime_get/resume on the DWC3 device?

I think it would avoid calling rpm_resume() since the RPM state is 
already active, so we wouldn't be calling dwc3_runtime_resume().  Do you 
want to also extend the RPM suspended check to cover if we need to force 
PM runtime state back to active?

Thanks
Wesley Cheng
Thinh Nguyen Oct. 2, 2023, 6:18 p.m. UTC | #4
On Fri, Sep 15, 2023, wuliangfeng wrote:
> Hi Thinh,
> 
> On 2023/9/12 8:08, Thinh Nguyen wrote:
> > Hi,
> > 
> > On Mon, Sep 11, 2023, William Wu wrote:
> > > If we enable PM runtime auto suspend for dwc3 on rockchip
> > > platforms (e.g. RK3562), it allows the dwc3 controller to
> > > enter runtime suspend if usb cable detached and power off
> > > the power domain of the controller. When system resume, if
> > > the dwc3 already in runtime suspended, it Shouldn't access
> > > the dwc3 registers in dwc3_resume() because its power domain
> > > maybe power off.
> > > 
> > > Test on RK3562 tablet, this patch can help to avoid kernel
> > > panic when accessing the dwc3 registers in dwc3_resume() if
> > > the dwc3 is in runtime suspended and it's power domain is
> > > power off.
> > The controller should be woken up before this step. Can you provide more
> > detail on what led to this?
> 
> Yes, the power domain of the usb controller will be enabled by the framework
> of  the pm generic domain before dwc3 resume if the system enter suspend and
> exit suspend normally. However, in my test case,if the system fail to enter
> suspend because of some devices's problem, and then goto recovery process,
> the power domain of the usb controller will not be enable before dwc3
> resume.

Ok.

> 
> > e.g. some questions:
> > Who handles the waking up of the controller? Is it the phy driver? Is
> > the phy driver not detecting a resume? Or did the resume fail? Does this
> > occur consistently?
> > 
> > Thanks,
> > Thinh
> 
> This issue occurs occasionally on RK3562 EVB with Type-C USB, and enable
> autosuspend for dwc3 controller.
> 
> Here is the test steps:
> 
> 1. Power on the RK3562 EVB and the Type-C USB interface is in unconnected
> state.
> 
> 2. Makesure the dwc3 controller enter runtime suspend, and its power domain
> is disabled.
> 
> 3. Do system suspend/resume stress test.
> 
> 4. The issue occurs occasionally  with the following log:
> 
> [  251.681091][ T4331] PM: suspend entry (deep)
> [  251.778975][ T4331] Filesystems sync: 0.097 seconds
> [  251.779025][ T4331] Freezing user space processes ... (elapsed 0.005
> seconds) done.
> [  251.784819][ T4331] OOM killer disabled.
> [  251.784851][ T4331] Freezing remaining freezable tasks ... (elapsed 0.004
> seconds) done.
> [  251.792719][  T503] [SKWIFI DBG] skw_suspend: WoW: enabled, skw flags:
> 0x302
> [  251.803701][ T4331] PM: dpm_run_callback():
> platform_pm_suspend.cfi_jt+0x0/0x8 returns -16
> [  251.803779][   T75] PM: PM: Pending Wakeup Sources: alarmtimer.0.auto
> [  251.803789][ T4331] PM: Device alarmtimer.0.a
> [  251.803928][ T4331] PM: Some devices failed to suspend, or early wake
> event detected
> [  251.804141][   T75] [SKWIFI DBG] skw_resume: skw flags: 0x300
> [  251.804715][    C2] SError Interrupt on CPU2, code 0xbf000000 -- SError
> [  251.804725][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: G       
> WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
> [  251.804732][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board
> (DT)
> [  251.804738][    C2] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> [  251.804743][    C2] pc : el1_abort+0x40/0x68
> [  251.804748][    C2] lr : el1_abort+0x28/0x68
> 
> ......
> 
> [  251.804965][    C2] Kernel panic - not syncing: Asynchronous SError
> Interrupt
> [  251.804974][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: G       
> WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
> [  251.804980][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board
> (DT)
> [  251.804984][    C2] Call trace:
> [  251.804990][    C2]  dump_backtrace.cfi_jt+0x0/0x8
> [  251.804995][    C2]  dump_stack_lvl+0xc0/0x13c
> [  251.805000][    C2]  panic+0x174/0x468
> [  251.805006][    C2]  arm64_serror_panic+0x1b0/0x200
> [  251.805010][    C2]  do_serror+0x184/0x1e4
> [  251.805016][    C2]  el1_error+0x94/0x118
> [  251.805020][    C2]  el1_abort+0x40/0x68
> [  251.805026][    C2]  el1_sync_handler+0x58/0x88
> [  251.805031][    C2]  el1_sync+0x8c/0x140
> [  251.805035][    C2]  dwc3_readl+0x30/0x1a0
> [  251.805040][    C2]  dwc3_phy_setup+0x38/0x510
> [  251.805045][    C2]  dwc3_core_init+0x68/0xcd4
> [  251.805051][    C2]  dwc3_core_init_for_resume+0x10c/0x25c
> [  251.805056][    C2]  dwc3_resume_common+0x44/0x3d0
> [  251.805061][    C2]  dwc3_resume+0x5c/0xb8
> [  251.805067][    C2]  dpm_run_callback+0x70/0x488
> [  251.805071][    C2]  device_resume+0x250/0x2f8
> [  251.805077][    C2]  dpm_resume+0x258/0x9dc
> [  251.805082][    C2]  suspend_devices_and_enter+0x850/0xcac
> 
> In this case, during suspend process, because the device alarmtimer failed
> to suspend, it break the system suspend in the funciton
> suspend_devices_and_enter(), and goto platform_recover() directly without
> enable the power domain of the controller, then trigger the Kernel panic in
> dwc3_resume().
> 

Thanks for the details.

> 
> For a comparison, in the normal case, if the system enter suspend normally,
> and after the system wakeup, the power domain of the controller will be
> enable by the framework of  the pm generic domain before dwc3 resume.
> 
> The function call stack like this:
> 
> suspend_devices_and_enter -->
> 
>     suspend_enter -->
> 
>          dpm_resume_noirq --> dpm_noirq_resume_devices -->
> device_resume_noirq --> genpd_resume_noirq --> rockchip_pd_power (enable
> the power domain of the controller)
> 
>     dpm_resume_end -->
> 
>          dpm_resume --> device_resume --> dpm_run_callback --> dwc3_resume
> (access the controller safely)
> 
>          dpm_complete --> genpd_complete --> genpd_queue_power_off_work
> 
> suspend_finish --> suspend_thaw_processes --> genpd_power_off_work_fn -->
> (diable the power domain of the controller to maintain the original runtime 
> suspend state)
> 

At what step do we restore the power domain when this happen? Looks like
there's a missing step in the suspend failure recovery to recover the
power domain. What we're doing here seems more like a workaround to
that, which unfortunately makes the code logic looks unclear IMO.

Can this be fixed in the lower layer?

Thanks,
Thinh
William Wu Oct. 10, 2023, 7:28 a.m. UTC | #5
Hi Thinh,

Sorry for my late reply because the national day holiday.

在 2023/10/3 2:18, Thinh Nguyen 写道:
> On Fri, Sep 15, 2023, wuliangfeng wrote:
>> Hi Thinh,
>>
>> On 2023/9/12 8:08, Thinh Nguyen wrote:
>>> Hi,
>>>
>>> On Mon, Sep 11, 2023, William Wu wrote:
>>>> If we enable PM runtime auto suspend for dwc3 on rockchip
>>>> platforms (e.g. RK3562), it allows the dwc3 controller to
>>>> enter runtime suspend if usb cable detached and power off
>>>> the power domain of the controller. When system resume, if
>>>> the dwc3 already in runtime suspended, it Shouldn't access
>>>> the dwc3 registers in dwc3_resume() because its power domain
>>>> maybe power off.
>>>>
>>>> Test on RK3562 tablet, this patch can help to avoid kernel
>>>> panic when accessing the dwc3 registers in dwc3_resume() if
>>>> the dwc3 is in runtime suspended and it's power domain is
>>>> power off.
>>> The controller should be woken up before this step. Can you provide more
>>> detail on what led to this?
>> Yes, the power domain of the usb controller will be enabled by the framework
>> of  the pm generic domain before dwc3 resume if the system enter suspend and
>> exit suspend normally. However, in my test case,if the system fail to enter
>> suspend because of some devices's problem, and then goto recovery process,
>> the power domain of the usb controller will not be enable before dwc3
>> resume.
> Ok.
>
>>> e.g. some questions:
>>> Who handles the waking up of the controller? Is it the phy driver? Is
>>> the phy driver not detecting a resume? Or did the resume fail? Does this
>>> occur consistently?
>>>
>>> Thanks,
>>> Thinh
>> This issue occurs occasionally on RK3562 EVB with Type-C USB, and enable
>> autosuspend for dwc3 controller.
>>
>> Here is the test steps:
>>
>> 1. Power on the RK3562 EVB and the Type-C USB interface is in unconnected
>> state.
>>
>> 2. Makesure the dwc3 controller enter runtime suspend, and its power domain
>> is disabled.
>>
>> 3. Do system suspend/resume stress test.
>>
>> 4. The issue occurs occasionally  with the following log:
>>
>> [  251.681091][ T4331] PM: suspend entry (deep)
>> [  251.778975][ T4331] Filesystems sync: 0.097 seconds
>> [  251.779025][ T4331] Freezing user space processes ... (elapsed 0.005
>> seconds) done.
>> [  251.784819][ T4331] OOM killer disabled.
>> [  251.784851][ T4331] Freezing remaining freezable tasks ... (elapsed 0.004
>> seconds) done.
>> [  251.792719][  T503] [SKWIFI DBG] skw_suspend: WoW: enabled, skw flags:
>> 0x302
>> [  251.803701][ T4331] PM: dpm_run_callback():
>> platform_pm_suspend.cfi_jt+0x0/0x8 returns -16
>> [  251.803779][   T75] PM: PM: Pending Wakeup Sources: alarmtimer.0.auto
>> [  251.803789][ T4331] PM: Device alarmtimer.0.a
>> [  251.803928][ T4331] PM: Some devices failed to suspend, or early wake
>> event detected
>> [  251.804141][   T75] [SKWIFI DBG] skw_resume: skw flags: 0x300
>> [  251.804715][    C2] SError Interrupt on CPU2, code 0xbf000000 -- SError
>> [  251.804725][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: G
>> WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
>> [  251.804732][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board
>> (DT)
>> [  251.804738][    C2] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
>> [  251.804743][    C2] pc : el1_abort+0x40/0x68
>> [  251.804748][    C2] lr : el1_abort+0x28/0x68
>>
>> ......
>>
>> [  251.804965][    C2] Kernel panic - not syncing: Asynchronous SError
>> Interrupt
>> [  251.804974][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: G
>> WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
>> [  251.804980][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board
>> (DT)
>> [  251.804984][    C2] Call trace:
>> [  251.804990][    C2]  dump_backtrace.cfi_jt+0x0/0x8
>> [  251.804995][    C2]  dump_stack_lvl+0xc0/0x13c
>> [  251.805000][    C2]  panic+0x174/0x468
>> [  251.805006][    C2]  arm64_serror_panic+0x1b0/0x200
>> [  251.805010][    C2]  do_serror+0x184/0x1e4
>> [  251.805016][    C2]  el1_error+0x94/0x118
>> [  251.805020][    C2]  el1_abort+0x40/0x68
>> [  251.805026][    C2]  el1_sync_handler+0x58/0x88
>> [  251.805031][    C2]  el1_sync+0x8c/0x140
>> [  251.805035][    C2]  dwc3_readl+0x30/0x1a0
>> [  251.805040][    C2]  dwc3_phy_setup+0x38/0x510
>> [  251.805045][    C2]  dwc3_core_init+0x68/0xcd4
>> [  251.805051][    C2]  dwc3_core_init_for_resume+0x10c/0x25c
>> [  251.805056][    C2]  dwc3_resume_common+0x44/0x3d0
>> [  251.805061][    C2]  dwc3_resume+0x5c/0xb8
>> [  251.805067][    C2]  dpm_run_callback+0x70/0x488
>> [  251.805071][    C2]  device_resume+0x250/0x2f8
>> [  251.805077][    C2]  dpm_resume+0x258/0x9dc
>> [  251.805082][    C2]  suspend_devices_and_enter+0x850/0xcac
>>
>> In this case, during suspend process, because the device alarmtimer failed
>> to suspend, it break the system suspend in the funciton
>> suspend_devices_and_enter(), and goto platform_recover() directly without
>> enable the power domain of the controller, then trigger the Kernel panic in
>> dwc3_resume().
>>
> Thanks for the details.
>
>> For a comparison, in the normal case, if the system enter suspend normally,
>> and after the system wakeup, the power domain of the controller will be
>> enable by the framework of  the pm generic domain before dwc3 resume.
>>
>> The function call stack like this:
>>
>> suspend_devices_and_enter -->
>>
>>      suspend_enter -->
>>
>>           dpm_resume_noirq --> dpm_noirq_resume_devices -->
>> device_resume_noirq --> genpd_resume_noirq --> rockchip_pd_power (enable
>> the power domain of the controller)
>>
>>      dpm_resume_end -->
>>
>>           dpm_resume --> device_resume --> dpm_run_callback --> dwc3_resume
>> (access the controller safely)
>>
>>           dpm_complete --> genpd_complete --> genpd_queue_power_off_work
>>
>> suspend_finish --> suspend_thaw_processes --> genpd_power_off_work_fn -->
>> (diable the power domain of the controller to maintain the original runtime
>> suspend state)
>>
> At what step do we restore the power domain when this happen? Looks like
> there's a missing step in the suspend failure recovery to recover the
> power domain. What we're doing here seems more like a workaround to
> that, which unfortunately makes the code logic looks unclear IMO.

When then suspend successful, the power framework call 
dpm_resume_noirq() to enable on the power domain and then call 
dpm_complete() -> genpd_power_off_work_fn() to restore the original 
power domain state.

When the suspend failure happens,it seems that nobody recover the power 
domain in the power framework and rockchip platform pm_domains driver.

>
> Can this be fixed in the lower layer?

For the time being, I haven't found a solution in the lower layer to fix 
this issue. I have to disable the dwc3 runtime feature on RK3562 GKI 
project.

>
> Thanks,
> Thinh
Thinh Nguyen Oct. 12, 2023, 5:38 p.m. UTC | #6
On Tue, Oct 10, 2023, wuliangfeng wrote:
> Hi Thinh,
> 
> Sorry for my late reply because the national day holiday.
> 
> 在 2023/10/3 2:18, Thinh Nguyen 写道:
> > On Fri, Sep 15, 2023, wuliangfeng wrote:
> > > Hi Thinh,
> > > 
> > > On 2023/9/12 8:08, Thinh Nguyen wrote:
> > > > Hi,
> > > > 
> > > > On Mon, Sep 11, 2023, William Wu wrote:
> > > > > If we enable PM runtime auto suspend for dwc3 on rockchip
> > > > > platforms (e.g. RK3562), it allows the dwc3 controller to
> > > > > enter runtime suspend if usb cable detached and power off
> > > > > the power domain of the controller. When system resume, if
> > > > > the dwc3 already in runtime suspended, it Shouldn't access
> > > > > the dwc3 registers in dwc3_resume() because its power domain
> > > > > maybe power off.
> > > > > 
> > > > > Test on RK3562 tablet, this patch can help to avoid kernel
> > > > > panic when accessing the dwc3 registers in dwc3_resume() if
> > > > > the dwc3 is in runtime suspended and it's power domain is
> > > > > power off.
> > > > The controller should be woken up before this step. Can you provide more
> > > > detail on what led to this?
> > > Yes, the power domain of the usb controller will be enabled by the framework
> > > of  the pm generic domain before dwc3 resume if the system enter suspend and
> > > exit suspend normally. However, in my test case,if the system fail to enter
> > > suspend because of some devices's problem, and then goto recovery process,
> > > the power domain of the usb controller will not be enable before dwc3
> > > resume.
> > Ok.
> > 
> > > > e.g. some questions:
> > > > Who handles the waking up of the controller? Is it the phy driver? Is
> > > > the phy driver not detecting a resume? Or did the resume fail? Does this
> > > > occur consistently?
> > > > 
> > > > Thanks,
> > > > Thinh
> > > This issue occurs occasionally on RK3562 EVB with Type-C USB, and enable
> > > autosuspend for dwc3 controller.
> > > 
> > > Here is the test steps:
> > > 
> > > 1. Power on the RK3562 EVB and the Type-C USB interface is in unconnected
> > > state.
> > > 
> > > 2. Makesure the dwc3 controller enter runtime suspend, and its power domain
> > > is disabled.
> > > 
> > > 3. Do system suspend/resume stress test.
> > > 
> > > 4. The issue occurs occasionally  with the following log:
> > > 
> > > [  251.681091][ T4331] PM: suspend entry (deep)
> > > [  251.778975][ T4331] Filesystems sync: 0.097 seconds
> > > [  251.779025][ T4331] Freezing user space processes ... (elapsed 0.005
> > > seconds) done.
> > > [  251.784819][ T4331] OOM killer disabled.
> > > [  251.784851][ T4331] Freezing remaining freezable tasks ... (elapsed 0.004
> > > seconds) done.
> > > [  251.792719][  T503] [SKWIFI DBG] skw_suspend: WoW: enabled, skw flags:
> > > 0x302
> > > [  251.803701][ T4331] PM: dpm_run_callback():
> > > platform_pm_suspend.cfi_jt+0x0/0x8 returns -16
> > > [  251.803779][   T75] PM: PM: Pending Wakeup Sources: alarmtimer.0.auto
> > > [  251.803789][ T4331] PM: Device alarmtimer.0.a
> > > [  251.803928][ T4331] PM: Some devices failed to suspend, or early wake
> > > event detected
> > > [  251.804141][   T75] [SKWIFI DBG] skw_resume: skw flags: 0x300
> > > [  251.804715][    C2] SError Interrupt on CPU2, code 0xbf000000 -- SError
> > > [  251.804725][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: G
> > > WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
> > > [  251.804732][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board
> > > (DT)
> > > [  251.804738][    C2] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO BTYPE=--)
> > > [  251.804743][    C2] pc : el1_abort+0x40/0x68
> > > [  251.804748][    C2] lr : el1_abort+0x28/0x68
> > > 
> > > ......
> > > 
> > > [  251.804965][    C2] Kernel panic - not syncing: Asynchronous SError
> > > Interrupt
> > > [  251.804974][    C2] CPU: 2 PID: 4331 Comm: binder:251_4 Tainted: G
> > > WC  E 5.10.157-android13-4-00006-g73f337804fbc-ab9881769 #1
> > > [  251.804980][    C2] Hardware name: Rockchip RK3562 RK817 TABLET LP4 Board
> > > (DT)
> > > [  251.804984][    C2] Call trace:
> > > [  251.804990][    C2]  dump_backtrace.cfi_jt+0x0/0x8
> > > [  251.804995][    C2]  dump_stack_lvl+0xc0/0x13c
> > > [  251.805000][    C2]  panic+0x174/0x468
> > > [  251.805006][    C2]  arm64_serror_panic+0x1b0/0x200
> > > [  251.805010][    C2]  do_serror+0x184/0x1e4
> > > [  251.805016][    C2]  el1_error+0x94/0x118
> > > [  251.805020][    C2]  el1_abort+0x40/0x68
> > > [  251.805026][    C2]  el1_sync_handler+0x58/0x88
> > > [  251.805031][    C2]  el1_sync+0x8c/0x140
> > > [  251.805035][    C2]  dwc3_readl+0x30/0x1a0
> > > [  251.805040][    C2]  dwc3_phy_setup+0x38/0x510
> > > [  251.805045][    C2]  dwc3_core_init+0x68/0xcd4
> > > [  251.805051][    C2]  dwc3_core_init_for_resume+0x10c/0x25c
> > > [  251.805056][    C2]  dwc3_resume_common+0x44/0x3d0
> > > [  251.805061][    C2]  dwc3_resume+0x5c/0xb8
> > > [  251.805067][    C2]  dpm_run_callback+0x70/0x488
> > > [  251.805071][    C2]  device_resume+0x250/0x2f8
> > > [  251.805077][    C2]  dpm_resume+0x258/0x9dc
> > > [  251.805082][    C2]  suspend_devices_and_enter+0x850/0xcac
> > > 
> > > In this case, during suspend process, because the device alarmtimer failed
> > > to suspend, it break the system suspend in the funciton
> > > suspend_devices_and_enter(), and goto platform_recover() directly without
> > > enable the power domain of the controller, then trigger the Kernel panic in
> > > dwc3_resume().
> > > 
> > Thanks for the details.
> > 
> > > For a comparison, in the normal case, if the system enter suspend normally,
> > > and after the system wakeup, the power domain of the controller will be
> > > enable by the framework of  the pm generic domain before dwc3 resume.
> > > 
> > > The function call stack like this:
> > > 
> > > suspend_devices_and_enter -->
> > > 
> > >      suspend_enter -->
> > > 
> > >           dpm_resume_noirq --> dpm_noirq_resume_devices -->
> > > device_resume_noirq --> genpd_resume_noirq --> rockchip_pd_power (enable
> > > the power domain of the controller)
> > > 
> > >      dpm_resume_end -->
> > > 
> > >           dpm_resume --> device_resume --> dpm_run_callback --> dwc3_resume
> > > (access the controller safely)
> > > 
> > >           dpm_complete --> genpd_complete --> genpd_queue_power_off_work
> > > 
> > > suspend_finish --> suspend_thaw_processes --> genpd_power_off_work_fn -->
> > > (diable the power domain of the controller to maintain the original runtime
> > > suspend state)
> > > 
> > At what step do we restore the power domain when this happen? Looks like
> > there's a missing step in the suspend failure recovery to recover the
> > power domain. What we're doing here seems more like a workaround to
> > that, which unfortunately makes the code logic looks unclear IMO.
> 
> When then suspend successful, the power framework call dpm_resume_noirq() to
> enable on the power domain and then call dpm_complete() ->
> genpd_power_off_work_fn() to restore the original power domain state.
> 
> When the suspend failure happens,it seems that nobody recover the power
> domain in the power framework and rockchip platform pm_domains driver.
> 
> > 
> > Can this be fixed in the lower layer?
> 
> For the time being, I haven't found a solution in the lower layer to fix
> this issue. I have to disable the dwc3 runtime feature on RK3562 GKI
> project.
> 

While this patch may avoid the kernel panic, but it's not fixing the
real issue here. I think we need to look into enhancing the suspend
failure recovery instead.

Thanks,
Thinh
diff mbox series

Patch

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index 9c6bf054f15d..8274a44f2d6a 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -2185,9 +2185,11 @@  static int dwc3_resume(struct device *dev)
 
 	pinctrl_pm_select_default_state(dev);
 
-	ret = dwc3_resume_common(dwc, PMSG_RESUME);
-	if (ret)
-		return ret;
+	if (!pm_runtime_suspended(dwc->dev)) {
+		ret = dwc3_resume_common(dwc, PMSG_RESUME);
+		if (ret)
+			return ret;
+	}
 
 	pm_runtime_disable(dev);
 	pm_runtime_set_active(dev);