diff mbox

Since Linux 4.13 tlp or powertop usage cause "xHCI host controller not responding, assume dead" on Dell 5855

Message ID 16a67206-6dce-01f1-1074-ee5d3b7e2602@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Mathias Nyman April 16, 2018, 11:55 a.m. UTC
On 10.04.2018 12:15, russianneuromancer@ya.ru wrote:
> Hello!
> 
> On Dell Venue 8 Pro 5855 tablet installing tlp or running "powertop --
> auto-tune" cause "xHCI host controller not responding, assume dead"
> error, when error happen two integrated USB devices (Bluetooth adapter
> and LTE modem) disappear until reboot. First time this issue was
> observer in Linux 4.13 and still present in Linux 4.16. Blacklisting
> both "Linux Foundation 3.0 root hub" from autosuspend in tlp
> configuration is workaround for this issue, however on other devices
> tlp works fine without blacklisting usb hub autosuspend, and on this
> tablet there was no such issue before (at least in Linux ~4.8-4.12
> range) so I assume there is regression somewhere.
> 
> Is there any related commits between 4.12 and 4.13 that I could try to
> revert?
> 

In 4.12 there was a added sensitivity to react to hotplug removed
xhc controllers, i.e. if we read 0xffffffff from a xhci register
we assume host is removed and start cleaning up.

commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
     xhci: Rework how we handle unresponsive or hoptlug removed hosts

You can try to revert that, but as a final solution we should
find the real rootcause

> How issue looks like in logs:
> 
> [  227.258385] xhci_hcd 0000:00:14.0: xHC is not running.
> [  329.671544] xhci_hcd 0000:00:14.0: xHC is not running.
> [  416.695796] xhci_hcd 0000:00:14.0: xHC is not running.

The "xHC is not running" is the xhci driver handing a port event
interrupt for a resuming port, but whole host controller is not running.
We stop the host controller in xhci_suspend(), and start it in xhci_resume()

Attaching a patch that improves preventing xhci host suspend during
USB2 resume signaling.
Could help, worth a shot.

> [  416.695862] xhci_hcd 0000:00:14.0: xHCI host controller not
> responding, assume dead

This means xhci_hc_died() was called, many possible places.
Adding the code below could give a hint:


> [  416.695900] xhci_hcd 0000:00:14.0: HC died; cleaning up
> [  416.696052] usb 1-3: USB disconnect, device number 2
> [  416.815610] cdc_mbim 1-3:1.12 wwp0s20u3i12: unregister 'cdc_mbim'
> usb-0000:00:14.0-3, CDC MBIM
> [  416.847934] usb 1-4: USB disconnect, device number 3
> 
> After that Bluetooth adapter and LTE modem disappear from lsusb output,
> while xHCI controller itself remain visible.

we stop the host activity in xhci_hc_died(), no usb devices under this host will work.

> Complete dmesg: https://paste.fedoraproject.org/paste/7aMpVGLfZ82zppdGs
> 56Oqg
> lsusb -v: https://paste.fedoraproject.org/paste/c7y8GisC13YdzcYE9B-JIw
> dsdt.dsl: https://paste.fedoraproject.org/paste/8g6mp2dafypUkFT4sa43iA

xhci traces and dynamic debug could help:

mount -t debugfs none /sys/kernel/debug
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable

echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control

-Mathias

Comments

russianneuromancer@ya.ru April 22, 2018, 6:29 a.m. UTC | #1
Hello!

So far I tested attached patch but didn't tried to revert commit yet,
will do next week.

Result of running patched kernel with recommended debug options: 
https://paste.fedoraproject.org/paste/UpezexD~tDmQthoxV2BFbg

16/04/2018 14:55 +0300, Mathias Nyman:
> On 10.04.2018 12:15, russianneuromancer@ya.ru wrote:
> > Hello!
> > 
> > On Dell Venue 8 Pro 5855 tablet installing tlp or running "powertop
> > --
> > auto-tune" cause "xHCI host controller not responding, assume dead"
> > error, when error happen two integrated USB devices (Bluetooth
> > adapter
> > and LTE modem) disappear until reboot. First time this issue was
> > observer in Linux 4.13 and still present in Linux 4.16.
> > Blacklisting
> > both "Linux Foundation 3.0 root hub" from autosuspend in tlp
> > configuration is workaround for this issue, however on other
> > devices
> > tlp works fine without blacklisting usb hub autosuspend, and on
> > this
> > tablet there was no such issue before (at least in Linux ~4.8-4.12
> > range) so I assume there is regression somewhere.
> > 
> > Is there any related commits between 4.12 and 4.13 that I could try
> > to revert?
> > 
> 
> In 4.12 there was a added sensitivity to react to hotplug removed
> xhc controllers, i.e. if we read 0xffffffff from a xhci register
> we assume host is removed and start cleaning up.
> 
> commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>      xhci: Rework how we handle unresponsive or hoptlug removed hosts
> 
> You can try to revert that, but as a final solution we should
> find the real rootcause
> 
> > How issue looks like in logs:
> > 
> > [  227.258385] xhci_hcd 0000:00:14.0: xHC is not running.
> > [  329.671544] xhci_hcd 0000:00:14.0: xHC is not running.
> > [  416.695796] xhci_hcd 0000:00:14.0: xHC is not running.
> 
> The "xHC is not running" is the xhci driver handing a port event
> interrupt for a resuming port, but whole host controller is not
> running.
> We stop the host controller in xhci_suspend(), and start it in
> xhci_resume()
> 
> Attaching a patch that improves preventing xhci host suspend during
> USB2 resume signaling.
> Could help, worth a shot.
> 
> > [  416.695862] xhci_hcd 0000:00:14.0: xHCI host controller not
> > responding, assume dead
> 
> This means xhci_hc_died() was called, many possible places.
> Adding the code below could give a hint:
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-
> ring.c
> index daa94c3..51fb3d0 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -900,7 +900,8 @@ void xhci_hc_died(struct xhci_hcd *xhci)
>          if (xhci->xhc_state & XHCI_STATE_DYING)
>                  return;
>   
> -       xhci_err(xhci, "xHCI host controller not responding, assume
> dead\n");
> +       xhci_err(xhci, "%ps: xHCI host controller not responding,
> assume dead\n",
> +                __builtin_return_address(0));
>          xhci->xhc_state |= XHCI_STATE_DYING;
>   
>          xhci_cleanup_command_queue(xhci);
> 
> > [  416.695900] xhci_hcd 0000:00:14.0: HC died; cleaning up
> > [  416.696052] usb 1-3: USB disconnect, device number 2
> > [  416.815610] cdc_mbim 1-3:1.12 wwp0s20u3i12: unregister
> > 'cdc_mbim'
> > usb-0000:00:14.0-3, CDC MBIM
> > [  416.847934] usb 1-4: USB disconnect, device number 3
> > 
> > After that Bluetooth adapter and LTE modem disappear from lsusb
> > output,
> > while xHCI controller itself remain visible.
> 
> we stop the host activity in xhci_hc_died(), no usb devices under
> this host will work.
> 
> > Complete dmesg: https://paste.fedoraproject.org/paste/7aMpVGLfZ82zp
> > pdGs
> > 56Oqg
> > lsusb -v: https://paste.fedoraproject.org/paste/c7y8GisC13YdzcYE9B-
> > JIw
> > dsdt.dsl: https://paste.fedoraproject.org/paste/8g6mp2dafypUkFT4sa4
> > 3iA
> 
> xhci traces and dynamic debug could help:
> 
> mount -t debugfs none /sys/kernel/debug
> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
> 
> echo -n 'module xhci_hcd =p' >
> /sys/kernel/debug/dynamic_debug/control
> 
> -Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mathias Nyman April 23, 2018, 2:52 p.m. UTC | #2
On 22.04.2018 09:29, russianneuromancer@ya.ru wrote:
> Hello!
> 
> So far I tested attached patch but didn't tried to revert commit yet,
> will do next week.
> 
> Result of running patched kernel with recommended debug options:
> https://paste.fedoraproject.org/paste/UpezexD~tDmQthoxV2BFbg
> 

Logs show there is a race, controller is suspended, then resumed,
but no interrupt is pending in xhci_resume so roothubs are not resumed,
and host starts to suspend again.

We get the interrupt only after we already started suspending xhci
controller again.

My guess is that when we handle the interrupt we queue work to resume the roothub,
but controller is probably put to D3 suspended state by then,
returning 0xffffffff from some register reads, which driver understands as a dead host.

I need to look into this a bit more.

[  268.144527] xhci_hcd 0000:00:14.0: xhci_suspend: stopping port polling.
[  268.144543] xhci_hcd 0000:00:14.0: // Setting command ring address to 0x349bd001
[  268.520802] xhci_hcd 0000:00:14.0: // Setting command ring address to 0x349bd001
[  268.520969] xhci_hcd 0000:00:14.0: xhci_resume: starting port polling.
[  268.520985] xhci_hcd 0000:00:14.0: xhci_hub_status_data: stopping port polling.
[  268.521030] xhci_hcd 0000:00:14.0: xhci_suspend: stopping port polling.
[  268.521040] xhci_hcd 0000:00:14.0: // Setting command ring address to 0x349bd001
[  268.521139] xhci_hcd 0000:00:14.0: Port Status Change Event for port 3
[  268.521149] xhci_hcd 0000:00:14.0: resume root hub
[  268.521163] xhci_hcd 0000:00:14.0: port resume event for port 3
[  268.521168] xhci_hcd 0000:00:14.0: xHC is not running.
[  268.521174] xhci_hcd 0000:00:14.0: handle_port_status: starting port polling.
[  268.596322] xhci_hcd 0000:00:14.0: xhci_hc_died: xHCI host controller not responding, assume dead
[  268.596340] xhci_hcd 0000:00:14.0: Killing URBs for slot ID 1, ep index 0

-Mathias

> 16/04/2018 14:55 +0300, Mathias Nyman:
>> On 10.04.2018 12:15, russianneuromancer@ya.ru wrote:
>>> Hello!
>>>
>>> On Dell Venue 8 Pro 5855 tablet installing tlp or running "powertop
>>> --
>>> auto-tune" cause "xHCI host controller not responding, assume dead"
>>> error, when error happen two integrated USB devices (Bluetooth
>>> adapter
>>> and LTE modem) disappear until reboot. First time this issue was
>>> observer in Linux 4.13 and still present in Linux 4.16.
>>> Blacklisting
>>> both "Linux Foundation 3.0 root hub" from autosuspend in tlp
>>> configuration is workaround for this issue, however on other
>>> devices
>>> tlp works fine without blacklisting usb hub autosuspend, and on
>>> this
>>> tablet there was no such issue before (at least in Linux ~4.8-4.12
>>> range) so I assume there is regression somewhere.
>>>
>>> Is there any related commits between 4.12 and 4.13 that I could try
>>> to revert?
>>>
>>
>> In 4.12 there was a added sensitivity to react to hotplug removed
>> xhc controllers, i.e. if we read 0xffffffff from a xhci register
>> we assume host is removed and start cleaning up.
>>
>> commit d9f11ba9f107aa335091ab8d7ba5eea714e46e8b
>>       xhci: Rework how we handle unresponsive or hoptlug removed hosts
>>
>> You can try to revert that, but as a final solution we should
>> find the real rootcause
>>
>>> How issue looks like in logs:
>>>
>>> [  227.258385] xhci_hcd 0000:00:14.0: xHC is not running.
>>> [  329.671544] xhci_hcd 0000:00:14.0: xHC is not running.
>>> [  416.695796] xhci_hcd 0000:00:14.0: xHC is not running.
>>
>> The "xHC is not running" is the xhci driver handing a port event
>> interrupt for a resuming port, but whole host controller is not
>> running.
>> We stop the host controller in xhci_suspend(), and start it in
>> xhci_resume()
>>
>> Attaching a patch that improves preventing xhci host suspend during
>> USB2 resume signaling.
>> Could help, worth a shot.
>>
>>> [  416.695862] xhci_hcd 0000:00:14.0: xHCI host controller not
>>> responding, assume dead
>>
>> This means xhci_hc_died() was called, many possible places.
>> Adding the code below could give a hint:
>>
>> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-
>> ring.c
>> index daa94c3..51fb3d0 100644
>> --- a/drivers/usb/host/xhci-ring.c
>> +++ b/drivers/usb/host/xhci-ring.c
>> @@ -900,7 +900,8 @@ void xhci_hc_died(struct xhci_hcd *xhci)
>>           if (xhci->xhc_state & XHCI_STATE_DYING)
>>                   return;
>>    
>> -       xhci_err(xhci, "xHCI host controller not responding, assume
>> dead\n");
>> +       xhci_err(xhci, "%ps: xHCI host controller not responding,
>> assume dead\n",
>> +                __builtin_return_address(0));
>>           xhci->xhc_state |= XHCI_STATE_DYING;
>>    
>>           xhci_cleanup_command_queue(xhci);
>>
>>> [  416.695900] xhci_hcd 0000:00:14.0: HC died; cleaning up
>>> [  416.696052] usb 1-3: USB disconnect, device number 2
>>> [  416.815610] cdc_mbim 1-3:1.12 wwp0s20u3i12: unregister
>>> 'cdc_mbim'
>>> usb-0000:00:14.0-3, CDC MBIM
>>> [  416.847934] usb 1-4: USB disconnect, device number 3
>>>
>>> After that Bluetooth adapter and LTE modem disappear from lsusb
>>> output,
>>> while xHCI controller itself remain visible.
>>
>> we stop the host activity in xhci_hc_died(), no usb devices under
>> this host will work.
>>
>>> Complete dmesg: https://paste.fedoraproject.org/paste/7aMpVGLfZ82zp
>>> pdGs
>>> 56Oqg
>>> lsusb -v: https://paste.fedoraproject.org/paste/c7y8GisC13YdzcYE9B-
>>> JIw
>>> dsdt.dsl: https://paste.fedoraproject.org/paste/8g6mp2dafypUkFT4sa4
>>> 3iA
>>
>> xhci traces and dynamic debug could help:
>>
>> mount -t debugfs none /sys/kernel/debug
>> echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
>> echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
>>
>> echo -n 'module xhci_hcd =p' >
>> /sys/kernel/debug/dynamic_debug/control
>>
>> -Mathias
> --
> To unsubscribe from this list: send the line "unsubscribe linux-usb" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern April 23, 2018, 3:11 p.m. UTC | #3
On Mon, 23 Apr 2018, Mathias Nyman wrote:

> On 22.04.2018 09:29, russianneuromancer@ya.ru wrote:
> > Hello!
> > 
> > So far I tested attached patch but didn't tried to revert commit yet,
> > will do next week.
> > 
> > Result of running patched kernel with recommended debug options:
> > https://paste.fedoraproject.org/paste/UpezexD~tDmQthoxV2BFbg
> > 
> 
> Logs show there is a race, controller is suspended, then resumed,
> but no interrupt is pending in xhci_resume so roothubs are not resumed,
> and host starts to suspend again.
> 
> We get the interrupt only after we already started suspending xhci
> controller again.
> 
> My guess is that when we handle the interrupt we queue work to resume the roothub,
> but controller is probably put to D3 suspended state by then,
> returning 0xffffffff from some register reads, which driver understands as a dead host.
> 
> I need to look into this a bit more.

In this situation, the HCD_WAKEUP_PENDING(hcd) test in
hcd-pci.c:suspend_common() should prevent the controller from going
back into D3.  The WAKEUP_PENDING bit gets set in
usb_hcd_resume_root_hub() and it doesn't get cleared until
hcd_bus_resume() runs.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mathias Nyman April 24, 2018, 1:15 p.m. UTC | #4
On 23.04.2018 18:11, Alan Stern wrote:
> On Mon, 23 Apr 2018, Mathias Nyman wrote:
> 
>> On 22.04.2018 09:29, russianneuromancer@ya.ru wrote:
>>> Hello!
>>>
>>> So far I tested attached patch but didn't tried to revert commit yet,
>>> will do next week.
>>>
>>> Result of running patched kernel with recommended debug options:
>>> https://paste.fedoraproject.org/paste/UpezexD~tDmQthoxV2BFbg
>>>
>>
>> Logs show there is a race, controller is suspended, then resumed,
>> but no interrupt is pending in xhci_resume so roothubs are not resumed,
>> and host starts to suspend again.
>>
>> We get the interrupt only after we already started suspending xhci
>> controller again.
>>
>> My guess is that when we handle the interrupt we queue work to resume the roothub,
>> but controller is probably put to D3 suspended state by then,
>> returning 0xffffffff from some register reads, which driver understands as a dead host.
>>
>> I need to look into this a bit more.
> 
> In this situation, the HCD_WAKEUP_PENDING(hcd) test in
> hcd-pci.c:suspend_common() should prevent the controller from going
> back into D3.  The WAKEUP_PENDING bit gets set in
> usb_hcd_resume_root_hub() and it doesn't get cleared until
> hcd_bus_resume() runs.
> 

I think xhci never calls usb_hcd_resume_root_hub() in xhci_resume() in this
specific failing case

xhci_resume() has a check:
/* Resume root hubs only when have pending events. */
   status = readl(&xhci->op_regs->status);
     if (status & STS_EINT) {
       usb_hcd_resume_root_hub(xhci->shared_hcd);
       usb_hcd_resume_root_hub(hcd);
     }

If the check fails, then WAKEUP_PENDING bit is not set, and runtime PM
can suspend host controller again. when xhci driver finally gets to handle the interrupt
the controller may be in D3 already

This should only happen if xhci_resume() is called before xhci driver sees a pending interrupt,
could be possible as xhci has interrupt moderation enabled.

-Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern April 24, 2018, 1:24 p.m. UTC | #5
On Tue, 24 Apr 2018, Mathias Nyman wrote:

> On 23.04.2018 18:11, Alan Stern wrote:
> > On Mon, 23 Apr 2018, Mathias Nyman wrote:
> > 
> >> On 22.04.2018 09:29, russianneuromancer@ya.ru wrote:
> >>> Hello!
> >>>
> >>> So far I tested attached patch but didn't tried to revert commit yet,
> >>> will do next week.
> >>>
> >>> Result of running patched kernel with recommended debug options:
> >>> https://paste.fedoraproject.org/paste/UpezexD~tDmQthoxV2BFbg
> >>>
> >>
> >> Logs show there is a race, controller is suspended, then resumed,
> >> but no interrupt is pending in xhci_resume so roothubs are not resumed,
> >> and host starts to suspend again.
> >>
> >> We get the interrupt only after we already started suspending xhci
> >> controller again.
> >>
> >> My guess is that when we handle the interrupt we queue work to resume the roothub,
> >> but controller is probably put to D3 suspended state by then,
> >> returning 0xffffffff from some register reads, which driver understands as a dead host.
> >>
> >> I need to look into this a bit more.
> > 
> > In this situation, the HCD_WAKEUP_PENDING(hcd) test in
> > hcd-pci.c:suspend_common() should prevent the controller from going
> > back into D3.  The WAKEUP_PENDING bit gets set in
> > usb_hcd_resume_root_hub() and it doesn't get cleared until
> > hcd_bus_resume() runs.
> > 
> 
> I think xhci never calls usb_hcd_resume_root_hub() in xhci_resume() in this
> specific failing case
> 
> xhci_resume() has a check:
> /* Resume root hubs only when have pending events. */
>    status = readl(&xhci->op_regs->status);
>      if (status & STS_EINT) {
>        usb_hcd_resume_root_hub(xhci->shared_hcd);
>        usb_hcd_resume_root_hub(hcd);
>      }
> 
> If the check fails, then WAKEUP_PENDING bit is not set, and runtime PM
> can suspend host controller again. when xhci driver finally gets to handle the interrupt
> the controller may be in D3 already
> 
> This should only happen if xhci_resume() is called before xhci driver sees a pending interrupt,
> could be possible as xhci has interrupt moderation enabled.

Then maybe that test should be removed.  Calling 
usb_hcd_resume_root_hub() for every wakeup shouldn't be too bad, 
because there probably are not very many times when the controller gets 
resumed without the root hub also being resumed.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mathias Nyman April 24, 2018, 1:32 p.m. UTC | #6
On 24.04.2018 16:24, Alan Stern wrote:
> On Tue, 24 Apr 2018, Mathias Nyman wrote:
> 
>> On 23.04.2018 18:11, Alan Stern wrote:
>>> On Mon, 23 Apr 2018, Mathias Nyman wrote:
>>>
>>>> On 22.04.2018 09:29, russianneuromancer@ya.ru wrote:
>>>>> Hello!
>>>>>
>>>>> So far I tested attached patch but didn't tried to revert commit yet,
>>>>> will do next week.
>>>>>
>>>>> Result of running patched kernel with recommended debug options:
>>>>> https://paste.fedoraproject.org/paste/UpezexD~tDmQthoxV2BFbg
>>>>>
>>>>
>>>> Logs show there is a race, controller is suspended, then resumed,
>>>> but no interrupt is pending in xhci_resume so roothubs are not resumed,
>>>> and host starts to suspend again.
>>>>
>>>> We get the interrupt only after we already started suspending xhci
>>>> controller again.
>>>>
>>>> My guess is that when we handle the interrupt we queue work to resume the roothub,
>>>> but controller is probably put to D3 suspended state by then,
>>>> returning 0xffffffff from some register reads, which driver understands as a dead host.
>>>>
>>>> I need to look into this a bit more.
>>>
>>> In this situation, the HCD_WAKEUP_PENDING(hcd) test in
>>> hcd-pci.c:suspend_common() should prevent the controller from going
>>> back into D3.  The WAKEUP_PENDING bit gets set in
>>> usb_hcd_resume_root_hub() and it doesn't get cleared until
>>> hcd_bus_resume() runs.
>>>
>>
>> I think xhci never calls usb_hcd_resume_root_hub() in xhci_resume() in this
>> specific failing case
>>
>> xhci_resume() has a check:
>> /* Resume root hubs only when have pending events. */
>>     status = readl(&xhci->op_regs->status);
>>       if (status & STS_EINT) {
>>         usb_hcd_resume_root_hub(xhci->shared_hcd);
>>         usb_hcd_resume_root_hub(hcd);
>>       }
>>
>> If the check fails, then WAKEUP_PENDING bit is not set, and runtime PM
>> can suspend host controller again. when xhci driver finally gets to handle the interrupt
>> the controller may be in D3 already
>>
>> This should only happen if xhci_resume() is called before xhci driver sees a pending interrupt,
>> could be possible as xhci has interrupt moderation enabled.
> 
> Then maybe that test should be removed.  Calling
> usb_hcd_resume_root_hub() for every wakeup shouldn't be too bad,
> because there probably are not very many times when the controller gets
> resumed without the root hub also being resumed.
> 

The check was added to fix system suspend issue on a runtime suspended host:

commit d6236f6d1d885aa19d1cd7317346fe795227a3cc

     xhci: Fix runtime suspended xhci from blocking system suspend.
     
     The system suspend flow as following:
     1, Freeze all user processes and kenrel threads.
     
     2, Try to suspend all devices.
     
     2.1, If pci device is in RPM suspended state, then pci driver will try
     to resume it to RPM active state in the prepare stage.
     
     2.2, xhci_resume function calls usb_hcd_resume_root_hub to queue two
     workqueue items to resume usb2&usb3 roothub devices.
     
     2.3, Call suspend callbacks of devices.
     
     2.3.1, All suspend callbacks of all hcd's children, including
     roothub devices are called.
     
     2.3.2, Finally, hcd_pci_suspend callback is called.
     
     Due to workqueue threads were already frozen in step 1, the workqueue
     items can't be scheduled, and the roothub devices can't be resumed in
     this flow. The HCD_FLAG_WAKEUP_PENDING flag which is set in
     usb_hcd_resume_root_hub won't be cleared. Finally,
     hcd_pci_suspend will return -EBUSY, and system suspend fails.

-Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alan Stern April 24, 2018, 1:50 p.m. UTC | #7
On Tue, 24 Apr 2018, Mathias Nyman wrote:

> >>> In this situation, the HCD_WAKEUP_PENDING(hcd) test in
> >>> hcd-pci.c:suspend_common() should prevent the controller from going
> >>> back into D3.  The WAKEUP_PENDING bit gets set in
> >>> usb_hcd_resume_root_hub() and it doesn't get cleared until
> >>> hcd_bus_resume() runs.
> >>>
> >>
> >> I think xhci never calls usb_hcd_resume_root_hub() in xhci_resume() in this
> >> specific failing case
> >>
> >> xhci_resume() has a check:
> >> /* Resume root hubs only when have pending events. */
> >>     status = readl(&xhci->op_regs->status);
> >>       if (status & STS_EINT) {
> >>         usb_hcd_resume_root_hub(xhci->shared_hcd);
> >>         usb_hcd_resume_root_hub(hcd);
> >>       }
> >>
> >> If the check fails, then WAKEUP_PENDING bit is not set, and runtime PM
> >> can suspend host controller again. when xhci driver finally gets to handle the interrupt
> >> the controller may be in D3 already
> >>
> >> This should only happen if xhci_resume() is called before xhci driver sees a pending interrupt,
> >> could be possible as xhci has interrupt moderation enabled.
> > 
> > Then maybe that test should be removed.  Calling
> > usb_hcd_resume_root_hub() for every wakeup shouldn't be too bad,
> > because there probably are not very many times when the controller gets
> > resumed without the root hub also being resumed.
> > 
> 
> The check was added to fix system suspend issue on a runtime suspended host:
> 
> commit d6236f6d1d885aa19d1cd7317346fe795227a3cc
> 
>      xhci: Fix runtime suspended xhci from blocking system suspend.
>      
>      The system suspend flow as following:
>      1, Freeze all user processes and kenrel threads.
>      
>      2, Try to suspend all devices.
>      
>      2.1, If pci device is in RPM suspended state, then pci driver will try
>      to resume it to RPM active state in the prepare stage.
>      
>      2.2, xhci_resume function calls usb_hcd_resume_root_hub to queue two
>      workqueue items to resume usb2&usb3 roothub devices.
>      
>      2.3, Call suspend callbacks of devices.
>      
>      2.3.1, All suspend callbacks of all hcd's children, including
>      roothub devices are called.
>      
>      2.3.2, Finally, hcd_pci_suspend callback is called.
>      
>      Due to workqueue threads were already frozen in step 1, the workqueue
>      items can't be scheduled, and the roothub devices can't be resumed in
>      this flow. The HCD_FLAG_WAKEUP_PENDING flag which is set in
>      usb_hcd_resume_root_hub won't be cleared. Finally,
>      hcd_pci_suspend will return -EBUSY, and system suspend fails.

Hmmm.  I don't recall seeing this problem occur with ehci-hcd.  But 
then, I haven't tested it very much recently.

We could change to a different work queue, one that doesn't get 
frozen.  But there's no guarantee that the work items would run before 
your step 2.3.2.

Maybe we can avoid step 2.1.  I think there have been some recent
changes to the PM code in this area.  There may be a flag you can set
that will prevent the PCI core from resuming the host controller.

Or maybe we can change step 2.3.1, so that the root hub's suspend 
callback will first do a resume if the WAKEUP_PENDING flag is set.  
That might be the most reliable approach.

Alan Stern

--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mathias Nyman May 2, 2018, 2:47 p.m. UTC | #8
On 24.04.2018 16:50, Alan Stern wrote:
> On Tue, 24 Apr 2018, Mathias Nyman wrote:
> 
>>>>> In this situation, the HCD_WAKEUP_PENDING(hcd) test in
>>>>> hcd-pci.c:suspend_common() should prevent the controller from going
>>>>> back into D3.  The WAKEUP_PENDING bit gets set in
>>>>> usb_hcd_resume_root_hub() and it doesn't get cleared until
>>>>> hcd_bus_resume() runs.
>>>>>
>>>>
>>>> I think xhci never calls usb_hcd_resume_root_hub() in xhci_resume() in this
>>>> specific failing case
>>>>
>>>> xhci_resume() has a check:
>>>> /* Resume root hubs only when have pending events. */
>>>>      status = readl(&xhci->op_regs->status);
>>>>        if (status & STS_EINT) {
>>>>          usb_hcd_resume_root_hub(xhci->shared_hcd);
>>>>          usb_hcd_resume_root_hub(hcd);
>>>>        }
>>>>
>>>> If the check fails, then WAKEUP_PENDING bit is not set, and runtime PM
>>>> can suspend host controller again. when xhci driver finally gets to handle the interrupt
>>>> the controller may be in D3 already
>>>>
>>>> This should only happen if xhci_resume() is called before xhci driver sees a pending interrupt,
>>>> could be possible as xhci has interrupt moderation enabled.
>>>
>>> Then maybe that test should be removed.  Calling
>>> usb_hcd_resume_root_hub() for every wakeup shouldn't be too bad,
>>> because there probably are not very many times when the controller gets
>>> resumed without the root hub also being resumed.
>>>
>>
>> The check was added to fix system suspend issue on a runtime suspended host:
>>
>> commit d6236f6d1d885aa19d1cd7317346fe795227a3cc
>>
>>       xhci: Fix runtime suspended xhci from blocking system suspend.
>>       
>>       The system suspend flow as following:
>>       1, Freeze all user processes and kenrel threads.
>>       
>>       2, Try to suspend all devices.
>>       
>>       2.1, If pci device is in RPM suspended state, then pci driver will try
>>       to resume it to RPM active state in the prepare stage.
>>       
>>       2.2, xhci_resume function calls usb_hcd_resume_root_hub to queue two
>>       workqueue items to resume usb2&usb3 roothub devices.
>>       
>>       2.3, Call suspend callbacks of devices.
>>       
>>       2.3.1, All suspend callbacks of all hcd's children, including
>>       roothub devices are called.
>>       
>>       2.3.2, Finally, hcd_pci_suspend callback is called.
>>       
>>       Due to workqueue threads were already frozen in step 1, the workqueue
>>       items can't be scheduled, and the roothub devices can't be resumed in
>>       this flow. The HCD_FLAG_WAKEUP_PENDING flag which is set in
>>       usb_hcd_resume_root_hub won't be cleared. Finally,
>>       hcd_pci_suspend will return -EBUSY, and system suspend fails.
> 
> Hmmm.  I don't recall seeing this problem occur with ehci-hcd.  But
> then, I haven't tested it very much recently.
> 
> We could change to a different work queue, one that doesn't get
> frozen.  But there's no guarantee that the work items would run before
> your step 2.3.2.
> 
> Maybe we can avoid step 2.1.  I think there have been some recent
> changes to the PM code in this area.  There may be a flag you can set
> that will prevent the PCI core from resuming the host controller.
> 
> Or maybe we can change step 2.3.1, so that the root hub's suspend
> callback will first do a resume if the WAKEUP_PENDING flag is set.
> That might be the most reliable approach.
> 

I'm not sure I understand the last suggestion, could you open up how it
would work?

I started approaching this from another direction, mainly making sure we don't
immediately runtime suspend the host controller after resume. Adding a next_statechange
minimal time between resume and suspend, and checking for pending events in xhci_suspend().

I'll have some patches to test for russianneuromancer@ya.ru soon

These are generic checks that ehci_suspend() also does

-Mathias
--
To unsubscribe from this list: send the line "unsubscribe linux-usb" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

From 090b13a6df3f489a9781223dd959e03c2f81347b Mon Sep 17 00:00:00 2001
From: Mathias Nyman <mathias.nyman@linux.intel.com>
Date: Thu, 1 Mar 2018 18:48:32 +0200
Subject: [PATCH] xhci: prevent USB 2 roothub autosuspend during port resume
 signaling

xhci USB 2 roothub tries to autosuspended itself again immediately after
being resumed by a remote wake. This can be avoided by calling the
usb_hcd_start_port_resume() and usb_hcd_end_port_resume() implemented
especially for this purpose.

Use them, and prevent roothub autosuspend during resume signaling.

Suggested-by: Anshuman Gupta <anshuman.gupta@intel.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-hub.c  | 3 +++
 drivers/usb/host/xhci-ring.c | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index 72ebbc9..671a336 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -905,6 +905,7 @@  static u32 xhci_get_port_status(struct usb_hcd *hcd,
 
 				set_bit(wIndex, &bus_state->resuming_ports);
 				bus_state->resume_done[wIndex] = timeout;
+				usb_hcd_start_port_resume(&hcd->self, wIndex);
 				mod_timer(&hcd->rh_timer, timeout);
 			}
 		/* Has resume been signalled for USB_RESUME_TIME yet? */
@@ -930,6 +931,7 @@  static u32 xhci_get_port_status(struct usb_hcd *hcd,
 					msecs_to_jiffies(
 						XHCI_MAX_REXIT_TIMEOUT));
 			spin_lock_irqsave(&xhci->lock, flags);
+			usb_hcd_end_port_resume(&hcd->self, wIndex);
 
 			if (time_left) {
 				slot_id = xhci_find_slot_id_by_port(hcd,
@@ -970,6 +972,7 @@  static u32 xhci_get_port_status(struct usb_hcd *hcd,
 	    (raw_port_status & PORT_PLS_MASK) != XDEV_RESUME) {
 		bus_state->resume_done[wIndex] = 0;
 		clear_bit(wIndex, &bus_state->resuming_ports);
+		usb_hcd_end_port_resume(&hcd->self, wIndex);
 	}
 
 
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index daa94c3..a1cffe9 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1666,6 +1666,8 @@  static void handle_port_status(struct xhci_hcd *xhci,
 			bus_state->resume_done[faked_port_index] = jiffies +
 				msecs_to_jiffies(USB_RESUME_TIMEOUT);
 			set_bit(faked_port_index, &bus_state->resuming_ports);
+			usb_hcd_start_port_resume(&hcd->self, faked_port_index);
+
 			/* Do the rest in GetPortStatus after resume time delay.
 			 * Avoid polling roothub status before that so that a
 			 * usb device auto-resume latency around ~40ms.
-- 
2.7.4