diff mbox series

[v2] xhci: reduce xhci_handshake timeout in xhci_reset

Message ID 1644841216-1468-1-git-send-email-quic_pkondeti@quicinc.com (mailing list archive)
State New, archived
Headers show
Series [v2] xhci: reduce xhci_handshake timeout in xhci_reset | expand

Commit Message

Pavan Kondeti Feb. 14, 2022, 12:20 p.m. UTC
From: Daehwan Jung <dh10.jung@samsung.com>

xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
controller reset and controller ready operations can be fatal to the
system when controller is timed out. Reduce the timeout to 1 second
and print a error message when the time out happens.

Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
Signed-off-by: Daehwan Jung <dh10.jung@samsung.com>
Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
---

v2:
- Add error print statements in the code that change log refers to

 drivers/usb/host/xhci.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

Comments

Greg Kroah-Hartman Feb. 14, 2022, 12:29 p.m. UTC | #1
On Mon, Feb 14, 2022 at 05:50:16PM +0530, Pavankumar Kondeti wrote:
> From: Daehwan Jung <dh10.jung@samsung.com>
> 
> xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> controller reset and controller ready operations can be fatal to the
> system when controller is timed out. Reduce the timeout to 1 second
> and print a error message when the time out happens.
> 
> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> Signed-off-by: Daehwan Jung <dh10.jung@samsung.com>
> Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
> ---
> 
> v2:
> - Add error print statements in the code that change log refers to
> 
>  drivers/usb/host/xhci.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index dc357ca..bb9ea3f 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -196,9 +196,11 @@ int xhci_reset(struct xhci_hcd *xhci)
>  		udelay(1000);
>  
>  	ret = xhci_handshake(&xhci->op_regs->command,
> -			CMD_RESET, 0, 10 * 1000 * 1000);
> -	if (ret)
> +			CMD_RESET, 0, 1 * 1000 * 1000);
> +	if (ret) {
> +		xhci_err(xhci, "Host controller reset timed out\n");

A timeout is not the only error that could have happened here.  So why
claim that all errors are timeout errors?

How did you test this?

thanks,

greg k-h
Mathias Nyman Feb. 14, 2022, 12:51 p.m. UTC | #2
On 14.2.2022 14.20, Pavankumar Kondeti wrote:
> From: Daehwan Jung <dh10.jung@samsung.com>
> 
> xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> controller reset and controller ready operations can be fatal to the
> system when controller is timed out. Reduce the timeout to 1 second
> and print a error message when the time out happens.
> 
> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")


The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
intentionally increased the timeout to 10 seconds as that host might take 9
seconds to complete reset. This was done almost 10 years ago so I don't know
if it really is an issue anymore.

Anyways, your patch might break Renesas 72021 instead of fixing it.

I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.

Lets see if there is another solution for your case.

- Does a "guard interval" after writing the reset help?
  For example Intel xHCI needs 1ms before touching xHC after writing the reset bit

- Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit? 
 
- we only disable local interrupts when xhci_reset() is called from xhci_stop(),
  and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
  Have you identified which one is the problematic case?

  I think we halt the host in the above case first, meaning there should be no
  xHC interrupts when xhci_reset() is called. So if we could guarantee xhci interrupt
  isn't handled on this cpu, maybe we could somehow enable local interrupt after 
  halting the host?

  haven't really thought this true yet, but something like this could e investigated:

  spin_lock_irqsave()
  xhci_halt()
  < enable interrupts, magically turn spin_lock_irqsave() to just keeping spin lock>
  xhci_reset()
  spin_unlock()

-Mathias
Pavan Kondeti Feb. 14, 2022, 1:19 p.m. UTC | #3
On Mon, Feb 14, 2022 at 01:29:12PM +0100, Greg Kroah-Hartman wrote:
> On Mon, Feb 14, 2022 at 05:50:16PM +0530, Pavankumar Kondeti wrote:
> > From: Daehwan Jung <dh10.jung@samsung.com>
> > 
> > xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> > controller reset and controller ready operations can be fatal to the
> > system when controller is timed out. Reduce the timeout to 1 second
> > and print a error message when the time out happens.
> > 
> > Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> > Signed-off-by: Daehwan Jung <dh10.jung@samsung.com>
> > Signed-off-by: Pavankumar Kondeti <quic_pkondeti@quicinc.com>
> > ---
> > 
> > v2:
> > - Add error print statements in the code that change log refers to
> > 
> >  drivers/usb/host/xhci.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> > index dc357ca..bb9ea3f 100644
> > --- a/drivers/usb/host/xhci.c
> > +++ b/drivers/usb/host/xhci.c
> > @@ -196,9 +196,11 @@ int xhci_reset(struct xhci_hcd *xhci)
> >  		udelay(1000);
> >  
> >  	ret = xhci_handshake(&xhci->op_regs->command,
> > -			CMD_RESET, 0, 10 * 1000 * 1000);
> > -	if (ret)
> > +			CMD_RESET, 0, 1 * 1000 * 1000);
> > +	if (ret) {
> > +		xhci_err(xhci, "Host controller reset timed out\n");
> 
> A timeout is not the only error that could have happened here.  So why
> claim that all errors are timeout errors?

Thanks for pointing this out. xhci_handshake() can return an error code
other than -ETIMEDOUT. ret == -ETIMEDOUT check needs to be added or
just print the ret error in the print message.
> 
> How did you test this?

This is a hard to reproduce issue. So I have hacked the code to use 1 usec
instead of 1 sec as timeout and see running into the timeout issue. I know
this is not a perfect test. The test patch is included below.

Thanks,
Pavan


diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index e95a5bc..6147544 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -37,6 +37,9 @@ static unsigned long long quirks;
 module_param(quirks, ullong, S_IRUGO);
 MODULE_PARM_DESC(quirks, "Bit flags for quirks to be enabled as default");
 
+static int reset_error;
+module_param(reset_error, int, S_IRUGO | S_IWUSR);
+
 static bool td_on_ring(struct xhci_td *td, struct xhci_ring *ring)
 {
 	struct xhci_segment *seg = ring->first_seg;
@@ -195,8 +198,15 @@ int xhci_reset(struct xhci_hcd *xhci)
 	if (xhci->quirks & XHCI_INTEL_HOST)
 		udelay(1000);
 
-	ret = xhci_handshake(&xhci->op_regs->command,
-			CMD_RESET, 0, 1 * 1000 * 1000);
+	if (reset_error) {
+		xhci_err(xhci, "forcing timeout\n");
+		ret = xhci_handshake(&xhci->op_regs->command,
+				CMD_RESET, 0, 1); /* 1 usec */
+	} else {
+		ret = xhci_handshake(&xhci->op_regs->command,
+				CMD_RESET, 0, 1 * 1000 * 1000);
+	}
+
 	if (ret) {
 		xhci_err(xhci, "Host controller reset timed out\n");
 		return ret;
Pavan Kondeti Feb. 14, 2022, 1:53 p.m. UTC | #4
Hi Mathias,

On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
> On 14.2.2022 14.20, Pavankumar Kondeti wrote:
> > From: Daehwan Jung <dh10.jung@samsung.com>
> > 
> > xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> > controller reset and controller ready operations can be fatal to the
> > system when controller is timed out. Reduce the timeout to 1 second
> > and print a error message when the time out happens.
> > 
> > Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> 
> 
> The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> intentionally increased the timeout to 10 seconds as that host might take 9
> seconds to complete reset. This was done almost 10 years ago so I don't know
> if it really is an issue anymore.
> 
> Anyways, your patch might break Renesas 72021 instead of fixing it.

Unfortunately, yes :-( . We have this reduced timeout patch in our previous
commercialized products so thought this would be a good time to fix this
once for all. Since this patch has been 10 years long, not sure if any other
controllers also need 10 sec timeout. It would probably better

> 
> I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
> 
> Lets see if there is another solution for your case.
> 
> - Does a "guard interval" after writing the reset help?
>   For example Intel xHCI needs 1ms before touching xHC after writing the reset bit

I will ask this question to our hardware team. Setting that one quirk from
DWC3 host might require other changes like this [1].
> 
> - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit? 

The RESET bit never gets cleared from USBCMD register.

>  
> - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
>   and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
>   Have you identified which one is the problematic case?

The crash reports I have seen are pointing to

usb_remove_hcd()->xhci_stop()->xhci_reset()
> 
>   I think we halt the host in the above case first, meaning there should be no
>   xHC interrupts when xhci_reset() is called. So if we could guarantee xhci interrupt
>   isn't handled on this cpu, maybe we could somehow enable local interrupt after 
>   halting the host?
> 
>   haven't really thought this true yet, but something like this could e investigated:
> 
>   spin_lock_irqsave()
>   xhci_halt()
>   < enable interrupts, magically turn spin_lock_irqsave() to just keeping spin lock>
>   xhci_reset()
>   spin_unlock()

This is a very good suggestion. However, disabling preemption for 10 seconds
is also bad even on non-RT kernels like mobiles are using. Most of the SoCs
will have a watchdog which makes sure that all CPUs are schedulable and flag
this condition. The most important thread in the system could have just woken
on this CPU and it can't run until we drop the spin lock.
 
[1] https://lore.kernel.org/linux-usb/20220209055352.GA22550@hu-pkondeti-hyd.qualcomm.com/

Thanks,
Pavan
Jung Daehwan Feb. 15, 2022, 8:51 a.m. UTC | #5
On Mon, Feb 14, 2022 at 07:23:10PM +0530, Pavan Kondeti wrote:
> Hi Mathias,
> 
> On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
> > On 14.2.2022 14.20, Pavankumar Kondeti wrote:
> > > From: Daehwan Jung <dh10.jung@samsung.com>
> > > 
> > > xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> > > controller reset and controller ready operations can be fatal to the
> > > system when controller is timed out. Reduce the timeout to 1 second
> > > and print a error message when the time out happens.
> > > 
> > > Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> > 
> > 
> > The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> > intentionally increased the timeout to 10 seconds as that host might take 9
> > seconds to complete reset. This was done almost 10 years ago so I don't know
> > if it really is an issue anymore.
> > 
> > Anyways, your patch might break Renesas 72021 instead of fixing it.
> 
> Unfortunately, yes :-( . We have this reduced timeout patch in our previous
> commercialized products so thought this would be a good time to fix this
> once for all. Since this patch has been 10 years long, not sure if any other
> controllers also need 10 sec timeout. It would probably better
> 
> > 
> > I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
> > 
> > Lets see if there is another solution for your case.
> > 
> > - Does a "guard interval" after writing the reset help?
> >   For example Intel xHCI needs 1ms before touching xHC after writing the reset bit
> 
> I will ask this question to our hardware team. Setting that one quirk from
> DWC3 host might require other changes like this [1].
> > 
> > - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit? 
> 
> The RESET bit never gets cleared from USBCMD register.
> 
> >  
> > - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
> >   and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
> >   Have you identified which one is the problematic case?
> 
> The crash reports I have seen are pointing to
> 
> usb_remove_hcd()->xhci_stop()->xhci_reset()
> > 
> >   I think we halt the host in the above case first, meaning there should be no
> >   xHC interrupts when xhci_reset() is called. So if we could guarantee xhci interrupt
> >   isn't handled on this cpu, maybe we could somehow enable local interrupt after 
> >   halting the host?
> > 
> >   haven't really thought this true yet, but something like this could e investigated:
> > 
> >   spin_lock_irqsave()
> >   xhci_halt()
> >   < enable interrupts, magically turn spin_lock_irqsave() to just keeping spin lock>
> >   xhci_reset()
> >   spin_unlock()
> 
> This is a very good suggestion. However, disabling preemption for 10 seconds
> is also bad even on non-RT kernels like mobiles are using. Most of the SoCs
> will have a watchdog which makes sure that all CPUs are schedulable and flag
> this condition. The most important thread in the system could have just woken
> on this CPU and it can't run until we drop the spin lock.
>  

Hi,

I also think it doesn't make sense 10 secs timeout with irqs disabled.
It could cause critical system problem as Pavan said. How about adding
new quirk for different timeout value?

Best Regards,
Jung Daehwan

> [1] https://protect2.fireeye.com/v1/url?k=31f83ce7-6e63042d-31f9b7a8-0cc47a31bee8-8c4a74e439cea40a&q=1&e=5359ff44-4a70-42b7-b593-e37b236454ca&u=https%3A%2F%2Flore.kernel.org%2Flinux-usb%2F20220209055352.GA22550%40hu-pkondeti-hyd.qualcomm.com%2F
> 
> Thanks,
> Pavan
>
Oliver Neukum Feb. 15, 2022, 9:43 a.m. UTC | #6
On 15.02.22 09:51, Jung Daehwan wrote:
>
> I also think it doesn't make sense 10 secs timeout with irqs disabled.
> It could cause critical system problem as Pavan said. How about adding
> new quirk for different timeout value?
>
Hi,

I am afraid this amounts to putting bandages onto a fundamental
flaw. We need a version of xhci_reset() that can sleep and if
you need interrupts to be off to reset the HC, you are doing a bad design.

    Regards
        Oliver
Mathias Nyman Feb. 15, 2022, 10:16 a.m. UTC | #7
On 14.2.2022 15.53, Pavan Kondeti wrote:
> Hi Mathias,
> 
> On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
>> On 14.2.2022 14.20, Pavankumar Kondeti wrote:
>>> From: Daehwan Jung <dh10.jung@samsung.com>
>>>
>>> xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
>>> controller reset and controller ready operations can be fatal to the
>>> system when controller is timed out. Reduce the timeout to 1 second
>>> and print a error message when the time out happens.
>>>
>>> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
>>
>>
>> The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
>> intentionally increased the timeout to 10 seconds as that host might take 9
>> seconds to complete reset. This was done almost 10 years ago so I don't know
>> if it really is an issue anymore.
>>
>> Anyways, your patch might break Renesas 72021 instead of fixing it.
> 
> Unfortunately, yes :-( . We have this reduced timeout patch in our previous
> commercialized products so thought this would be a good time to fix this
> once for all. Since this patch has been 10 years long, not sure if any other
> controllers also need 10 sec timeout. It would probably better
> 
>>
>> I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
>>
>> Lets see if there is another solution for your case.
>>
>> - Does a "guard interval" after writing the reset help?
>>   For example Intel xHCI needs 1ms before touching xHC after writing the reset bit
> 
> I will ask this question to our hardware team. Setting that one quirk from
> DWC3 host might require other changes like this [1].
>>
>> - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit? 
> 
> The RESET bit never gets cleared from USBCMD register.
> 
>>  
>> - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
>>   and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
>>   Have you identified which one is the problematic case?
> 
> The crash reports I have seen are pointing to
> 
> usb_remove_hcd()->xhci_stop()->xhci_reset()

Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
disabled and spinlock held. In both these cases we're not that interested in the
outcome of xhci_reset().

But during probe we call xhci_reset() with interrupts enabled without spinlock,
and here we really care about it succeeding.
I'm also guessing reset could take a longer time during probe due to possible recent
BIOS handover, or firmware loading etc.

So how about passing a timeout value to xhci_reset()?
Give it 10 seconds during probe, and 250ms in the other cases.

Then dig into the reason why xhci_stop(), xhci_resume() and xhci_shutdown() call
xhci_reset() witch spin_lock_irq() held.

-Mathias
Pavan Kondeti Feb. 15, 2022, 10:49 a.m. UTC | #8
Hi Mathias,

On Tue, Feb 15, 2022 at 12:16:12PM +0200, Mathias Nyman wrote:
> On 14.2.2022 15.53, Pavan Kondeti wrote:
> > Hi Mathias,
> > 
> > On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
> >> On 14.2.2022 14.20, Pavankumar Kondeti wrote:
> >>> From: Daehwan Jung <dh10.jung@samsung.com>
> >>>
> >>> xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> >>> controller reset and controller ready operations can be fatal to the
> >>> system when controller is timed out. Reduce the timeout to 1 second
> >>> and print a error message when the time out happens.
> >>>
> >>> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> >>
> >>
> >> The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> >> intentionally increased the timeout to 10 seconds as that host might take 9
> >> seconds to complete reset. This was done almost 10 years ago so I don't know
> >> if it really is an issue anymore.
> >>
> >> Anyways, your patch might break Renesas 72021 instead of fixing it.
> > 
> > Unfortunately, yes :-( . We have this reduced timeout patch in our previous
> > commercialized products so thought this would be a good time to fix this
> > once for all. Since this patch has been 10 years long, not sure if any other
> > controllers also need 10 sec timeout. It would probably better
> > 
> >>
> >> I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
> >>
> >> Lets see if there is another solution for your case.
> >>
> >> - Does a "guard interval" after writing the reset help?
> >>   For example Intel xHCI needs 1ms before touching xHC after writing the reset bit
> > 
> > I will ask this question to our hardware team. Setting that one quirk from
> > DWC3 host might require other changes like this [1].
> >>
> >> - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit? 
> > 
> > The RESET bit never gets cleared from USBCMD register.
> > 
> >>  
> >> - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
> >>   and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
> >>   Have you identified which one is the problematic case?
> > 
> > The crash reports I have seen are pointing to
> > 
> > usb_remove_hcd()->xhci_stop()->xhci_reset()
> 
> Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
> disabled and spinlock held. In both these cases we're not that interested in the
> outcome of xhci_reset().
> 
> But during probe we call xhci_reset() with interrupts enabled without spinlock,
> and here we really care about it succeeding.
> I'm also guessing reset could take a longer time during probe due to possible recent
> BIOS handover, or firmware loading etc.
> 
> So how about passing a timeout value to xhci_reset()?
> Give it 10 seconds during probe, and 250ms in the other cases.
> 

Thanks for this suggestion.

This sounds better compared to the quirks approach. xhci_resume() also seems
to be calling xhci_reset() in the hibernation path, I believe we should treat
this like probe()/startup case and give larger timeout.

Thanks,
Pavan
Pavan Kondeti Feb. 15, 2022, 5:07 p.m. UTC | #9
On Tue, Feb 15, 2022 at 04:19:20PM +0530, Pavan Kondeti wrote:
> Hi Mathias,
> 
> On Tue, Feb 15, 2022 at 12:16:12PM +0200, Mathias Nyman wrote:
> > On 14.2.2022 15.53, Pavan Kondeti wrote:
> > > Hi Mathias,
> > > 
> > > On Mon, Feb 14, 2022 at 02:51:54PM +0200, Mathias Nyman wrote:
> > >> On 14.2.2022 14.20, Pavankumar Kondeti wrote:
> > >>> From: Daehwan Jung <dh10.jung@samsung.com>
> > >>>
> > >>> xhci_reset() is called with interrupts disabled. Waiting 10 seconds for
> > >>> controller reset and controller ready operations can be fatal to the
> > >>> system when controller is timed out. Reduce the timeout to 1 second
> > >>> and print a error message when the time out happens.
> > >>>
> > >>> Fixes: 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> > >>
> > >>
> > >> The commit 22ceac191211 ("xhci: Increase reset timeout for Renesas 720201 host.")
> > >> intentionally increased the timeout to 10 seconds as that host might take 9
> > >> seconds to complete reset. This was done almost 10 years ago so I don't know
> > >> if it really is an issue anymore.
> > >>
> > >> Anyways, your patch might break Renesas 72021 instead of fixing it.
> > > 
> > > Unfortunately, yes :-( . We have this reduced timeout patch in our previous
> > > commercialized products so thought this would be a good time to fix this
> > > once for all. Since this patch has been 10 years long, not sure if any other
> > > controllers also need 10 sec timeout. It would probably better
> > > 
> > >>
> > >> I agree that busylooping up to 10 seconds with interrupts disabled doesn't make sense.
> > >>
> > >> Lets see if there is another solution for your case.
> > >>
> > >> - Does a "guard interval" after writing the reset help?
> > >>   For example Intel xHCI needs 1ms before touching xHC after writing the reset bit
> > > 
> > > I will ask this question to our hardware team. Setting that one quirk from
> > > DWC3 host might require other changes like this [1].
> > >>
> > >> - Is it the CNR bit or the RESET bit that fails? could be just stuck CNR bit? 
> > > 
> > > The RESET bit never gets cleared from USBCMD register.
> > > 
> > >>  
> > >> - we only disable local interrupts when xhci_reset() is called from xhci_stop(),
> > >>   and sometimes from xhci_shutdown() and xhci_resume() if some conditions are met.
> > >>   Have you identified which one is the problematic case?
> > > 
> > > The crash reports I have seen are pointing to
> > > 
> > > usb_remove_hcd()->xhci_stop()->xhci_reset()
> > 
> > Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
> > disabled and spinlock held. In both these cases we're not that interested in the
> > outcome of xhci_reset().
> > 
> > But during probe we call xhci_reset() with interrupts enabled without spinlock,
> > and here we really care about it succeeding.
> > I'm also guessing reset could take a longer time during probe due to possible recent
> > BIOS handover, or firmware loading etc.
> > 
> > So how about passing a timeout value to xhci_reset()?
> > Give it 10 seconds during probe, and 250ms in the other cases.
> > 
> 
> Thanks for this suggestion.
> 
> This sounds better compared to the quirks approach. xhci_resume() also seems
> to be calling xhci_reset() in the hibernation path, I believe we should treat
> this like probe()/startup case and give larger timeout.
> 
I will test the below patch as per Mathias suggestion.

Thanks,
Pavan

diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
index df3522d..031fe90 100644
--- a/drivers/usb/host/xhci-hub.c
+++ b/drivers/usb/host/xhci-hub.c
@@ -762,7 +762,7 @@ static int xhci_exit_test_mode(struct xhci_hcd *xhci)
 	}
 	pm_runtime_allow(xhci_to_hcd(xhci)->self.controller);
 	xhci->test_mode = 0;
-	return xhci_reset(xhci);
+	return xhci_reset(xhci, false);
 }
 
 void xhci_set_link_state(struct xhci_hcd *xhci, struct xhci_port *port,
diff --git a/drivers/usb/host/xhci-mem.c b/drivers/usb/host/xhci-mem.c
index 0e31206..174dccd 100644
--- a/drivers/usb/host/xhci-mem.c
+++ b/drivers/usb/host/xhci-mem.c
@@ -2583,7 +2583,7 @@ int xhci_mem_init(struct xhci_hcd *xhci, gfp_t flags)
 
 fail:
 	xhci_halt(xhci);
-	xhci_reset(xhci);
+	xhci_reset(xhci, false);
 	xhci_mem_cleanup(xhci);
 	return -ENOMEM;
 }
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index f5b1bcc..2137581 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -162,7 +162,7 @@ int xhci_start(struct xhci_hcd *xhci)
  * Transactions will be terminated immediately, and operational registers
  * will be set to their defaults.
  */
-int xhci_reset(struct xhci_hcd *xhci)
+int xhci_reset(struct xhci_hcd *xhci, bool must_succeed)
 {
 	u32 command;
 	u32 state;
@@ -195,8 +195,16 @@ int xhci_reset(struct xhci_hcd *xhci)
 	if (xhci->quirks & XHCI_INTEL_HOST)
 		udelay(1000);
 
+	/*
+	 * xhci_reset() can be called with interrupts disabled, use
+	 * 10 seconds timeout during start/resume only. stop/shutdown
+	 * can get away with lower timeout as controller reset failure
+	 * is not fatal.
+	 */
 	ret = xhci_handshake(&xhci->op_regs->command,
-			CMD_RESET, 0, 10 * 1000 * 1000);
+			CMD_RESET, 0, must_succeed ?
+			10 * 1000 * 1000 :
+			250 * 1000);
 	if (ret)
 		return ret;
 
@@ -210,7 +218,9 @@ int xhci_reset(struct xhci_hcd *xhci)
 	 * than status until the "Controller Not Ready" flag is cleared.
 	 */
 	ret = xhci_handshake(&xhci->op_regs->status,
-			STS_CNR, 0, 10 * 1000 * 1000);
+			STS_CNR, 0, must_succeed ?
+			10 * 1000 * 1000 :
+			250 * 1000);
 
 	xhci->usb2_rhub.bus_state.port_c_suspend = 0;
 	xhci->usb2_rhub.bus_state.suspended_ports = 0;
@@ -731,7 +741,7 @@ static void xhci_stop(struct usb_hcd *hcd)
 	xhci->xhc_state |= XHCI_STATE_HALTED;
 	xhci->cmd_ring_state = CMD_RING_STATE_STOPPED;
 	xhci_halt(xhci);
-	xhci_reset(xhci);
+	xhci_reset(xhci, false);
 	spin_unlock_irq(&xhci->lock);
 
 	xhci_cleanup_msix(xhci);
@@ -784,7 +794,7 @@ void xhci_shutdown(struct usb_hcd *hcd)
 	xhci_halt(xhci);
 	/* Workaround for spurious wakeups at shutdown with HSW */
 	if (xhci->quirks & XHCI_SPURIOUS_WAKEUP)
-		xhci_reset(xhci);
+		xhci_reset(xhci, false);
 	spin_unlock_irq(&xhci->lock);
 
 	xhci_cleanup_msix(xhci);
@@ -1163,7 +1173,7 @@ int xhci_resume(struct xhci_hcd *xhci, bool hibernated)
 		xhci_dbg(xhci, "Stop HCD\n");
 		xhci_halt(xhci);
 		xhci_zero_64b_regs(xhci);
-		retval = xhci_reset(xhci);
+		retval = xhci_reset(xhci, true);
 		spin_unlock_irq(&xhci->lock);
 		if (retval)
 			return retval;
@@ -5308,7 +5318,7 @@ int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks)
 
 	xhci_dbg(xhci, "Resetting HCD\n");
 	/* Reset the internal HC memory state and registers. */
-	retval = xhci_reset(xhci);
+	retval = xhci_reset(xhci, true);
 	if (retval)
 		return retval;
 	xhci_dbg(xhci, "Reset complete\n");
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index 5a75fe5..331a38f 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -2087,7 +2087,7 @@ int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, int usec);
 void xhci_quiesce(struct xhci_hcd *xhci);
 int xhci_halt(struct xhci_hcd *xhci);
 int xhci_start(struct xhci_hcd *xhci);
-int xhci_reset(struct xhci_hcd *xhci);
+int xhci_reset(struct xhci_hcd *xhci, bool must_succeed);
 int xhci_run(struct usb_hcd *hcd);
 int xhci_gen_setup(struct usb_hcd *hcd, xhci_get_quirks_t get_quirks);
 void xhci_shutdown(struct usb_hcd *hcd);
Mathias Nyman Feb. 16, 2022, 3:58 p.m. UTC | #10
On 15.2.2022 19.07, Pavan Kondeti wrote:
>>>>
>>>> The crash reports I have seen are pointing to
>>>>
>>>> usb_remove_hcd()->xhci_stop()->xhci_reset()
>>>
>>> Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
>>> disabled and spinlock held. In both these cases we're not that interested in the
>>> outcome of xhci_reset().
>>>
>>> But during probe we call xhci_reset() with interrupts enabled without spinlock,
>>> and here we really care about it succeeding.
>>> I'm also guessing reset could take a longer time during probe due to possible recent
>>> BIOS handover, or firmware loading etc.
>>>
>>> So how about passing a timeout value to xhci_reset()?
>>> Give it 10 seconds during probe, and 250ms in the other cases.
>>>
>>
>> Thanks for this suggestion.
>>
>> This sounds better compared to the quirks approach. xhci_resume() also seems
>> to be calling xhci_reset() in the hibernation path, I believe we should treat
>> this like probe()/startup case and give larger timeout.
>>
> I will test the below patch as per Mathias suggestion.
> 
> Thanks,
> Pavan
> 
> diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
> index df3522d..031fe90 100644
> --- a/drivers/usb/host/xhci-hub.c
> +++ b/drivers/usb/host/xhci-hub.c
> @@ -762,7 +762,7 @@ static int xhci_exit_test_mode(struct xhci_hcd *xhci)
>  	}
>  	pm_runtime_allow(xhci_to_hcd(xhci)->self.controller);
>  	xhci->test_mode = 0;
> -	return xhci_reset(xhci);
> +	return xhci_reset(xhci, false);

Maybe just pass the timeout value directly to xhci_reset().
Looks like readl_poll_timeout_atomic() uses u64 for timeout_us,
makes sense to use the same.

Sergey also pointed out xhci_handshake() incorrectly uses a signed integer for timeouts.
This could be changed to u64 as well.

I'll write a patch that does all above

-Mathias
Sergey Shtylyov Feb. 16, 2022, 4:44 p.m. UTC | #11
On 2/16/22 6:58 PM, Mathias Nyman wrote:

>>>>> The crash reports I have seen are pointing to
>>>>>
>>>>> usb_remove_hcd()->xhci_stop()->xhci_reset()
>>>>
>>>> Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
>>>> disabled and spinlock held. In both these cases we're not that interested in the
>>>> outcome of xhci_reset().
>>>>
>>>> But during probe we call xhci_reset() with interrupts enabled without spinlock,
>>>> and here we really care about it succeeding.
>>>> I'm also guessing reset could take a longer time during probe due to possible recent
>>>> BIOS handover, or firmware loading etc.
>>>>
>>>> So how about passing a timeout value to xhci_reset()?
>>>> Give it 10 seconds during probe, and 250ms in the other cases.
>>>>
>>>
>>> Thanks for this suggestion.
>>>
>>> This sounds better compared to the quirks approach. xhci_resume() also seems
>>> to be calling xhci_reset() in the hibernation path, I believe we should treat
>>> this like probe()/startup case and give larger timeout.
>>>
>> I will test the below patch as per Mathias suggestion.
>>
>> Thanks,
>> Pavan
>>
>> diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
>> index df3522d..031fe90 100644
>> --- a/drivers/usb/host/xhci-hub.c
>> +++ b/drivers/usb/host/xhci-hub.c
>> @@ -762,7 +762,7 @@ static int xhci_exit_test_mode(struct xhci_hcd *xhci)
>>  	}
>>  	pm_runtime_allow(xhci_to_hcd(xhci)->self.controller);
>>  	xhci->test_mode = 0;
>> -	return xhci_reset(xhci);
>> +	return xhci_reset(xhci, false);
> 
> Maybe just pass the timeout value directly to xhci_reset().
> Looks like readl_poll_timeout_atomic() uses u64 for timeout_us,
> makes sense to use the same.
> 
> Sergey also pointed out xhci_handshake() incorrectly uses a signed integer for timeouts.
> This could be changed to u64 as well.
> 
> I'll write a patch that does all above

   You mean I don't need to respin my xhci_handshake() patch?
   I'm happy to do that if that's a prevailing opinion. :-)

> -Mathias

MBR, Sergey
Pavan Kondeti Feb. 17, 2022, 3:03 a.m. UTC | #12
Hi Mathias,

On Wed, Feb 16, 2022 at 05:58:15PM +0200, Mathias Nyman wrote:
> On 15.2.2022 19.07, Pavan Kondeti wrote:
> >>>>
> >>>> The crash reports I have seen are pointing to
> >>>>
> >>>> usb_remove_hcd()->xhci_stop()->xhci_reset()
> >>>
> >>> Ok, so xhci_stop() and xhci_shutdown() both may call xhci_reset() with interrupts
> >>> disabled and spinlock held. In both these cases we're not that interested in the
> >>> outcome of xhci_reset().
> >>>
> >>> But during probe we call xhci_reset() with interrupts enabled without spinlock,
> >>> and here we really care about it succeeding.
> >>> I'm also guessing reset could take a longer time during probe due to possible recent
> >>> BIOS handover, or firmware loading etc.
> >>>
> >>> So how about passing a timeout value to xhci_reset()?
> >>> Give it 10 seconds during probe, and 250ms in the other cases.
> >>>
> >>
> >> Thanks for this suggestion.
> >>
> >> This sounds better compared to the quirks approach. xhci_resume() also seems
> >> to be calling xhci_reset() in the hibernation path, I believe we should treat
> >> this like probe()/startup case and give larger timeout.
> >>
> > I will test the below patch as per Mathias suggestion.
> > 
> > Thanks,
> > Pavan
> > 
> > diff --git a/drivers/usb/host/xhci-hub.c b/drivers/usb/host/xhci-hub.c
> > index df3522d..031fe90 100644
> > --- a/drivers/usb/host/xhci-hub.c
> > +++ b/drivers/usb/host/xhci-hub.c
> > @@ -762,7 +762,7 @@ static int xhci_exit_test_mode(struct xhci_hcd *xhci)
> >  	}
> >  	pm_runtime_allow(xhci_to_hcd(xhci)->self.controller);
> >  	xhci->test_mode = 0;
> > -	return xhci_reset(xhci);
> > +	return xhci_reset(xhci, false);
> 
> Maybe just pass the timeout value directly to xhci_reset().
> Looks like readl_poll_timeout_atomic() uses u64 for timeout_us,
> makes sense to use the same.
> 
> Sergey also pointed out xhci_handshake() incorrectly uses a signed integer for timeouts.
> This could be changed to u64 as well.
> 
> I'll write a patch that does all above
> 
Thank you. I will look forward to your patch and provide the test results with
it.

Thanks,
Pavan
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index dc357ca..bb9ea3f 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -196,9 +196,11 @@  int xhci_reset(struct xhci_hcd *xhci)
 		udelay(1000);
 
 	ret = xhci_handshake(&xhci->op_regs->command,
-			CMD_RESET, 0, 10 * 1000 * 1000);
-	if (ret)
+			CMD_RESET, 0, 1 * 1000 * 1000);
+	if (ret) {
+		xhci_err(xhci, "Host controller reset timed out\n");
 		return ret;
+	}
 
 	if (xhci->quirks & XHCI_ASMEDIA_MODIFY_FLOWCONTROL)
 		usb_asmedia_modifyflowcontrol(to_pci_dev(xhci_to_hcd(xhci)->self.controller));
@@ -210,7 +212,11 @@  int xhci_reset(struct xhci_hcd *xhci)
 	 * than status until the "Controller Not Ready" flag is cleared.
 	 */
 	ret = xhci_handshake(&xhci->op_regs->status,
-			STS_CNR, 0, 10 * 1000 * 1000);
+			STS_CNR, 0, 1 * 1000 * 1000);
+	if (ret) {
+		xhci_err(xhci, "Host controller is not ready within timeout\n");
+		return ret;
+	}
 
 	xhci->usb2_rhub.bus_state.port_c_suspend = 0;
 	xhci->usb2_rhub.bus_state.suspended_ports = 0;