diff mbox series

[v2] xhci: dbc: fix handling ClearFeature Halt requests

Message ID 20240725074857.623299-1-ukaszb@chromium.org (mailing list archive)
State New
Headers show
Series [v2] xhci: dbc: fix handling ClearFeature Halt requests | expand

Commit Message

Łukasz Bartosik July 25, 2024, 7:48 a.m. UTC
DbC driver does not handle ClearFeature Halt requests correctly
which in turn may lead to dropping packets on the receive path.

Below is a trace capture where due to incorrect handling of
ClearFeature Halt packet gets dropped on the receive path.

/sys/kernel/debug/tracing # cat trace
1) kworker/10:3-514   [010] d..1.  2925.601843: xhci_dbc_handle_event:
	EVENT: TRB 000000019588c0e0 status 'Stall Error' len 0 slot 1 ep 2
	type 'Transfer Event' flags e:C

2) kworker/10:3-514   [010] d..1.  2925.613285: xhci_dbc_handle_event:
	EVENT: TRB 000000019588bc80 status 'Stall Error' len 1024 slot 1
	ep 3 type 'Transfer Event' flags e:C

3) kworker/10:3-514   [010] d..1.  2925.619024: xhci_dbc_handle_transfer:
	BULK: Buffer 0000000244b49800 length 1024 TD size 0 intr 0 type
	'Normal' flags b:i:I:c:s:i:e:C

4) kworker/10:3-514   [010] d..1.  2925.619025: xhci_dbc_giveback_request:
	bulk-in: req 00000000a70b5ad2 length 0/1024 ==> -6

5) kworker/10:3-514   [010] dNs2.  2925.623820: xhci_dbc_gadget_ep_queue:
	BULK: Buffer 0000000244b49800 length 1024 TD size 0 intr 0 type
	'Normal' flags b:i:I:c:s:i:e:c

6) kworker/10:3-514   [010] dNs1.  2925.623823: xhci_dbc_queue_request:
	bulk-in: req 00000000a70b5ad2 length 0/1024 ==> -115

7) kworker/10:3-514   [010] d..1.  2925.629865: xhci_dbc_handle_event:
	EVENT: TRB 000000019588bc80 status 'Short Packet' len 1000 slot 1
	ep 3 type 'Transfer Event' flags e:C

8) kworker/10:3-514   [010] d..1.  2925.635540: xhci_dbc_handle_event:
	EVENT: TRB 000000019588bc90 status 'Short Packet' len 763 slot 1
	ep 3 type 'Transfer Event' flags e:C

9) kworker/10:3-514   [010] d..1.  2925.635540: xhci_dbc_handle_transfer:
	BULK: Buffer 0000000244b49c00 length 1024 TD size 0 intr 0 type
	'Normal' flags b:i:I:c:s:i:e:C

10) kworker/10:3-514  [010] d..1.  2925.635541: xhci_dbc_giveback_request:
	bulk-in: req 00000000b4ec77d7 length 261/1024 ==> 0

11) kworker/10:3-514  [010] dNs2.  2925.635561: xhci_dbc_gadget_ep_queue:
	BULK: Buffer 0000000244b49c00 length 1024 TD size 0 intr 0 type
	'Normal' flags b:i:I:c:s:i:e:c

12) kworker/10:3-514  [010] dNs1.  2925.635562: xhci_dbc_queue_request:
	bulk-in: req 00000000b4ec77d7 length 0/1024 ==> -115

Lines 1 and 2 are Endpoints OUT and IN responses to receiving ClearFeature
Halt requests.

Line 7 notifies of reception of 24 bytes packet.

Line 8 notifies of reception of 261 bytes packet

In Lines [9, 12] 261 bytes packet is being processed.

However 24 bytes packet gets dropped. The kernel log includes entry which
is an indication of a packet drop:
[  127.651845] xhci_hcd 0000:00:0d.0: no matched request

This fix adds correct handling of ClearFeature Halt requests
by restarting an endpoint which received the request.

Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
Signed-off-by: Łukasz Bartosik <ukaszb@chromium.org>
---
Changes v2->v1:
- Documented both xhci_dbc_flush_single_request and 
xhci_dbc_flush_endpoint_requests functions.
---
 drivers/usb/host/xhci-dbgcap.c | 48 +++++++++++++++++++++++++++-------
 drivers/usb/host/xhci-dbgtty.c |  5 ++++
 2 files changed, 43 insertions(+), 10 deletions(-)

Comments

Mathias Nyman July 29, 2024, 2:12 p.m. UTC | #1
Hi

On 25.7.2024 10.48, Łukasz Bartosik wrote:
> DbC driver does not handle ClearFeature Halt requests correctly
> which in turn may lead to dropping packets on the receive path.

Nice catch.
Looks like a halted endpoint is treated almost as a disconnect.

> 
> Below is a trace capture where due to incorrect handling of
> ClearFeature Halt packet gets dropped on the receive path.
> 
> /sys/kernel/debug/tracing # cat trace
> 1) kworker/10:3-514   [010] d..1.  2925.601843: xhci_dbc_handle_event:
> 	EVENT: TRB 000000019588c0e0 status 'Stall Error' len 0 slot 1 ep 2
> 	type 'Transfer Event' flags e:C
> 
> 2) kworker/10:3-514   [010] d..1.  2925.613285: xhci_dbc_handle_event:
> 	EVENT: TRB 000000019588bc80 status 'Stall Error' len 1024 slot 1
> 	ep 3 type 'Transfer Event' flags e:C
> 
> 3) kworker/10:3-514   [010] d..1.  2925.619024: xhci_dbc_handle_transfer:
> 	BULK: Buffer 0000000244b49800 length 1024 TD size 0 intr 0 type
> 	'Normal' flags b:i:I:c:s:i:e:C
> 
> 4) kworker/10:3-514   [010] d..1.  2925.619025: xhci_dbc_giveback_request:
> 	bulk-in: req 00000000a70b5ad2 length 0/1024 ==> -6
> 
> 5) kworker/10:3-514   [010] dNs2.  2925.623820: xhci_dbc_gadget_ep_queue:
> 	BULK: Buffer 0000000244b49800 length 1024 TD size 0 intr 0 type
> 	'Normal' flags b:i:I:c:s:i:e:c
> 
> 6) kworker/10:3-514   [010] dNs1.  2925.623823: xhci_dbc_queue_request:
> 	bulk-in: req 00000000a70b5ad2 length 0/1024 ==> -115
> 
> 7) kworker/10:3-514   [010] d..1.  2925.629865: xhci_dbc_handle_event:
> 	EVENT: TRB 000000019588bc80 status 'Short Packet' len 1000 slot 1
> 	ep 3 type 'Transfer Event' flags e:C
> 
> 8) kworker/10:3-514   [010] d..1.  2925.635540: xhci_dbc_handle_event:
> 	EVENT: TRB 000000019588bc90 status 'Short Packet' len 763 slot 1
> 	ep 3 type 'Transfer Event' flags e:C
> 
> 9) kworker/10:3-514   [010] d..1.  2925.635540: xhci_dbc_handle_transfer:
> 	BULK: Buffer 0000000244b49c00 length 1024 TD size 0 intr 0 type
> 	'Normal' flags b:i:I:c:s:i:e:C
> 
> 10) kworker/10:3-514  [010] d..1.  2925.635541: xhci_dbc_giveback_request:
> 	bulk-in: req 00000000b4ec77d7 length 261/1024 ==> 0
> 
> 11) kworker/10:3-514  [010] dNs2.  2925.635561: xhci_dbc_gadget_ep_queue:
> 	BULK: Buffer 0000000244b49c00 length 1024 TD size 0 intr 0 type
> 	'Normal' flags b:i:I:c:s:i:e:c
> 
> 12) kworker/10:3-514  [010] dNs1.  2925.635562: xhci_dbc_queue_request:
> 	bulk-in: req 00000000b4ec77d7 length 0/1024 ==> -115
> 
> Lines 1 and 2 are Endpoints OUT and IN responses to receiving ClearFeature
> Halt requests.

Stall errors (Line 1 and 2) events are probably issued already when endpoint halts,
before the host responds with a ClearFeature(Halt) request.

> 
> Line 7 notifies of reception of 24 bytes packet.
> 
> Line 8 notifies of reception of 261 bytes packet
> 
> In Lines [9, 12] 261 bytes packet is being processed.
> 
> However 24 bytes packet gets dropped. The kernel log includes entry which
> is an indication of a packet drop:
> [  127.651845] xhci_hcd 0000:00:0d.0: no matched request
> 
> This fix adds correct handling of ClearFeature Halt requests
> by restarting an endpoint which received the request.
> 
> Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
> Signed-off-by: Łukasz Bartosik <ukaszb@chromium.org>
> ---
> Changes v2->v1:
> - Documented both xhci_dbc_flush_single_request and
> xhci_dbc_flush_endpoint_requests functions.
> ---
>   drivers/usb/host/xhci-dbgcap.c | 48 +++++++++++++++++++++++++++-------
>   drivers/usb/host/xhci-dbgtty.c |  5 ++++
>   2 files changed, 43 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
> index 872d9cddbcef..36ec3242e096 100644
> --- a/drivers/usb/host/xhci-dbgcap.c
> +++ b/drivers/usb/host/xhci-dbgcap.c
> @@ -173,7 +173,17 @@ static void xhci_dbc_giveback(struct dbc_request *req, int status)
>   	spin_lock(&dbc->lock);
>   }
>   
> -static void xhci_dbc_flush_single_request(struct dbc_request *req)
> +/**
> + * xhci_dbc_flush_single_request - flushes single request
> + * @req: request to flush
> + * @status: 0 or -ERESTART - after the request is flushed it will be queued
> + *          back to the endpoint
> + *
> + *          -ESHUTDOWN - after the request is flushed it won't be queued back
> + *          to the endpoint and if it was last endpoint's request the endpoint
> + *          will stay shut
> + */

Hmm, I need to dig into this.

I don't think we should push this problem up to the request completion handler.
Maybe we are flushing requests that should not be flushed?

Do you have an easy way to reproduce the stall error case?

Thanks
-Mathias
Łukasz Bartosik July 30, 2024, 12:17 a.m. UTC | #2
On Mon, Jul 29, 2024 at 4:11 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> Hi
>
> On 25.7.2024 10.48, Łukasz Bartosik wrote:
> > DbC driver does not handle ClearFeature Halt requests correctly
> > which in turn may lead to dropping packets on the receive path.
>
> Nice catch.
> Looks like a halted endpoint is treated almost as a disconnect.
>
> >
> > Below is a trace capture where due to incorrect handling of
> > ClearFeature Halt packet gets dropped on the receive path.
> >
> > /sys/kernel/debug/tracing # cat trace
> > 1) kworker/10:3-514   [010] d..1.  2925.601843: xhci_dbc_handle_event:
> >       EVENT: TRB 000000019588c0e0 status 'Stall Error' len 0 slot 1 ep 2
> >       type 'Transfer Event' flags e:C
> >
> > 2) kworker/10:3-514   [010] d..1.  2925.613285: xhci_dbc_handle_event:
> >       EVENT: TRB 000000019588bc80 status 'Stall Error' len 1024 slot 1
> >       ep 3 type 'Transfer Event' flags e:C
> >
> > 3) kworker/10:3-514   [010] d..1.  2925.619024: xhci_dbc_handle_transfer:
> >       BULK: Buffer 0000000244b49800 length 1024 TD size 0 intr 0 type
> >       'Normal' flags b:i:I:c:s:i:e:C
> >
> > 4) kworker/10:3-514   [010] d..1.  2925.619025: xhci_dbc_giveback_request:
> >       bulk-in: req 00000000a70b5ad2 length 0/1024 ==> -6
> >
> > 5) kworker/10:3-514   [010] dNs2.  2925.623820: xhci_dbc_gadget_ep_queue:
> >       BULK: Buffer 0000000244b49800 length 1024 TD size 0 intr 0 type
> >       'Normal' flags b:i:I:c:s:i:e:c
> >
> > 6) kworker/10:3-514   [010] dNs1.  2925.623823: xhci_dbc_queue_request:
> >       bulk-in: req 00000000a70b5ad2 length 0/1024 ==> -115
> >
> > 7) kworker/10:3-514   [010] d..1.  2925.629865: xhci_dbc_handle_event:
> >       EVENT: TRB 000000019588bc80 status 'Short Packet' len 1000 slot 1
> >       ep 3 type 'Transfer Event' flags e:C
> >
> > 8) kworker/10:3-514   [010] d..1.  2925.635540: xhci_dbc_handle_event:
> >       EVENT: TRB 000000019588bc90 status 'Short Packet' len 763 slot 1
> >       ep 3 type 'Transfer Event' flags e:C
> >
> > 9) kworker/10:3-514   [010] d..1.  2925.635540: xhci_dbc_handle_transfer:
> >       BULK: Buffer 0000000244b49c00 length 1024 TD size 0 intr 0 type
> >       'Normal' flags b:i:I:c:s:i:e:C
> >
> > 10) kworker/10:3-514  [010] d..1.  2925.635541: xhci_dbc_giveback_request:
> >       bulk-in: req 00000000b4ec77d7 length 261/1024 ==> 0
> >
> > 11) kworker/10:3-514  [010] dNs2.  2925.635561: xhci_dbc_gadget_ep_queue:
> >       BULK: Buffer 0000000244b49c00 length 1024 TD size 0 intr 0 type
> >       'Normal' flags b:i:I:c:s:i:e:c
> >
> > 12) kworker/10:3-514  [010] dNs1.  2925.635562: xhci_dbc_queue_request:
> >       bulk-in: req 00000000b4ec77d7 length 0/1024 ==> -115
> >
> > Lines 1 and 2 are Endpoints OUT and IN responses to receiving ClearFeature
> > Halt requests.
>
> Stall errors (Line 1 and 2) events are probably issued already when endpoint halts,
> before the host responds with a ClearFeature(Halt) request.
>

I observed these stall errors are the result of calling
libusb_clear_halt() function which results
in sending ClearFeature(Halt) requests to a device. I should have
mentioned it in the commit
message because this is a vital piece of information.

> >
> > Line 7 notifies of reception of 24 bytes packet.
> >
> > Line 8 notifies of reception of 261 bytes packet
> >
> > In Lines [9, 12] 261 bytes packet is being processed.
> >
> > However 24 bytes packet gets dropped. The kernel log includes entry which
> > is an indication of a packet drop:
> > [  127.651845] xhci_hcd 0000:00:0d.0: no matched request
> >
> > This fix adds correct handling of ClearFeature Halt requests
> > by restarting an endpoint which received the request.
> >
> > Fixes: dfba2174dc42 ("usb: xhci: Add DbC support in xHCI driver")
> > Signed-off-by: Łukasz Bartosik <ukaszb@chromium.org>
> > ---
> > Changes v2->v1:
> > - Documented both xhci_dbc_flush_single_request and
> > xhci_dbc_flush_endpoint_requests functions.
> > ---
> >   drivers/usb/host/xhci-dbgcap.c | 48 +++++++++++++++++++++++++++-------
> >   drivers/usb/host/xhci-dbgtty.c |  5 ++++
> >   2 files changed, 43 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
> > index 872d9cddbcef..36ec3242e096 100644
> > --- a/drivers/usb/host/xhci-dbgcap.c
> > +++ b/drivers/usb/host/xhci-dbgcap.c
> > @@ -173,7 +173,17 @@ static void xhci_dbc_giveback(struct dbc_request *req, int status)
> >       spin_lock(&dbc->lock);
> >   }
> >
> > -static void xhci_dbc_flush_single_request(struct dbc_request *req)
> > +/**
> > + * xhci_dbc_flush_single_request - flushes single request
> > + * @req: request to flush
> > + * @status: 0 or -ERESTART - after the request is flushed it will be queued
> > + *          back to the endpoint
> > + *
> > + *          -ESHUTDOWN - after the request is flushed it won't be queued back
> > + *          to the endpoint and if it was last endpoint's request the endpoint
> > + *          will stay shut
> > + */
>
> Hmm, I need to dig into this.
>
> I don't think we should push this problem up to the request completion handler.
> Maybe we are flushing requests that should not be flushed?
>

Section 7.6.4.3 "Halted DbC Endpoints" in Intel's xHCI specification
says the endpoint can
be halted by HW in case of error. Also it can be halted by software
through HIT or HOT flags for DbC.
I wonder how to recover properly from the Halted state caused by HW
error ? Does it make sense to continue with
the requests or just restart the endpoint (flush all requests) as this
patch does ?

> Do you have an easy way to reproduce the stall error case?
>

Yes I can easily reproduce the case with the stall errors.
Would you like me to run any specific scenarios ?

Thanks,
Lukasz

> Thanks
> -Mathias
Mathias Nyman July 30, 2024, 3:47 p.m. UTC | #3
On 30.7.2024 3.17, Łukasz Bartosik wrote:
> On Mon, Jul 29, 2024 at 4:11 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>>
>> Hi
>>
>> On 25.7.2024 10.48, Łukasz Bartosik wrote:
>>> DbC driver does not handle ClearFeature Halt requests correctly
>>> which in turn may lead to dropping packets on the receive path.
>>
>> Nice catch.
>> Looks like a halted endpoint is treated almost as a disconnect.
>>

...

>>
>> Hmm, I need to dig into this.
>>
>> I don't think we should push this problem up to the request completion handler.
>> Maybe we are flushing requests that should not be flushed?
>>
> 
> Section 7.6.4.3 "Halted DbC Endpoints" in Intel's xHCI specification
> says the endpoint can
> be halted by HW in case of error. Also it can be halted by software
> through HIT or HOT flags for DbC.
> I wonder how to recover properly from the Halted state caused by HW
> error ? Does it make sense to continue with
> the requests or just restart the endpoint (flush all requests) as this
> patch does ?

DbC should respond with STALL packets to host if HIT or HOT is set.
Host side should react to this by sending a ClearFeature(HALT) request to DbC,
which should clear the halted endpoint and HIT/HOT flags.

I think we may lose more data in DbC bulk-out stall cases (data from DbC to host)
when flushing the requests.
Data is copied from kfifo to the requests while queuing them, if we then flush them
we will never send that data to host.

> 
>> Do you have an easy way to reproduce the stall error case?
>>
> 
> Yes I can easily reproduce the case with the stall errors.
> Would you like me to run any specific scenarios ?

I pushed my thoughts to a "fix_dbc_halted_ep" branch, it compiles but is not complete.
It mostly focuses on getting the STALL case for bulk-in working which this report was
about.

I think the code itself best describes what I'm trying to do.

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_dbc_halted_ep
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=8532b621314e93336535528d5d45e41974c75e01
  
If you can try it out it would be great.

Thanks
-Mathias
Łukasz Bartosik July 31, 2024, 12:28 p.m. UTC | #4
On Tue, Jul 30, 2024 at 5:45 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> On 30.7.2024 3.17, Łukasz Bartosik wrote:
> > On Mon, Jul 29, 2024 at 4:11 PM Mathias Nyman
> > <mathias.nyman@linux.intel.com> wrote:
> >>
> >> Hi
> >>
> >> On 25.7.2024 10.48, Łukasz Bartosik wrote:
> >>> DbC driver does not handle ClearFeature Halt requests correctly
> >>> which in turn may lead to dropping packets on the receive path.
> >>
> >> Nice catch.
> >> Looks like a halted endpoint is treated almost as a disconnect.
> >>
>
> ...
>

Sorry I didn't get that ?

> >>
> >> Hmm, I need to dig into this.
> >>
> >> I don't think we should push this problem up to the request completion handler.
> >> Maybe we are flushing requests that should not be flushed?
> >>
> >
> > Section 7.6.4.3 "Halted DbC Endpoints" in Intel's xHCI specification
> > says the endpoint can
> > be halted by HW in case of error. Also it can be halted by software
> > through HIT or HOT flags for DbC.
> > I wonder how to recover properly from the Halted state caused by HW
> > error ? Does it make sense to continue with
> > the requests or just restart the endpoint (flush all requests) as this
> > patch does ?
>
> DbC should respond with STALL packets to host if HIT or HOT is set.
> Host side should react to this by sending a ClearFeature(HALT) request to DbC,
> which should clear the halted endpoint and HIT/HOT flags.
>

Based on that I wonder when the DbC endpoint is not halted and it
receives the ClearFeature(Halt) request
whether this is correct behavior for the DbC endpoint to report a
STALL error in such a case ?

> I think we may lose more data in DbC bulk-out stall cases (data from DbC to host)
> when flushing the requests.
> Data is copied from kfifo to the requests while queuing them, if we then flush them
> we will never send that data to host.
>
> >
> >> Do you have an easy way to reproduce the stall error case?
> >>
> >
> > Yes I can easily reproduce the case with the stall errors.
> > Would you like me to run any specific scenarios ?
>
> I pushed my thoughts to a "fix_dbc_halted_ep" branch, it compiles but is not complete.
> It mostly focuses on getting the STALL case for bulk-in working which this report was
> about.
>
> I think the code itself best describes what I'm trying to do.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_dbc_halted_ep
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=8532b621314e93336535528d5d45e41974c75e01
>
> If you can try it out it would be great.
>

Sure I will test your patch and get back with the result.

Thanks,
Lukasz

> Thanks
> -Mathias
>
Mathias Nyman July 31, 2024, 12:49 p.m. UTC | #5
On 31.7.2024 15.28, Łukasz Bartosik wrote:
> On Tue, Jul 30, 2024 at 5:45 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>>
>> On 30.7.2024 3.17, Łukasz Bartosik wrote:
>>> On Mon, Jul 29, 2024 at 4:11 PM Mathias Nyman
>>> <mathias.nyman@linux.intel.com> wrote:
>>>>
>>>> Hi
>>>>
>>>> On 25.7.2024 10.48, Łukasz Bartosik wrote:
>>>>> DbC driver does not handle ClearFeature Halt requests correctly
>>>>> which in turn may lead to dropping packets on the receive path.
>>>>
>>>> Nice catch.
>>>> Looks like a halted endpoint is treated almost as a disconnect.
>>>>
>>
>>
>> I pushed my thoughts to a "fix_dbc_halted_ep" branch, it compiles but is not complete.
>> It mostly focuses on getting the STALL case for bulk-in working which this report was
>> about.
>>
>> I think the code itself best describes what I'm trying to do.
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_dbc_halted_ep
>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=8532b621314e93336535528d5d45e41974c75e01
>>
>> If you can try it out it would be great.
>>
> 
> Sure I will test your patch and get back with the result.

Thanks, that patch was missing a "ctrl = readl(&dbc->regs->control);" line

It's now fixed here:
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=cf99b473a1477c1b3510af0021877197a039c43f

Can you try that instead

Thanks
Mathias
Łukasz Bartosik July 31, 2024, 1:42 p.m. UTC | #6
On Wed, Jul 31, 2024 at 2:47 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> On 31.7.2024 15.28, Łukasz Bartosik wrote:
> > On Tue, Jul 30, 2024 at 5:45 PM Mathias Nyman
> > <mathias.nyman@linux.intel.com> wrote:
> >>
> >> On 30.7.2024 3.17, Łukasz Bartosik wrote:
> >>> On Mon, Jul 29, 2024 at 4:11 PM Mathias Nyman
> >>> <mathias.nyman@linux.intel.com> wrote:
> >>>>
> >>>> Hi
> >>>>
> >>>> On 25.7.2024 10.48, Łukasz Bartosik wrote:
> >>>>> DbC driver does not handle ClearFeature Halt requests correctly
> >>>>> which in turn may lead to dropping packets on the receive path.
> >>>>
> >>>> Nice catch.
> >>>> Looks like a halted endpoint is treated almost as a disconnect.
> >>>>
> >>
> >>
> >> I pushed my thoughts to a "fix_dbc_halted_ep" branch, it compiles but is not complete.
> >> It mostly focuses on getting the STALL case for bulk-in working which this report was
> >> about.
> >>
> >> I think the code itself best describes what I'm trying to do.
> >>
> >> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_dbc_halted_ep
> >> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=8532b621314e93336535528d5d45e41974c75e01
> >>
> >> If you can try it out it would be great.
> >>
> >
> > Sure I will test your patch and get back with the result.
>
> Thanks, that patch was missing a "ctrl = readl(&dbc->regs->control);" line
>
> It's now fixed here:
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=cf99b473a1477c1b3510af0021877197a039c43f
>
> Can you try that instead
>

I will. Thanks for the update.

Łukasz

> Thanks
> Mathias
>
Łukasz Bartosik Aug. 1, 2024, 3:02 p.m. UTC | #7
On Wed, Jul 31, 2024 at 3:42 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
>
> On Wed, Jul 31, 2024 at 2:47 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
> >
> > On 31.7.2024 15.28, Łukasz Bartosik wrote:
> > > On Tue, Jul 30, 2024 at 5:45 PM Mathias Nyman
> > > <mathias.nyman@linux.intel.com> wrote:
> > >>
> > >> On 30.7.2024 3.17, Łukasz Bartosik wrote:
> > >>> On Mon, Jul 29, 2024 at 4:11 PM Mathias Nyman
> > >>> <mathias.nyman@linux.intel.com> wrote:
> > >>>>
> > >>>> Hi
> > >>>>
> > >>>> On 25.7.2024 10.48, Łukasz Bartosik wrote:
> > >>>>> DbC driver does not handle ClearFeature Halt requests correctly
> > >>>>> which in turn may lead to dropping packets on the receive path.
> > >>>>
> > >>>> Nice catch.
> > >>>> Looks like a halted endpoint is treated almost as a disconnect.
> > >>>>
> > >>
> > >>
> > >> I pushed my thoughts to a "fix_dbc_halted_ep" branch, it compiles but is not complete.
> > >> It mostly focuses on getting the STALL case for bulk-in working which this report was
> > >> about.
> > >>
> > >> I think the code itself best describes what I'm trying to do.
> > >>
> > >> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_dbc_halted_ep
> > >> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=8532b621314e93336535528d5d45e41974c75e01
> > >>
> > >> If you can try it out it would be great.
> > >>
> > >
> > > Sure I will test your patch and get back with the result.
> >
> > Thanks, that patch was missing a "ctrl = readl(&dbc->regs->control);" line
> >
> > It's now fixed here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=cf99b473a1477c1b3510af0021877197a039c43f
> >
> > Can you try that instead
> >
>
> I will. Thanks for the update.
>
> Łukasz
>

Hi Matthias,

I ran the test with your fix cf99b473a1477c1b3510af0021877197a039c43f
however the issue still exists.

Trace capture:
# tracer: nop
#
# entries-in-buffer/entries-written: 12/12   #P:12
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
     kworker/7:1-418     [007] d..1.   103.737275:
xhci_dbc_handle_event: EVENT: TRB 00000001943ae000 status 'Stall
Error' len 0 slot 1 ep 2 type 'Transfer Event' flags e:C
     kworker/7:1-418     [007] d..1.   103.742975:
xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C
     kworker/7:1-418     [007] d..1.   103.742976:
xhci_dbc_handle_transfer: BULK: Buffer 000000017988e400 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
     kworker/7:1-418     [007] d..1.   103.753393:
xhci_dbc_giveback_request: bulk-in: req 00000000f2919fbe length 0/1024
==> 0
     kworker/7:1-418     [007] dNs2.   103.753407:
xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017988e400 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
     kworker/7:1-418     [007] dNs1.   103.753409:
xhci_dbc_queue_request: bulk-in: req 00000000f2919fbe length 0/1024
==> -115
     kworker/7:1-418     [007] d..1.   103.767607:
xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C
     kworker/7:1-418     [007] d..1.   103.773260:
xhci_dbc_handle_event: EVENT: TRB 00000001943ad010 status 'Short
Packet' len 763 slot 1 ep 3 type 'Transfer Event' flags e:C
     kworker/7:1-418     [007] d..1.   103.773260:
xhci_dbc_handle_transfer: BULK: Buffer 000000017988e800 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
     kworker/7:1-418     [007] d..1.   103.773261:
xhci_dbc_giveback_request: bulk-in: req 00000000eb22d5f2 length
261/1024 ==> 0
     kworker/7:1-418     [007] dNs2.   103.773281:
xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017988e800 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
     kworker/7:1-418     [007] dNs1.   103.773283:
xhci_dbc_queue_request: bulk-in: req 00000000eb22d5f2 length 0/1024
==> -115

A 24 bytes packet is dropped which is confirmed in kernel logs by "no
matched request" entries.

Kernel logs related to DbC:
[   55.569943] xhci_hcd 0000:00:0d.0: DbC connected
[   55.861946] xhci_hcd 0000:00:0d.0: DbC configured
[  103.702190] xhci_hcd 0000:00:0d.0: no matched request
[  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB
1943ad000, remaining 1024, ep deq 1943ad001
[  103.726161] xhci_hcd 0000:00:0d.0: DbC Endpoint halt cleared
[  103.732517] xhci_hcd 0000:00:0d.0: no matched request


> DbC should respond with STALL packets to host if HIT or HOT is set.
> Host side should react to this by sending a ClearFeature(HALT) request to DbC,
> which should clear the halted endpoint and HIT/HOT flags.
>

I would like to ask you about it again because I wonder when the DbC
endpoint is not halted and it
receives the ClearFeature(Halt) request whether this is correct
behavior for the DbC endpoint to report a
STALL error in such a case ?

Thanks,
Lukasz

> > Thanks
> > Mathias
> >
Łukasz Bartosik Aug. 3, 2024, 9:51 p.m. UTC | #8
On Thu, Aug 1, 2024 at 5:02 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
>
> On Wed, Jul 31, 2024 at 3:42 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
> >
> > On Wed, Jul 31, 2024 at 2:47 PM Mathias Nyman
> > <mathias.nyman@linux.intel.com> wrote:
> > >
> > > On 31.7.2024 15.28, Łukasz Bartosik wrote:
> > > > On Tue, Jul 30, 2024 at 5:45 PM Mathias Nyman
> > > > <mathias.nyman@linux.intel.com> wrote:
> > > >>
> > > >> On 30.7.2024 3.17, Łukasz Bartosik wrote:
> > > >>> On Mon, Jul 29, 2024 at 4:11 PM Mathias Nyman
> > > >>> <mathias.nyman@linux.intel.com> wrote:
> > > >>>>
> > > >>>> Hi
> > > >>>>
> > > >>>> On 25.7.2024 10.48, Łukasz Bartosik wrote:
> > > >>>>> DbC driver does not handle ClearFeature Halt requests correctly
> > > >>>>> which in turn may lead to dropping packets on the receive path.
> > > >>>>
> > > >>>> Nice catch.
> > > >>>> Looks like a halted endpoint is treated almost as a disconnect.
> > > >>>>
> > > >>
> > > >>
> > > >> I pushed my thoughts to a "fix_dbc_halted_ep" branch, it compiles but is not complete.
> > > >> It mostly focuses on getting the STALL case for bulk-in working which this report was
> > > >> about.
> > > >>
> > > >> I think the code itself best describes what I'm trying to do.
> > > >>
> > > >> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_dbc_halted_ep
> > > >> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=8532b621314e93336535528d5d45e41974c75e01
> > > >>
> > > >> If you can try it out it would be great.
> > > >>
> > > >
> > > > Sure I will test your patch and get back with the result.
> > >
> > > Thanks, that patch was missing a "ctrl = readl(&dbc->regs->control);" line
> > >
> > > It's now fixed here:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=cf99b473a1477c1b3510af0021877197a039c43f
> > >
> > > Can you try that instead
> > >
> >
> > I will. Thanks for the update.
> >
> > Łukasz
> >
>
> Hi Matthias,
>
> I ran the test with your fix cf99b473a1477c1b3510af0021877197a039c43f
> however the issue still exists.
>
> Trace capture:
> # tracer: nop
> #
> # entries-in-buffer/entries-written: 12/12   #P:12
> #
> #                                _-----=> irqs-off/BH-disabled
> #                               / _----=> need-resched
> #                              | / _---=> hardirq/softirq
> #                              || / _--=> preempt-depth
> #                              ||| / _-=> migrate-disable
> #                              |||| /     delay
> #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
> #              | |         |   |||||     |         |
>      kworker/7:1-418     [007] d..1.   103.737275:
> xhci_dbc_handle_event: EVENT: TRB 00000001943ae000 status 'Stall
> Error' len 0 slot 1 ep 2 type 'Transfer Event' flags e:C
>      kworker/7:1-418     [007] d..1.   103.742975:
> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
> Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C
>      kworker/7:1-418     [007] d..1.   103.742976:
> xhci_dbc_handle_transfer: BULK: Buffer 000000017988e400 length 1024 TD
> size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
>      kworker/7:1-418     [007] d..1.   103.753393:
> xhci_dbc_giveback_request: bulk-in: req 00000000f2919fbe length 0/1024
> ==> 0
>      kworker/7:1-418     [007] dNs2.   103.753407:
> xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017988e400 length 1024 TD
> size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
>      kworker/7:1-418     [007] dNs1.   103.753409:
> xhci_dbc_queue_request: bulk-in: req 00000000f2919fbe length 0/1024
> ==> -115
>      kworker/7:1-418     [007] d..1.   103.767607:
> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
> Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C
>      kworker/7:1-418     [007] d..1.   103.773260:
> xhci_dbc_handle_event: EVENT: TRB 00000001943ad010 status 'Short
> Packet' len 763 slot 1 ep 3 type 'Transfer Event' flags e:C
>      kworker/7:1-418     [007] d..1.   103.773260:
> xhci_dbc_handle_transfer: BULK: Buffer 000000017988e800 length 1024 TD
> size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
>      kworker/7:1-418     [007] d..1.   103.773261:
> xhci_dbc_giveback_request: bulk-in: req 00000000eb22d5f2 length
> 261/1024 ==> 0
>      kworker/7:1-418     [007] dNs2.   103.773281:
> xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017988e800 length 1024 TD
> size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
>      kworker/7:1-418     [007] dNs1.   103.773283:
> xhci_dbc_queue_request: bulk-in: req 00000000eb22d5f2 length 0/1024
> ==> -115
>
> A 24 bytes packet is dropped which is confirmed in kernel logs by "no
> matched request" entries.
>
> Kernel logs related to DbC:
> [   55.569943] xhci_hcd 0000:00:0d.0: DbC connected
> [   55.861946] xhci_hcd 0000:00:0d.0: DbC configured
> [  103.702190] xhci_hcd 0000:00:0d.0: no matched request
> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB
> 1943ad000, remaining 1024, ep deq 1943ad001
> [  103.726161] xhci_hcd 0000:00:0d.0: DbC Endpoint halt cleared
> [  103.732517] xhci_hcd 0000:00:0d.0: no matched request
>
>
> > DbC should respond with STALL packets to host if HIT or HOT is set.
> > Host side should react to this by sending a ClearFeature(HALT) request to DbC,
> > which should clear the halted endpoint and HIT/HOT flags.
> >
>
> I would like to ask you about it again because I wonder when the DbC
> endpoint is not halted and it
> receives the ClearFeature(Halt) request whether this is correct
> behavior for the DbC endpoint to report a
> STALL error in such a case ?
>
> Thanks,
> Lukasz
>

Hi Mathias,

One more thing which I would like to add to the previous is the observation
which I made during debugging the issue. Looking at the above trace
there is stall
on IN endpoint:
xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C

And then when 24 bytes packet arrives
xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C

the same TRB is being used 00000001943ad000 but the DbC driver already
moved forward and the IN endpoint's pending list does not include a
dbc_request pointing
to the TRB 00000001943ad000 which results in "no matched request"
error and dropping
of the packet.

Thanks,
Lukasz

> > > Thanks
> > > Mathias
> > >
Mathias Nyman Aug. 5, 2024, 6:49 a.m. UTC | #9
On 4.8.2024 0.51, Łukasz Bartosik wrote:
> On Thu, Aug 1, 2024 at 5:02 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
>>
>> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB
>> 1943ad000, remaining 1024, ep deq 1943ad001
>>
>>
>>> DbC should respond with STALL packets to host if HIT or HOT is set.
>>> Host side should react to this by sending a ClearFeature(HALT) request to DbC,
>>> which should clear the halted endpoint and HIT/HOT flags.
>>>
>>
>> I would like to ask you about it again because I wonder when the DbC
>> endpoint is not halted and it
>> receives the ClearFeature(Halt) request whether this is correct
>> behavior for the DbC endpoint to report a
>> STALL error in such a case ?
>>

Ok, Setup was unclear to me, I assumed the STALL transfer error was due to an
actual transfer issue on a bulk endpoint.

I don't think xHCI DbC should send STALL error transfer events for non existing
bulk transfers as a response to ClearFeature(ENDPOINT_HALT) control request.

Especially as it seems not a single byte was transferred on either bulk endpoint.

But we need to handle it in the driver properly


>> Thanks,
>> Lukasz
>>
> 
> Hi Mathias,
> 
> One more thing which I would like to add to the previous is the observation
> which I made during debugging the issue. Looking at the above trace
> there is stall
> on IN endpoint:
> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
> Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C
> 
> And then when 24 bytes packet arrives
> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
> Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C
> 
> the same TRB is being used 00000001943ad000 but the DbC driver already
> moved forward and the IN endpoint's pending list does not include a
> dbc_request pointing
> to the TRB 00000001943ad000 which results in "no matched request"
> error and dropping
> of the packet.
> 

I noticed yes, this is why the patch compared the TRB pointed to by the
STALL transfer event with the one in the endpoint context.
If they match the patch attempts to turn that TRB to no-op, forcing xHC
hardware to move to the next TRB.

The check did however not work as it didn't mask out the cycle bit.
Lops show we compare  1943ad000 with 1943ad001

[  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB 1943ad000, remaining 1024, ep deq 1943ad001

Code in patch:
+if (ep_ctx->deq == event->trans_event.buffer) { // FIXME do we need to worry about cycle bit?
+			dev_warn(dbc->dev, "Stall error TRB matches ep_ctx->deq, clear it\n");
+			trb_to_noop(req->trb);

With the new information that the STALL transfer event can be completely
spurious and not related to any actual bytes being transmitted on bulk
endpoints I think we shouldn't give back he transfer req in those cases.

I'll make another patch

Thanks
Mathias
Mathias Nyman Aug. 5, 2024, 3:52 p.m. UTC | #10
On 5.8.2024 9.49, Mathias Nyman wrote:
> On 4.8.2024 0.51, Łukasz Bartosik wrote:
>> On Thu, Aug 1, 2024 at 5:02 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
>>>
>>> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB
>>> 1943ad000, remaining 1024, ep deq 1943ad001
>>>
>>>
>>>> DbC should respond with STALL packets to host if HIT or HOT is set.
>>>> Host side should react to this by sending a ClearFeature(HALT) request to DbC,
>>>> which should clear the halted endpoint and HIT/HOT flags.
>>>>
>>>
>>> I would like to ask you about it again because I wonder when the DbC
>>> endpoint is not halted and it
>>> receives the ClearFeature(Halt) request whether this is correct
>>> behavior for the DbC endpoint to report a
>>> STALL error in such a case ?
>>>
> 
> Ok, Setup was unclear to me, I assumed the STALL transfer error was due to an
> actual transfer issue on a bulk endpoint.
> 
> I don't think xHCI DbC should send STALL error transfer events for non existing
> bulk transfers as a response to ClearFeature(ENDPOINT_HALT) control request.
> 
> Especially as it seems not a single byte was transferred on either bulk endpoint.
> 
> But we need to handle it in the driver properly
> 
> 
>>> Thanks,
>>> Lukasz
>>>
>>
>> Hi Mathias,
>>
>> One more thing which I would like to add to the previous is the observation
>> which I made during debugging the issue. Looking at the above trace
>> there is stall
>> on IN endpoint:
>> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
>> Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C
>>
>> And then when 24 bytes packet arrives
>> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
>> Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C
>>
>> the same TRB is being used 00000001943ad000 but the DbC driver already
>> moved forward and the IN endpoint's pending list does not include a
>> dbc_request pointing
>> to the TRB 00000001943ad000 which results in "no matched request"
>> error and dropping
>> of the packet.
>>
> 
> I noticed yes, this is why the patch compared the TRB pointed to by the
> STALL transfer event with the one in the endpoint context.
> If they match the patch attempts to turn that TRB to no-op, forcing xHC
> hardware to move to the next TRB.
> 
> The check did however not work as it didn't mask out the cycle bit.
> Lops show we compare  1943ad000 with 1943ad001
> 
> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB 1943ad000, remaining 1024, ep deq 1943ad001
> 
> Code in patch:
> +if (ep_ctx->deq == event->trans_event.buffer) { // FIXME do we need to worry about cycle bit?
> +            dev_warn(dbc->dev, "Stall error TRB matches ep_ctx->deq, clear it\n");
> +            trb_to_noop(req->trb);
> 
> With the new information that the STALL transfer event can be completely
> spurious and not related to any actual bytes being transmitted on bulk
> endpoints I think we shouldn't give back he transfer req in those cases.
> 
> I'll make another patch

New patch ready for testing:
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=96cd909cc8115b3d2dff1bdcf265171bb0fdab18

Thanks
Mathias
Łukasz Bartosik Aug. 22, 2024, 3:35 a.m. UTC | #11
On Mon, Aug 5, 2024 at 5:50 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> On 5.8.2024 9.49, Mathias Nyman wrote:
> > On 4.8.2024 0.51, Łukasz Bartosik wrote:
> >> On Thu, Aug 1, 2024 at 5:02 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
> >>>
> >>> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB
> >>> 1943ad000, remaining 1024, ep deq 1943ad001
> >>>
> >>>
> >>>> DbC should respond with STALL packets to host if HIT or HOT is set.
> >>>> Host side should react to this by sending a ClearFeature(HALT) request to DbC,
> >>>> which should clear the halted endpoint and HIT/HOT flags.
> >>>>
> >>>
> >>> I would like to ask you about it again because I wonder when the DbC
> >>> endpoint is not halted and it
> >>> receives the ClearFeature(Halt) request whether this is correct
> >>> behavior for the DbC endpoint to report a
> >>> STALL error in such a case ?
> >>>
> >
> > Ok, Setup was unclear to me, I assumed the STALL transfer error was due to an
> > actual transfer issue on a bulk endpoint.
> >
> > I don't think xHCI DbC should send STALL error transfer events for non existing
> > bulk transfers as a response to ClearFeature(ENDPOINT_HALT) control request.
> >
> > Especially as it seems not a single byte was transferred on either bulk endpoint.
> >
> > But we need to handle it in the driver properly
> >
> >
> >>> Thanks,
> >>> Lukasz
> >>>
> >>
> >> Hi Mathias,
> >>
> >> One more thing which I would like to add to the previous is the observation
> >> which I made during debugging the issue. Looking at the above trace
> >> there is stall
> >> on IN endpoint:
> >> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
> >> Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C
> >>
> >> And then when 24 bytes packet arrives
> >> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
> >> Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C
> >>
> >> the same TRB is being used 00000001943ad000 but the DbC driver already
> >> moved forward and the IN endpoint's pending list does not include a
> >> dbc_request pointing
> >> to the TRB 00000001943ad000 which results in "no matched request"
> >> error and dropping
> >> of the packet.
> >>
> >
> > I noticed yes, this is why the patch compared the TRB pointed to by the
> > STALL transfer event with the one in the endpoint context.
> > If they match the patch attempts to turn that TRB to no-op, forcing xHC
> > hardware to move to the next TRB.
> >
> > The check did however not work as it didn't mask out the cycle bit.
> > Lops show we compare  1943ad000 with 1943ad001
> >
> > [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB 1943ad000, remaining 1024, ep deq 1943ad001
> >
> > Code in patch:
> > +if (ep_ctx->deq == event->trans_event.buffer) { // FIXME do we need to worry about cycle bit?
> > +            dev_warn(dbc->dev, "Stall error TRB matches ep_ctx->deq, clear it\n");
> > +            trb_to_noop(req->trb);
> >
> > With the new information that the STALL transfer event can be completely
> > spurious and not related to any actual bytes being transmitted on bulk
> > endpoints I think we shouldn't give back he transfer req in those cases.
> >
> > I'll make another patch
>
> New patch ready for testing:
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=96cd909cc8115b3d2dff1bdcf265171bb0fdab18
>

Hi Mathias,

I finally tested your latest patch 96cd909cc8115b3d2dff1bdcf265171bb0fdab18.
It resolves the issue.

Here are trace logs from one of the runs:
# entries-in-buffer/entries-written: 23/23   #P:12
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
     kworker/4:0-30      [004] d..1.   427.334578:
xhci_dbc_handle_event: EVENT: TRB 000000019dbcaaa0 status 'Stall
Error' len 0 slot 1 ep 2 type 'Transfer Event' flags e:c
     kworker/4:0-30      [004] d..1.   427.340272:
xhci_dbc_handle_event: EVENT: TRB 000000019aef5960 status 'Stall
Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:c
     kworker/4:0-30      [004] d..1.   427.340273:
xhci_dbc_handle_transfer: BULK: Buffer 000000017255a000 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
     kworker/4:0-30      [004] d..1.   427.382584:
xhci_dbc_handle_event: EVENT: TRB 000000019aef5960 status 'Short
Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:c
     kworker/4:0-30      [004] d..1.   427.382586:
xhci_dbc_handle_transfer: BULK: Buffer 000000017255a000 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
     kworker/4:0-30      [004] d..1.   427.382587:
xhci_dbc_giveback_request: bulk-in: req 000000005c9bd064 length
24/1024 ==> -6
     kworker/4:0-30      [004] d..1.   427.382592:
xhci_dbc_handle_event: EVENT: TRB 000000019aef5970 status 'Short
Packet' len 763 slot 1 ep 3 type 'Transfer Event' flags e:c
     kworker/4:0-30      [004] d..1.   427.382592:
xhci_dbc_handle_transfer: BULK: Buffer 000000017255c000 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
     kworker/4:0-30      [004] d..1.   427.382592:
xhci_dbc_giveback_request: bulk-in: req 00000000f836aa56 length
261/1024 ==> 0
     ksoftirqd/4-29      [004] d.s1.   427.387396:
xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017255c000 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
     ksoftirqd/4-29      [004] d.s..   427.387397:
xhci_dbc_queue_request: bulk-in: req 00000000f836aa56 length 0/1024
==> -115
     ksoftirqd/4-29      [004] d.s1.   427.387397:
xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017255a000 length 1024 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
     ksoftirqd/4-29      [004] d.s..   427.387398:
xhci_dbc_queue_request: bulk-in: req 000000005c9bd064 length 0/1024
==> -115
       sock->usb-4089    [004] d..1.   427.388796:
xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017255d400 length 24 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
       sock->usb-4089    [004] d....   427.388801:
xhci_dbc_queue_request: bulk-out: req 00000000653a50a0 length 0/24 ==>
-115
       sock->usb-4089    [004] d..1.   427.388857:
xhci_dbc_gadget_ep_queue: BULK: Buffer 000000017255d000 length 371 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:c
       sock->usb-4089    [004] d....   427.388858:
xhci_dbc_queue_request: bulk-out: req 000000006de24797 length 0/371
==> -115
     kworker/4:0-30      [004] d..1.   427.388869:
xhci_dbc_handle_event: EVENT: TRB 000000019dbcaaa0 status 'Success'
len 0 slot 1 ep 2 type 'Transfer Event' flags e:c
     kworker/4:0-30      [004] d..1.   427.388870:
xhci_dbc_handle_transfer: BULK: Buffer 000000017255d400 length 24 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
     kworker/4:0-30      [004] d..1.   427.388870:
xhci_dbc_giveback_request: bulk-out: req 00000000653a50a0 length 24/24
==> 0
     kworker/4:0-30      [004] d..1.   427.394600:
xhci_dbc_handle_event: EVENT: TRB 000000019dbcaab0 status 'Success'
len 0 slot 1 ep 2 type 'Transfer Event' flags e:c
     kworker/4:0-30      [004] d..1.   427.394601:
xhci_dbc_handle_transfer: BULK: Buffer 000000017255d000 length 371 TD
size 0 intr 0 type 'Normal' flags b:i:I:c:s:i:e:C
     kworker/4:0-30      [004] d..1.   427.394603:
xhci_dbc_giveback_request: bulk-out: req 000000006de24797 length
371/371 ==> 0

Short packet was processed instead of being dropped as it happened
previously without your fix.

Thanks,
Lukasz

> Thanks
> Mathias
Mathias Nyman Aug. 22, 2024, 12:50 p.m. UTC | #12
On 22.8.2024 6.35, Łukasz Bartosik wrote:
> On Mon, Aug 5, 2024 at 5:50 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>>
>> On 5.8.2024 9.49, Mathias Nyman wrote:
>>> On 4.8.2024 0.51, Łukasz Bartosik wrote:
>>>> On Thu, Aug 1, 2024 at 5:02 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
>>>>>
>>>>> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB
>>>>> 1943ad000, remaining 1024, ep deq 1943ad001
>>>>>
>>>>>
>>>>>> DbC should respond with STALL packets to host if HIT or HOT is set.
>>>>>> Host side should react to this by sending a ClearFeature(HALT) request to DbC,
>>>>>> which should clear the halted endpoint and HIT/HOT flags.
>>>>>>
>>>>>
>>>>> I would like to ask you about it again because I wonder when the DbC
>>>>> endpoint is not halted and it
>>>>> receives the ClearFeature(Halt) request whether this is correct
>>>>> behavior for the DbC endpoint to report a
>>>>> STALL error in such a case ?
>>>>>
>>>
>>> Ok, Setup was unclear to me, I assumed the STALL transfer error was due to an
>>> actual transfer issue on a bulk endpoint.
>>>
>>> I don't think xHCI DbC should send STALL error transfer events for non existing
>>> bulk transfers as a response to ClearFeature(ENDPOINT_HALT) control request.
>>>
>>> Especially as it seems not a single byte was transferred on either bulk endpoint.
>>>
>>> But we need to handle it in the driver properly
>>>
>>>
>>>>> Thanks,
>>>>> Lukasz
>>>>>
>>>>
>>>> Hi Mathias,
>>>>
>>>> One more thing which I would like to add to the previous is the observation
>>>> which I made during debugging the issue. Looking at the above trace
>>>> there is stall
>>>> on IN endpoint:
>>>> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
>>>> Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C
>>>>
>>>> And then when 24 bytes packet arrives
>>>> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
>>>> Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C
>>>>
>>>> the same TRB is being used 00000001943ad000 but the DbC driver already
>>>> moved forward and the IN endpoint's pending list does not include a
>>>> dbc_request pointing
>>>> to the TRB 00000001943ad000 which results in "no matched request"
>>>> error and dropping
>>>> of the packet.
>>>>
>>>
>>> I noticed yes, this is why the patch compared the TRB pointed to by the
>>> STALL transfer event with the one in the endpoint context.
>>> If they match the patch attempts to turn that TRB to no-op, forcing xHC
>>> hardware to move to the next TRB.
>>>
>>> The check did however not work as it didn't mask out the cycle bit.
>>> Lops show we compare  1943ad000 with 1943ad001
>>>
>>> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB 1943ad000, remaining 1024, ep deq 1943ad001
>>>
>>> Code in patch:
>>> +if (ep_ctx->deq == event->trans_event.buffer) { // FIXME do we need to worry about cycle bit?
>>> +            dev_warn(dbc->dev, "Stall error TRB matches ep_ctx->deq, clear it\n");
>>> +            trb_to_noop(req->trb);
>>>
>>> With the new information that the STALL transfer event can be completely
>>> spurious and not related to any actual bytes being transmitted on bulk
>>> endpoints I think we shouldn't give back he transfer req in those cases.
>>>
>>> I'll make another patch
>>
>> New patch ready for testing:
>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=96cd909cc8115b3d2dff1bdcf265171bb0fdab18
>>
> 
> Hi Mathias,
> 
> I finally tested your latest patch 96cd909cc8115b3d2dff1bdcf265171bb0fdab18.
> It resolves the issue.

Thanks for testing.

I did some minor additional cleanups to it.
I'll post it as a proper patch. can I ask you to give it one last run,
just to make sure everything still works.

Thanks
Mathias
Łukasz Bartosik Aug. 23, 2024, 10:06 a.m. UTC | #13
On Thu, Aug 22, 2024 at 2:48 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> On 22.8.2024 6.35, Łukasz Bartosik wrote:
> > On Mon, Aug 5, 2024 at 5:50 PM Mathias Nyman
> > <mathias.nyman@linux.intel.com> wrote:
> >>
> >> On 5.8.2024 9.49, Mathias Nyman wrote:
> >>> On 4.8.2024 0.51, Łukasz Bartosik wrote:
> >>>> On Thu, Aug 1, 2024 at 5:02 PM Łukasz Bartosik <ukaszb@chromium.org> wrote:
> >>>>>
> >>>>> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB
> >>>>> 1943ad000, remaining 1024, ep deq 1943ad001
> >>>>>
> >>>>>
> >>>>>> DbC should respond with STALL packets to host if HIT or HOT is set.
> >>>>>> Host side should react to this by sending a ClearFeature(HALT) request to DbC,
> >>>>>> which should clear the halted endpoint and HIT/HOT flags.
> >>>>>>
> >>>>>
> >>>>> I would like to ask you about it again because I wonder when the DbC
> >>>>> endpoint is not halted and it
> >>>>> receives the ClearFeature(Halt) request whether this is correct
> >>>>> behavior for the DbC endpoint to report a
> >>>>> STALL error in such a case ?
> >>>>>
> >>>
> >>> Ok, Setup was unclear to me, I assumed the STALL transfer error was due to an
> >>> actual transfer issue on a bulk endpoint.
> >>>
> >>> I don't think xHCI DbC should send STALL error transfer events for non existing
> >>> bulk transfers as a response to ClearFeature(ENDPOINT_HALT) control request.
> >>>
> >>> Especially as it seems not a single byte was transferred on either bulk endpoint.
> >>>
> >>> But we need to handle it in the driver properly
> >>>
> >>>
> >>>>> Thanks,
> >>>>> Lukasz
> >>>>>
> >>>>
> >>>> Hi Mathias,
> >>>>
> >>>> One more thing which I would like to add to the previous is the observation
> >>>> which I made during debugging the issue. Looking at the above trace
> >>>> there is stall
> >>>> on IN endpoint:
> >>>> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Stall
> >>>> Error' len 1024 slot 1 ep 3 type 'Transfer Event' flags e:C
> >>>>
> >>>> And then when 24 bytes packet arrives
> >>>> xhci_dbc_handle_event: EVENT: TRB 00000001943ad000 status 'Short
> >>>> Packet' len 1000 slot 1 ep 3 type 'Transfer Event' flags e:C
> >>>>
> >>>> the same TRB is being used 00000001943ad000 but the DbC driver already
> >>>> moved forward and the IN endpoint's pending list does not include a
> >>>> dbc_request pointing
> >>>> to the TRB 00000001943ad000 which results in "no matched request"
> >>>> error and dropping
> >>>> of the packet.
> >>>>
> >>>
> >>> I noticed yes, this is why the patch compared the TRB pointed to by the
> >>> STALL transfer event with the one in the endpoint context.
> >>> If they match the patch attempts to turn that TRB to no-op, forcing xHC
> >>> hardware to move to the next TRB.
> >>>
> >>> The check did however not work as it didn't mask out the cycle bit.
> >>> Lops show we compare  1943ad000 with 1943ad001
> >>>
> >>> [  103.707882] xhci_hcd 0000:00:0d.0: Stall error at bulk TRB 1943ad000, remaining 1024, ep deq 1943ad001
> >>>
> >>> Code in patch:
> >>> +if (ep_ctx->deq == event->trans_event.buffer) { // FIXME do we need to worry about cycle bit?
> >>> +            dev_warn(dbc->dev, "Stall error TRB matches ep_ctx->deq, clear it\n");
> >>> +            trb_to_noop(req->trb);
> >>>
> >>> With the new information that the STALL transfer event can be completely
> >>> spurious and not related to any actual bytes being transmitted on bulk
> >>> endpoints I think we shouldn't give back he transfer req in those cases.
> >>>
> >>> I'll make another patch
> >>
> >> New patch ready for testing:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/commit/?h=fix_dbc_halted_ep&id=96cd909cc8115b3d2dff1bdcf265171bb0fdab18
> >>
> >
> > Hi Mathias,
> >
> > I finally tested your latest patch 96cd909cc8115b3d2dff1bdcf265171bb0fdab18.
> > It resolves the issue.
>
> Thanks for testing.
>
> I did some minor additional cleanups to it.
> I'll post it as a proper patch. can I ask you to give it one last run,
> just to make sure everything still works.
>

Sure Mathias, I will test it. Please add me to CC when you submit the
final version.

Thanks,
Lukasz

> Thanks
> Mathias
>
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-dbgcap.c b/drivers/usb/host/xhci-dbgcap.c
index 872d9cddbcef..36ec3242e096 100644
--- a/drivers/usb/host/xhci-dbgcap.c
+++ b/drivers/usb/host/xhci-dbgcap.c
@@ -173,7 +173,17 @@  static void xhci_dbc_giveback(struct dbc_request *req, int status)
 	spin_lock(&dbc->lock);
 }
 
-static void xhci_dbc_flush_single_request(struct dbc_request *req)
+/**
+ * xhci_dbc_flush_single_request - flushes single request
+ * @req: request to flush
+ * @status: 0 or -ERESTART - after the request is flushed it will be queued
+ *          back to the endpoint
+ *
+ *          -ESHUTDOWN - after the request is flushed it won't be queued back
+ *          to the endpoint and if it was last endpoint's request the endpoint
+ *          will stay shut
+ */
+static void xhci_dbc_flush_single_request(struct dbc_request *req, int status)
 {
 	union xhci_trb	*trb = req->trb;
 
@@ -183,21 +193,36 @@  static void xhci_dbc_flush_single_request(struct dbc_request *req)
 	trb->generic.field[3]	&= cpu_to_le32(TRB_CYCLE);
 	trb->generic.field[3]	|= cpu_to_le32(TRB_TYPE(TRB_TR_NOOP));
 
-	xhci_dbc_giveback(req, -ESHUTDOWN);
+	xhci_dbc_giveback(req, status);
 }
 
-static void xhci_dbc_flush_endpoint_requests(struct dbc_ep *dep)
+/**
+ * xhci_dbc_flush_endpoint_requests - flushes endpoint's requests
+ * @dep: endpoint to flush requests
+ * @restart: true - after being flushed, the requests will be queued back
+ *           to the endpoint and its operation will resume
+ *
+ *           false - after flushing last endpoint's request the endpoint will
+ *           stay shut
+ */
+static void xhci_dbc_flush_endpoint_requests(struct dbc_ep *dep, bool restart)
 {
+	struct list_head	*list = &dep->list_pending;
 	struct dbc_request	*req, *tmp;
+	int			status = -ESHUTDOWN;
 
-	list_for_each_entry_safe(req, tmp, &dep->list_pending, list_pending)
-		xhci_dbc_flush_single_request(req);
+	list_for_each_entry_safe(req, tmp, list, list_pending) {
+		if (restart && list_is_last(&req->list_pending, list))
+			status = -ERESTART;
+
+		xhci_dbc_flush_single_request(req, status);
+	}
 }
 
 static void xhci_dbc_flush_requests(struct xhci_dbc *dbc)
 {
-	xhci_dbc_flush_endpoint_requests(&dbc->eps[BULK_OUT]);
-	xhci_dbc_flush_endpoint_requests(&dbc->eps[BULK_IN]);
+	xhci_dbc_flush_endpoint_requests(&dbc->eps[BULK_OUT], false);
+	xhci_dbc_flush_endpoint_requests(&dbc->eps[BULK_IN], false);
 }
 
 struct dbc_request *
@@ -718,10 +743,13 @@  static void dbc_handle_xfer_event(struct xhci_dbc *dbc, union xhci_trb *event)
 	case COMP_TRB_ERROR:
 	case COMP_BABBLE_DETECTED_ERROR:
 	case COMP_USB_TRANSACTION_ERROR:
-	case COMP_STALL_ERROR:
 		dev_warn(dbc->dev, "tx error %d detected\n", comp_code);
 		status = -comp_code;
 		break;
+	case COMP_STALL_ERROR:
+		/* Restart endpoint */
+		xhci_dbc_flush_endpoint_requests(dep, true);
+		return;
 	default:
 		dev_err(dbc->dev, "unknown tx error %d\n", comp_code);
 		status = -comp_code;
@@ -823,12 +851,12 @@  static enum evtreturn xhci_dbc_do_handle_events(struct xhci_dbc *dbc)
 
 			if (ctrl & DBC_CTRL_HALT_IN_TR) {
 				dep = get_in_ep(dbc);
-				xhci_dbc_flush_endpoint_requests(dep);
+				xhci_dbc_flush_endpoint_requests(dep, false);
 			}
 
 			if (ctrl & DBC_CTRL_HALT_OUT_TR) {
 				dep = get_out_ep(dbc);
-				xhci_dbc_flush_endpoint_requests(dep);
+				xhci_dbc_flush_endpoint_requests(dep, false);
 			}
 
 			return EVT_DONE;
diff --git a/drivers/usb/host/xhci-dbgtty.c b/drivers/usb/host/xhci-dbgtty.c
index b74e98e94393..bdf80aa2b28c 100644
--- a/drivers/usb/host/xhci-dbgtty.c
+++ b/drivers/usb/host/xhci-dbgtty.c
@@ -121,6 +121,7 @@  static void dbc_write_complete(struct xhci_dbc *dbc, struct dbc_request *req)
 	list_add(&req->list_pool, &port->write_pool);
 	switch (req->status) {
 	case 0:
+	case -ERESTART:
 		dbc_start_tx(port);
 		break;
 	case -ESHUTDOWN:
@@ -318,6 +319,10 @@  static void dbc_rx_push(struct tasklet_struct *t)
 		case -ESHUTDOWN:
 			disconnect = true;
 			break;
+		case -ERESTART:
+			disconnect = false;
+			req->actual = 0;
+			break;
 		default:
 			pr_warn("ttyDBC0: unexpected RX status %d\n",
 				req->status);