diff mbox series

[RTF,v3] xhci: process isoc TD properly when there was an error mid TD.

Message ID 20240119105835.2637358-1-mathias.nyman@linux.intel.com (mailing list archive)
State Accepted
Commit 5372c65e1311a16351ef03dd096ff576e6477674
Headers show
Series [RTF,v3] xhci: process isoc TD properly when there was an error mid TD. | expand

Commit Message

Mathias Nyman Jan. 19, 2024, 10:58 a.m. UTC
The last TRB of a isoc TD might not trigger an event if there was
an error event for a TRB mid TD. This is seen on a NEC Corporation
uPD720200 USB 3.0 Host

After an error mid a multi-TRB TD the xHC should according to xhci 4.9.1
generate events for passed TRBs with IOC flag set if it proceeds to the
next TD. This event is either a copy of the original error, or a
"success" transfer event.

If that event is missing then the driver and xHC host get out of sync as
the driver is still expecting a transfer event for that first TD, while
xHC host is already sending events for the next TD in the list.
This leads to
"Transfer event TRB DMA ptr not part of current TD" messages.

As a solution we tag the isoc TDs that get error events mid TD.
If an event doesn't match the first TD, then check if the tag is
set, and event points to the next TD.
In that case give back the fist TD and process the next TD normally

Make sure TD status and transferred length stay valid in both cases
with and without final TD completion event.

Reported-by: Michał Pecio <michal.pecio@gmail.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
  v2 -> v3
  - Rework and fix setting TD transfer lengt issue in v2
  - Minor commit message tuning 

  v1 -> v2:
  - Check for empty TD list, suggested by Michał
  - Don't overwrite error status, suggested by Michał
  - Add a debug print for missed TD completion event, suggested by Michał
  - Reword commit message and code comments
  - Set correct TD transfer length in isoc error case


drivers/usb/host/xhci-ring.c | 74 +++++++++++++++++++++++++++++-------
 drivers/usb/host/xhci.h      |  1 +
 2 files changed, 61 insertions(+), 14 deletions(-)

Comments

Michał Pecio Jan. 19, 2024, 9:54 p.m. UTC | #1
Usual tests passed (but only transaction errors are covered).


I noticed that with your new fix to the frame length bug, error_mid_td
always equals urb_length_set. So the new flag is perhaps not necessary
after all. The test in handle_tx_event() could be:

if (usb_pipeisoc(td->urb->pipe) && td->urb_length_set &&
    !list_is_last(...  ))

I tried it and it works just as well.


While disconnecting my test camera from the VIA, I got this:

[95511.647441] xhci_hcd 0000:03:00.0: finishing TD with status -75 (comp_code 3 error_mid_td 0)
[95511.647453] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1
[95511.647457] xhci_hcd 0000:03:00.0: Looking for event-dma 0000000135121710 trb-start 0000000135121720 trb-end 0000000135121720 seg-start 0000000135121000 seg-end 0000000135121ff0
... followed by some ordinary transaction errors 2 milliseconds later.

Apparently a babble error, and it seems to have generated a "success"
which the event handler tried to match with the next TD. So a mid TD
babble may need the same treatment, which is not surprising.

Unfortunately I didn't print enough debug info to confirm this with
certainty and it is absolutely unreproducible, I have no idea why it
happened.
Michał Pecio Jan. 22, 2024, 9:03 a.m. UTC | #2
> Apparently a babble error, and it seems to have generated a "success"
> which the event handler tried to match with the next TD. So a mid TD
> babble may need the same treatment, which is not surprising.

This is now confirmed and fixed here. The change is obvious enough:
        case COMP_ISOCH_BUFFER_OVERRUN:
        case COMP_BABBLE_DETECTED_ERROR:
+               error_mid_td = true;
                frame->status = -EOVERFLOW;
                break;

I don't know yet what COMP_ISOCH_BUFFER_OVERRUN means, but I guess it's
the same story. BTW, error_mid_td is a local variable now and I use the
urb_length_set flag instead, as explained before.

I found that it can be reproduced on the VIA host, with enough tries it
can happen even on a chained TD. NEC doesn't signal these babble errors
but new mid TD event handling should cope with either host.

Debug trace ("interesting" is other than "success" or "short packet"):
[ 4113.376349] xhci_hcd 0000:03:00.0: handle_tx_event interesting ep_trb_dma 132961000 comp_code 3 slot 2 ep 2
[ 4113.376361] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
[ 4113.376364] xhci_hcd 0000:03:00.0: Error mid isoc TD, wait for final completion event
[ 4113.376366] xhci_hcd 0000:03:00.0: handle_tx_event uninteresting ep_trb_dma 132961010 comp_code 1 slot 2 ep 2
[ 4113.376369] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
[ 4113.376371] xhci_hcd 0000:03:00.0: Got SUCCESS after mid TD error
[ 4113.376373] xhci_hcd 0000:03:00.0: finish_td comp_code 1 status -75


Thanks,
Michal
Mathias Nyman Jan. 22, 2024, 11:47 a.m. UTC | #3
On 19.1.2024 23.54, Michał Pecio wrote:
> Usual tests passed (but only transaction errors are covered).

Thanks for testing, I'l add your a tested-by tag to the patch.

> 
> 
> I noticed that with your new fix to the frame length bug, error_mid_td
> always equals urb_length_set. So the new flag is perhaps not necessary
> after all. The test in handle_tx_event() could be:
> 
> if (usb_pipeisoc(td->urb->pipe) && td->urb_length_set &&
>      !list_is_last(...  ))
> 
> I tried it and it works just as well.

This is probably true here, but urb_length_set and error_mid_td have a bit
different idea behind them

For example short control transfers may send an event mid TD (during data stage).
There is no error, we wait for the completion of the status stage but don't want
to change the transfer length set in the urb.

I suspect we will need to play with both flags for other error cases in the future.

Thanks
Mathias
Mathias Nyman Jan. 22, 2024, 1:37 p.m. UTC | #4
On 22.1.2024 11.03, Michał Pecio wrote:
>> Apparently a babble error, and it seems to have generated a "success"
>> which the event handler tried to match with the next TD. So a mid TD
>> babble may need the same treatment, which is not surprising.

Makes sense.

> 
> This is now confirmed and fixed here. The change is obvious enough:
>          case COMP_ISOCH_BUFFER_OVERRUN:
>          case COMP_BABBLE_DETECTED_ERROR:
> +               error_mid_td = true;
>                  frame->status = -EOVERFLOW;
>                  break;
> 
> I don't know yet what COMP_ISOCH_BUFFER_OVERRUN means, but I guess it's
> the same story. BTW, error_mid_td is a local variable now and I use the
> urb_length_set flag instead, as explained before.

To me it looks like COMP_BABBLE_DETECTED_ERROR and COMP_ISOCH_BUFFER_OVERRUN
have the same cause, device is sending too much data.

Isoc endpoints should use COMP_ISOCH_BUFFER_OVERRUN to indicate endpoint
hasn't halted like it does in the COMP_BABBLE_DETECTED_ERROR case.

See xhci 6.4.5 "TRB completion codes" Footnote 115

> 
> I found that it can be reproduced on the VIA host, with enough tries it
> can happen even on a chained TD. NEC doesn't signal these babble errors
> but new mid TD event handling should cope with either host.

So looks like VIA host incorrectly sends babble for Isoc endpoints

> 
> Debug trace ("interesting" is other than "success" or "short packet"):
> [ 4113.376349] xhci_hcd 0000:03:00.0: handle_tx_event interesting ep_trb_dma 132961000 comp_code 3 slot 2 ep 2
> [ 4113.376361] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
> [ 4113.376364] xhci_hcd 0000:03:00.0: Error mid isoc TD, wait for final completion event
> [ 4113.376366] xhci_hcd 0000:03:00.0: handle_tx_event uninteresting ep_trb_dma 132961010 comp_code 1 slot 2 ep 2
> [ 4113.376369] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010
> [ 4113.376371] xhci_hcd 0000:03:00.0: Got SUCCESS after mid TD error
> [ 4113.376373] xhci_hcd 0000:03:00.0: finish_td comp_code 1 status -75

I'm afraid we end up tuning that original patch forever with these new findings,
so lets make this into a new patch on top of the previous one.
That one is tested and known to work in the transaction error case.

Let me know if you want to write it, otherwise I will

Thanks
Mathias
Michał Pecio Jan. 22, 2024, 5:10 p.m. UTC | #5
> I'm afraid we end up tuning that original patch forever
Point taken. This patch is more important, as it fixes total failure
to deal with trivial bit errors on uPD720200.


> Isoc endpoints should use COMP_ISOCH_BUFFER_OVERRUN to indicate
> endpoint hasn't halted like it does in the COMP_BABBLE_DETECTED_ERROR
> case.
> 
> See xhci 6.4.5 "TRB completion codes" Footnote 115
This applies to "TD Babble", which is:
submitted TD size < data from the device <= ESIT payload

Other babble conditions exist related to blatant device bugs (4.10.2.4)
and it looks like some may result in a halt or device disconnection even
on isoch endpoints(?)

Other babbles aren't included in "isoch buffer overrun" per 6.4.5.

I feel like one would need to systematically go through the whole spec
to avoid missing important details or special cases.


Thanks,
Michal
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 33806ae966f9..41be7d31a36e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2376,6 +2376,9 @@  static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 	/* handle completion code */
 	switch (trb_comp_code) {
 	case COMP_SUCCESS:
+		/* Don't overwrite status if TD had an error, see xHCI 4.9.1 */
+		if (td->error_mid_td)
+			break;
 		if (remaining) {
 			frame->status = short_framestatus;
 			if (xhci->quirks & XHCI_TRUST_TX_LENGTH)
@@ -2401,8 +2404,9 @@  static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	case COMP_USB_TRANSACTION_ERROR:
 		frame->status = -EPROTO;
+		sum_trbs_for_length = true;
 		if (ep_trb != td->last_trb)
-			return 0;
+			td->error_mid_td = true;
 		break;
 	case COMP_STOPPED:
 		sum_trbs_for_length = true;
@@ -2422,6 +2426,9 @@  static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	}
 
+	if (td->urb_length_set)
+		goto finish_td;
+
 	if (sum_trbs_for_length)
 		frame->actual_length = sum_trb_lengths(xhci, ep->ring, ep_trb) +
 			ep_trb_len - remaining;
@@ -2430,6 +2437,14 @@  static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 
 	td->urb->actual_length += frame->actual_length;
 
+finish_td:
+	/* Don't give back TD yet if we encountered an error mid TD */
+	if (td->error_mid_td && ep_trb != td->last_trb) {
+		xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n");
+		td->urb_length_set = true;
+		return 0;
+	}
+
 	return finish_td(xhci, ep, ep_ring, td, trb_comp_code);
 }
 
@@ -2808,17 +2823,51 @@  static int handle_tx_event(struct xhci_hcd *xhci,
 		}
 
 		if (!ep_seg) {
-			if (!ep->skip ||
-			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
-				/* Some host controllers give a spurious
-				 * successful event after a short transfer.
-				 * Ignore it.
-				 */
-				if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
-						ep_ring->last_td_was_short) {
-					ep_ring->last_td_was_short = false;
-					goto cleanup;
+
+			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
+				skip_isoc_td(xhci, td, ep, status);
+				goto cleanup;
+			}
+
+			/*
+			 * Some hosts give a spurious success event after a short
+			 * transfer. Ignore it.
+			 */
+			if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
+			    ep_ring->last_td_was_short) {
+				ep_ring->last_td_was_short = false;
+				goto cleanup;
+			}
+
+			/*
+			 * xhci 4.10.2 states isoc endpoints should continue
+			 * processing the next TD if there was an error mid TD.
+			 * So host like NEC don't generate an event for the last
+			 * isoc TRB even if the IOC flag is set.
+			 * xhci 4.9.1 states that if there are errors in mult-TRB
+			 * TDs xHC should generate an error for that TRB, and if xHC
+			 * proceeds to the next TD it should genete an event for
+			 * any TRB with IOC flag on the way. Other host follow this.
+			 * So this event might be for the next TD.
+			 */
+			if (td->error_mid_td &&
+			    !list_is_last(&td->td_list, &ep_ring->td_list)) {
+				struct xhci_td *td_next = list_next_entry(td, td_list);
+
+				ep_seg = trb_in_td(xhci, td_next->start_seg, td_next->first_trb,
+						   td_next->last_trb, ep_trb_dma, false);
+				if (ep_seg) {
+					/* give back previous TD, start handling new */
+					xhci_dbg(xhci, "Missing TD completion event after mid TD error\n");
+					ep_ring->dequeue = td->last_trb;
+					ep_ring->deq_seg = td->last_trb_seg;
+					inc_deq(xhci, ep_ring);
+					xhci_td_cleanup(xhci, td, ep_ring, td->status);
+					td = td_next;
 				}
+			}
+
+			if (!ep_seg) {
 				/* HC is busted, give up! */
 				xhci_err(xhci,
 					"ERROR Transfer event TRB DMA ptr not "
@@ -2830,9 +2879,6 @@  static int handle_tx_event(struct xhci_hcd *xhci,
 					  ep_trb_dma, true);
 				return -ESHUTDOWN;
 			}
-
-			skip_isoc_td(xhci, td, ep, status);
-			goto cleanup;
 		}
 		if (trb_comp_code == COMP_SHORT_PACKET)
 			ep_ring->last_td_was_short = true;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index a5c72a634e6a..6f82d404883f 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1549,6 +1549,7 @@  struct xhci_td {
 	struct xhci_segment	*bounce_seg;
 	/* actual_length of the URB has already been set */
 	bool			urb_length_set;
+	bool			error_mid_td;
 	unsigned int		num_trbs;
 };