diff mbox series

[RFT] xhci: process isoc TD properly when there was an error mid TD.

Message ID 20240116153618.2527463-1-mathias.nyman@linux.intel.com (mailing list archive)
State Superseded
Headers show
Series [RFT] xhci: process isoc TD properly when there was an error mid TD. | expand

Commit Message

Mathias Nyman Jan. 16, 2024, 3:36 p.m. UTC
The last TRB of a isoc TD might not trigger an event if there was
an error event for TRB mid TD. This is seen on a  NEC Corporation
uPD720200 USB 3.0 Host

Driver and xHC host get out of sync as driver is still expecting a
transfer event for that first TD, while xHC host is already sending
events for the next TD in the list. This leads to
"Transfer event TRB DMA ptr not part of current TD" messages.

As a solution we tag the isoc TDs that get error events mid TD.
If an event doesn't match the first TD, then check if the tag is
set, and event points to the next TD.
In that case give back the fist TD and process the next TD normally

Reported-by: Michał Pecio <michal.pecio@gmail.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 55 +++++++++++++++++++++++++++---------
 drivers/usb/host/xhci.h      |  1 +
 2 files changed, 42 insertions(+), 14 deletions(-)

Comments

Michał Pecio Jan. 16, 2024, 10:20 p.m. UTC | #1
I applied your patch on v6.7 and it appears to be working. It removes
the disconnection spam and also handles intermittent transmission errors
on UVC without obvious glitches or errors messages, except one xhci_dbg
added to confirm that I'm really hitting this edge case.

Anything else that might be worth testing?


I have a question, though. What happens if there is no next TD because
a mid TD error has occured on the last packet queued by the client? Is
there any mechanism to retire that stuck TD on a NEC host which submits
one mid TD error event and then goes silent?

Would it be possible to retire the TD right after the first failed TRB?
(I imagine difficulties in determining when exactly the host has moved
its internal pointer past the remaining TRBs so they can be reused).


> -			if (!ep->skip ||
> -			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
> +			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
I like this. I would suggest another cleanup: the if(!ep_seg && stuf)
right above your change could be pulled inside if(!ep_seg).


> +			 * if there was an error event mid TD then host may not
> +			 * give an event for the last TRB on an isoc TD.
> +			 * This event can be for the next TD, See xHCI 4.9.1.
This seems to suggest that 4.9.1 encourages such behavior, but the
opposite is the case as far as I understand.


> +			if (td->error_mid_td) {
> +				struct xhci_td *td_next = list_next_entry(td, td_list);
This if needs && !list_is_last(&td->td_list, &ep_ring->td_list).

Otherwise a serious bug in the host (maybe in the driver too) tricks
us into grabbing a pointer to ep_ring instead, filling the subsequent
"TRB not part of current TD" message with mystifying garbage numbers.


> +				if (ep_seg) {
> +					/* give back previous TD, start handling new */
Suggested:
> +					xhci_dbg(xhci, "Missing completion event after mid TD error\n");


Thanks,
Michal
Mathias Nyman Jan. 17, 2024, 10:46 a.m. UTC | #2
On 17.1.2024 0.20, Michał Pecio wrote:
> I applied your patch on v6.7 and it appears to be working. It removes
> the disconnection spam and also handles intermittent transmission errors
> on UVC without obvious glitches or errors messages, except one xhci_dbg
> added to confirm that I'm really hitting this edge case.
> 
> Anything else that might be worth testing?
> 
> 
> I have a question, though. What happens if there is no next TD because
> a mid TD error has occured on the last packet queued by the client? Is
> there any mechanism to retire that stuck TD on a NEC host which submits
> one mid TD error event and then goes silent?

In disconnect cases usb core should flush the remaining URBs once
roothub code notices the disconnect.

But yes, if the last TD in a URB is a multi TRB isoc TD, and it has an error
MID TD then its stuck until timeout.

> 
> Would it be possible to retire the TD right after the first failed TRB?
> (I imagine difficulties in determining when exactly the host has moved
> its internal pointer past the remaining TRBs so they can be reused).

Probably not as a normal error handling routine.
We have the same "Transfer event TRB DMA ptr not part of current TD" issue
for hosts that do issue an event for the last TRB.

If the TD is given back immediately we also have a memory issue as the
DMA address pointed to by that last TRB might be accessed by the controller
_after_ driver gave back the TD, and possibly freed/unmapped it.

But for that special case where there are no more TDs queued it might
make sense

> 
> 
>> -			if (!ep->skip ||
>> -			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
>> +			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
> I like this. I would suggest another cleanup: the if(!ep_seg && stuf)
> right above your change could be pulled inside if(!ep_seg).

Noticed the same, but for stable kernel reasons it's probably better to limit
this patch to mostly fixing this bug.

> 
> 
>> +			 * if there was an error event mid TD then host may not
>> +			 * give an event for the last TRB on an isoc TD.
>> +			 * This event can be for the next TD, See xHCI 4.9.1.
> This seems to suggest that 4.9.1 encourages such behavior, but the
> opposite is the case as far as I understand.

I'll rephrase this.

> 
> 
>> +			if (td->error_mid_td) {
>> +				struct xhci_td *td_next = list_next_entry(td, td_list);
> This if needs && !list_is_last(&td->td_list, &ep_ring->td_list).

Thanks, nice catch, good point.

> 
> Otherwise a serious bug in the host (maybe in the driver too) tricks
> us into grabbing a pointer to ep_ring instead, filling the subsequent
> "TRB not part of current TD" message with mystifying garbage numbers.
> 
> 
>> +				if (ep_seg) {
>> +					/* give back previous TD, start handling new */
> Suggested:
>> +					xhci_dbg(xhci, "Missing completion event after mid TD error\n");

Makes sense.

Thanks
Mathias
Michał Pecio Jan. 17, 2024, 5:49 p.m. UTC | #3
> But yes, if the last TD in a URB is a multi TRB isoc TD, and it has
> an error MID TD then its stuck until timeout.

If there are timeouts to deal with hosts failing to respond then that's
good enough for me. It should be a rare case anyway.

I just don't like when stuff locks up forever and requires reconnection
or reboot to clean up.


> > Would it be possible to retire the TD right after the first failed
> > TRB?
> 
> Probably not as a normal error handling routine.
> We have the same "Transfer event TRB DMA ptr not part of current TD"
> issue for hosts that do issue an event for the last TRB.

Obviously it would require keeping some information about the retired
TD to detect subsequent completions and to prevent freeing of its
remaining TRBs. Probably much more effort than the current approach.
 
> But for that special case where there are no more TDs queued it might
> make sense

Wouldn't it still require remembering not to free the TRBs too fast?
(It seems to have worked in the early days, but feels dodgy).


> +					xhci_dbg(xhci, "Missing completion event after mid TD error\n");
On second thought, this could also print ep_trb_dma to be useful in
debugging "TRB not part of current TD" issues. Although anyone able to
compile the kernel with DYNAMIC_DEBUG could also edit as necessary...
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 33806ae966f9..4869005379f5 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -2401,8 +2401,11 @@  static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep,
 		break;
 	case COMP_USB_TRANSACTION_ERROR:
 		frame->status = -EPROTO;
-		if (ep_trb != td->last_trb)
+		if (ep_trb != td->last_trb) {
+			td->error_mid_td = true;
+			/* FIXME update actual_length */
 			return 0;
+		}
 		break;
 	case COMP_STOPPED:
 		sum_trbs_for_length = true;
@@ -2808,17 +2811,44 @@  static int handle_tx_event(struct xhci_hcd *xhci,
 		}
 
 		if (!ep_seg) {
-			if (!ep->skip ||
-			    !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
-				/* Some host controllers give a spurious
-				 * successful event after a short transfer.
-				 * Ignore it.
-				 */
-				if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
-						ep_ring->last_td_was_short) {
-					ep_ring->last_td_was_short = false;
-					goto cleanup;
+
+			if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) {
+				skip_isoc_td(xhci, td, ep, status);
+				goto cleanup;
+			}
+
+			/*
+			 * Some hosts give a spurious success event after a short
+			 * transfer. Ignore it.
+			 */
+			if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) &&
+			    ep_ring->last_td_was_short) {
+				ep_ring->last_td_was_short = false;
+				goto cleanup;
+			}
+
+			/*
+			 * if there was an error event mid TD then host may not
+			 * give an event for the last TRB on an isoc TD.
+			 * This event can be for the next TD, See xHCI 4.9.1.
+			 */
+			if (td->error_mid_td) {
+				struct xhci_td *td_next = list_next_entry(td, td_list);
+
+				ep_seg = trb_in_td(xhci, td_next->start_seg, td_next->first_trb,
+						   td_next->last_trb, ep_trb_dma, false);
+				if (ep_seg) {
+					/* give back previous TD, start handling new */
+					ep_ring->dequeue = td->last_trb;
+					ep_ring->deq_seg = td->last_trb_seg;
+					inc_deq(xhci, ep_ring);
+					xhci_td_cleanup(xhci, td, ep_ring, td->status);
+					td = td_next;
 				}
+
+			}
+
+			if (!ep_seg) {
 				/* HC is busted, give up! */
 				xhci_err(xhci,
 					"ERROR Transfer event TRB DMA ptr not "
@@ -2830,9 +2860,6 @@  static int handle_tx_event(struct xhci_hcd *xhci,
 					  ep_trb_dma, true);
 				return -ESHUTDOWN;
 			}
-
-			skip_isoc_td(xhci, td, ep, status);
-			goto cleanup;
 		}
 		if (trb_comp_code == COMP_SHORT_PACKET)
 			ep_ring->last_td_was_short = true;
diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h
index a5c72a634e6a..6f82d404883f 100644
--- a/drivers/usb/host/xhci.h
+++ b/drivers/usb/host/xhci.h
@@ -1549,6 +1549,7 @@  struct xhci_td {
 	struct xhci_segment	*bounce_seg;
 	/* actual_length of the URB has already been set */
 	bool			urb_length_set;
+	bool			error_mid_td;
 	unsigned int		num_trbs;
 };