Message ID | 20240116153618.2527463-1-mathias.nyman@linux.intel.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [RFT] xhci: process isoc TD properly when there was an error mid TD. | expand |
I applied your patch on v6.7 and it appears to be working. It removes the disconnection spam and also handles intermittent transmission errors on UVC without obvious glitches or errors messages, except one xhci_dbg added to confirm that I'm really hitting this edge case. Anything else that might be worth testing? I have a question, though. What happens if there is no next TD because a mid TD error has occured on the last packet queued by the client? Is there any mechanism to retire that stuck TD on a NEC host which submits one mid TD error event and then goes silent? Would it be possible to retire the TD right after the first failed TRB? (I imagine difficulties in determining when exactly the host has moved its internal pointer past the remaining TRBs so they can be reused). > - if (!ep->skip || > - !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { > + if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { I like this. I would suggest another cleanup: the if(!ep_seg && stuf) right above your change could be pulled inside if(!ep_seg). > + * if there was an error event mid TD then host may not > + * give an event for the last TRB on an isoc TD. > + * This event can be for the next TD, See xHCI 4.9.1. This seems to suggest that 4.9.1 encourages such behavior, but the opposite is the case as far as I understand. > + if (td->error_mid_td) { > + struct xhci_td *td_next = list_next_entry(td, td_list); This if needs && !list_is_last(&td->td_list, &ep_ring->td_list). Otherwise a serious bug in the host (maybe in the driver too) tricks us into grabbing a pointer to ep_ring instead, filling the subsequent "TRB not part of current TD" message with mystifying garbage numbers. > + if (ep_seg) { > + /* give back previous TD, start handling new */ Suggested: > + xhci_dbg(xhci, "Missing completion event after mid TD error\n"); Thanks, Michal
On 17.1.2024 0.20, Michał Pecio wrote: > I applied your patch on v6.7 and it appears to be working. It removes > the disconnection spam and also handles intermittent transmission errors > on UVC without obvious glitches or errors messages, except one xhci_dbg > added to confirm that I'm really hitting this edge case. > > Anything else that might be worth testing? > > > I have a question, though. What happens if there is no next TD because > a mid TD error has occured on the last packet queued by the client? Is > there any mechanism to retire that stuck TD on a NEC host which submits > one mid TD error event and then goes silent? In disconnect cases usb core should flush the remaining URBs once roothub code notices the disconnect. But yes, if the last TD in a URB is a multi TRB isoc TD, and it has an error MID TD then its stuck until timeout. > > Would it be possible to retire the TD right after the first failed TRB? > (I imagine difficulties in determining when exactly the host has moved > its internal pointer past the remaining TRBs so they can be reused). Probably not as a normal error handling routine. We have the same "Transfer event TRB DMA ptr not part of current TD" issue for hosts that do issue an event for the last TRB. If the TD is given back immediately we also have a memory issue as the DMA address pointed to by that last TRB might be accessed by the controller _after_ driver gave back the TD, and possibly freed/unmapped it. But for that special case where there are no more TDs queued it might make sense > > >> - if (!ep->skip || >> - !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { >> + if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { > I like this. I would suggest another cleanup: the if(!ep_seg && stuf) > right above your change could be pulled inside if(!ep_seg). Noticed the same, but for stable kernel reasons it's probably better to limit this patch to mostly fixing this bug. > > >> + * if there was an error event mid TD then host may not >> + * give an event for the last TRB on an isoc TD. >> + * This event can be for the next TD, See xHCI 4.9.1. > This seems to suggest that 4.9.1 encourages such behavior, but the > opposite is the case as far as I understand. I'll rephrase this. > > >> + if (td->error_mid_td) { >> + struct xhci_td *td_next = list_next_entry(td, td_list); > This if needs && !list_is_last(&td->td_list, &ep_ring->td_list). Thanks, nice catch, good point. > > Otherwise a serious bug in the host (maybe in the driver too) tricks > us into grabbing a pointer to ep_ring instead, filling the subsequent > "TRB not part of current TD" message with mystifying garbage numbers. > > >> + if (ep_seg) { >> + /* give back previous TD, start handling new */ > Suggested: >> + xhci_dbg(xhci, "Missing completion event after mid TD error\n"); Makes sense. Thanks Mathias
> But yes, if the last TD in a URB is a multi TRB isoc TD, and it has > an error MID TD then its stuck until timeout. If there are timeouts to deal with hosts failing to respond then that's good enough for me. It should be a rare case anyway. I just don't like when stuff locks up forever and requires reconnection or reboot to clean up. > > Would it be possible to retire the TD right after the first failed > > TRB? > > Probably not as a normal error handling routine. > We have the same "Transfer event TRB DMA ptr not part of current TD" > issue for hosts that do issue an event for the last TRB. Obviously it would require keeping some information about the retired TD to detect subsequent completions and to prevent freeing of its remaining TRBs. Probably much more effort than the current approach. > But for that special case where there are no more TDs queued it might > make sense Wouldn't it still require remembering not to free the TRBs too fast? (It seems to have worked in the early days, but feels dodgy). > + xhci_dbg(xhci, "Missing completion event after mid TD error\n"); On second thought, this could also print ep_trb_dma to be useful in debugging "TRB not part of current TD" issues. Although anyone able to compile the kernel with DYNAMIC_DEBUG could also edit as necessary...
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 33806ae966f9..4869005379f5 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2401,8 +2401,11 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, break; case COMP_USB_TRANSACTION_ERROR: frame->status = -EPROTO; - if (ep_trb != td->last_trb) + if (ep_trb != td->last_trb) { + td->error_mid_td = true; + /* FIXME update actual_length */ return 0; + } break; case COMP_STOPPED: sum_trbs_for_length = true; @@ -2808,17 +2811,44 @@ static int handle_tx_event(struct xhci_hcd *xhci, } if (!ep_seg) { - if (!ep->skip || - !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { - /* Some host controllers give a spurious - * successful event after a short transfer. - * Ignore it. - */ - if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) && - ep_ring->last_td_was_short) { - ep_ring->last_td_was_short = false; - goto cleanup; + + if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { + skip_isoc_td(xhci, td, ep, status); + goto cleanup; + } + + /* + * Some hosts give a spurious success event after a short + * transfer. Ignore it. + */ + if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) && + ep_ring->last_td_was_short) { + ep_ring->last_td_was_short = false; + goto cleanup; + } + + /* + * if there was an error event mid TD then host may not + * give an event for the last TRB on an isoc TD. + * This event can be for the next TD, See xHCI 4.9.1. + */ + if (td->error_mid_td) { + struct xhci_td *td_next = list_next_entry(td, td_list); + + ep_seg = trb_in_td(xhci, td_next->start_seg, td_next->first_trb, + td_next->last_trb, ep_trb_dma, false); + if (ep_seg) { + /* give back previous TD, start handling new */ + ep_ring->dequeue = td->last_trb; + ep_ring->deq_seg = td->last_trb_seg; + inc_deq(xhci, ep_ring); + xhci_td_cleanup(xhci, td, ep_ring, td->status); + td = td_next; } + + } + + if (!ep_seg) { /* HC is busted, give up! */ xhci_err(xhci, "ERROR Transfer event TRB DMA ptr not " @@ -2830,9 +2860,6 @@ static int handle_tx_event(struct xhci_hcd *xhci, ep_trb_dma, true); return -ESHUTDOWN; } - - skip_isoc_td(xhci, td, ep, status); - goto cleanup; } if (trb_comp_code == COMP_SHORT_PACKET) ep_ring->last_td_was_short = true; diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index a5c72a634e6a..6f82d404883f 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1549,6 +1549,7 @@ struct xhci_td { struct xhci_segment *bounce_seg; /* actual_length of the URB has already been set */ bool urb_length_set; + bool error_mid_td; unsigned int num_trbs; };
The last TRB of a isoc TD might not trigger an event if there was an error event for TRB mid TD. This is seen on a NEC Corporation uPD720200 USB 3.0 Host Driver and xHC host get out of sync as driver is still expecting a transfer event for that first TD, while xHC host is already sending events for the next TD in the list. This leads to "Transfer event TRB DMA ptr not part of current TD" messages. As a solution we tag the isoc TDs that get error events mid TD. If an event doesn't match the first TD, then check if the tag is set, and event points to the next TD. In that case give back the fist TD and process the next TD normally Reported-by: Michał Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> --- drivers/usb/host/xhci-ring.c | 55 +++++++++++++++++++++++++++--------- drivers/usb/host/xhci.h | 1 + 2 files changed, 42 insertions(+), 14 deletions(-)