Message ID | 20240119105835.2637358-1-mathias.nyman@linux.intel.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 5372c65e1311a16351ef03dd096ff576e6477674 |
Headers | show |
Series | [RTF,v3] xhci: process isoc TD properly when there was an error mid TD. | expand |
Usual tests passed (but only transaction errors are covered). I noticed that with your new fix to the frame length bug, error_mid_td always equals urb_length_set. So the new flag is perhaps not necessary after all. The test in handle_tx_event() could be: if (usb_pipeisoc(td->urb->pipe) && td->urb_length_set && !list_is_last(... )) I tried it and it works just as well. While disconnecting my test camera from the VIA, I got this: [95511.647441] xhci_hcd 0000:03:00.0: finishing TD with status -75 (comp_code 3 error_mid_td 0) [95511.647453] xhci_hcd 0000:03:00.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1 [95511.647457] xhci_hcd 0000:03:00.0: Looking for event-dma 0000000135121710 trb-start 0000000135121720 trb-end 0000000135121720 seg-start 0000000135121000 seg-end 0000000135121ff0 ... followed by some ordinary transaction errors 2 milliseconds later. Apparently a babble error, and it seems to have generated a "success" which the event handler tried to match with the next TD. So a mid TD babble may need the same treatment, which is not surprising. Unfortunately I didn't print enough debug info to confirm this with certainty and it is absolutely unreproducible, I have no idea why it happened.
> Apparently a babble error, and it seems to have generated a "success" > which the event handler tried to match with the next TD. So a mid TD > babble may need the same treatment, which is not surprising. This is now confirmed and fixed here. The change is obvious enough: case COMP_ISOCH_BUFFER_OVERRUN: case COMP_BABBLE_DETECTED_ERROR: + error_mid_td = true; frame->status = -EOVERFLOW; break; I don't know yet what COMP_ISOCH_BUFFER_OVERRUN means, but I guess it's the same story. BTW, error_mid_td is a local variable now and I use the urb_length_set flag instead, as explained before. I found that it can be reproduced on the VIA host, with enough tries it can happen even on a chained TD. NEC doesn't signal these babble errors but new mid TD event handling should cope with either host. Debug trace ("interesting" is other than "success" or "short packet"): [ 4113.376349] xhci_hcd 0000:03:00.0: handle_tx_event interesting ep_trb_dma 132961000 comp_code 3 slot 2 ep 2 [ 4113.376361] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010 [ 4113.376364] xhci_hcd 0000:03:00.0: Error mid isoc TD, wait for final completion event [ 4113.376366] xhci_hcd 0000:03:00.0: handle_tx_event uninteresting ep_trb_dma 132961010 comp_code 1 slot 2 ep 2 [ 4113.376369] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010 [ 4113.376371] xhci_hcd 0000:03:00.0: Got SUCCESS after mid TD error [ 4113.376373] xhci_hcd 0000:03:00.0: finish_td comp_code 1 status -75 Thanks, Michal
On 19.1.2024 23.54, Michał Pecio wrote: > Usual tests passed (but only transaction errors are covered). Thanks for testing, I'l add your a tested-by tag to the patch. > > > I noticed that with your new fix to the frame length bug, error_mid_td > always equals urb_length_set. So the new flag is perhaps not necessary > after all. The test in handle_tx_event() could be: > > if (usb_pipeisoc(td->urb->pipe) && td->urb_length_set && > !list_is_last(... )) > > I tried it and it works just as well. This is probably true here, but urb_length_set and error_mid_td have a bit different idea behind them For example short control transfers may send an event mid TD (during data stage). There is no error, we wait for the completion of the status stage but don't want to change the transfer length set in the urb. I suspect we will need to play with both flags for other error cases in the future. Thanks Mathias
On 22.1.2024 11.03, Michał Pecio wrote: >> Apparently a babble error, and it seems to have generated a "success" >> which the event handler tried to match with the next TD. So a mid TD >> babble may need the same treatment, which is not surprising. Makes sense. > > This is now confirmed and fixed here. The change is obvious enough: > case COMP_ISOCH_BUFFER_OVERRUN: > case COMP_BABBLE_DETECTED_ERROR: > + error_mid_td = true; > frame->status = -EOVERFLOW; > break; > > I don't know yet what COMP_ISOCH_BUFFER_OVERRUN means, but I guess it's > the same story. BTW, error_mid_td is a local variable now and I use the > urb_length_set flag instead, as explained before. To me it looks like COMP_BABBLE_DETECTED_ERROR and COMP_ISOCH_BUFFER_OVERRUN have the same cause, device is sending too much data. Isoc endpoints should use COMP_ISOCH_BUFFER_OVERRUN to indicate endpoint hasn't halted like it does in the COMP_BABBLE_DETECTED_ERROR case. See xhci 6.4.5 "TRB completion codes" Footnote 115 > > I found that it can be reproduced on the VIA host, with enough tries it > can happen even on a chained TD. NEC doesn't signal these babble errors > but new mid TD event handling should cope with either host. So looks like VIA host incorrectly sends babble for Isoc endpoints > > Debug trace ("interesting" is other than "success" or "short packet"): > [ 4113.376349] xhci_hcd 0000:03:00.0: handle_tx_event interesting ep_trb_dma 132961000 comp_code 3 slot 2 ep 2 > [ 4113.376361] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010 > [ 4113.376364] xhci_hcd 0000:03:00.0: Error mid isoc TD, wait for final completion event > [ 4113.376366] xhci_hcd 0000:03:00.0: handle_tx_event uninteresting ep_trb_dma 132961010 comp_code 1 slot 2 ep 2 > [ 4113.376369] xhci_hcd 0000:03:00.0: handle_tx_event first_trb 132961000 last_trb 132961010 > [ 4113.376371] xhci_hcd 0000:03:00.0: Got SUCCESS after mid TD error > [ 4113.376373] xhci_hcd 0000:03:00.0: finish_td comp_code 1 status -75 I'm afraid we end up tuning that original patch forever with these new findings, so lets make this into a new patch on top of the previous one. That one is tested and known to work in the transaction error case. Let me know if you want to write it, otherwise I will Thanks Mathias
> I'm afraid we end up tuning that original patch forever Point taken. This patch is more important, as it fixes total failure to deal with trivial bit errors on uPD720200. > Isoc endpoints should use COMP_ISOCH_BUFFER_OVERRUN to indicate > endpoint hasn't halted like it does in the COMP_BABBLE_DETECTED_ERROR > case. > > See xhci 6.4.5 "TRB completion codes" Footnote 115 This applies to "TD Babble", which is: submitted TD size < data from the device <= ESIT payload Other babble conditions exist related to blatant device bugs (4.10.2.4) and it looks like some may result in a halt or device disconnection even on isoch endpoints(?) Other babbles aren't included in "isoch buffer overrun" per 6.4.5. I feel like one would need to systematically go through the whole spec to avoid missing important details or special cases. Thanks, Michal
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 33806ae966f9..41be7d31a36e 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -2376,6 +2376,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, /* handle completion code */ switch (trb_comp_code) { case COMP_SUCCESS: + /* Don't overwrite status if TD had an error, see xHCI 4.9.1 */ + if (td->error_mid_td) + break; if (remaining) { frame->status = short_framestatus; if (xhci->quirks & XHCI_TRUST_TX_LENGTH) @@ -2401,8 +2404,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, break; case COMP_USB_TRANSACTION_ERROR: frame->status = -EPROTO; + sum_trbs_for_length = true; if (ep_trb != td->last_trb) - return 0; + td->error_mid_td = true; break; case COMP_STOPPED: sum_trbs_for_length = true; @@ -2422,6 +2426,9 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, break; } + if (td->urb_length_set) + goto finish_td; + if (sum_trbs_for_length) frame->actual_length = sum_trb_lengths(xhci, ep->ring, ep_trb) + ep_trb_len - remaining; @@ -2430,6 +2437,14 @@ static int process_isoc_td(struct xhci_hcd *xhci, struct xhci_virt_ep *ep, td->urb->actual_length += frame->actual_length; +finish_td: + /* Don't give back TD yet if we encountered an error mid TD */ + if (td->error_mid_td && ep_trb != td->last_trb) { + xhci_dbg(xhci, "Error mid isoc TD, wait for final completion event\n"); + td->urb_length_set = true; + return 0; + } + return finish_td(xhci, ep, ep_ring, td, trb_comp_code); } @@ -2808,17 +2823,51 @@ static int handle_tx_event(struct xhci_hcd *xhci, } if (!ep_seg) { - if (!ep->skip || - !usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { - /* Some host controllers give a spurious - * successful event after a short transfer. - * Ignore it. - */ - if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) && - ep_ring->last_td_was_short) { - ep_ring->last_td_was_short = false; - goto cleanup; + + if (ep->skip && usb_endpoint_xfer_isoc(&td->urb->ep->desc)) { + skip_isoc_td(xhci, td, ep, status); + goto cleanup; + } + + /* + * Some hosts give a spurious success event after a short + * transfer. Ignore it. + */ + if ((xhci->quirks & XHCI_SPURIOUS_SUCCESS) && + ep_ring->last_td_was_short) { + ep_ring->last_td_was_short = false; + goto cleanup; + } + + /* + * xhci 4.10.2 states isoc endpoints should continue + * processing the next TD if there was an error mid TD. + * So host like NEC don't generate an event for the last + * isoc TRB even if the IOC flag is set. + * xhci 4.9.1 states that if there are errors in mult-TRB + * TDs xHC should generate an error for that TRB, and if xHC + * proceeds to the next TD it should genete an event for + * any TRB with IOC flag on the way. Other host follow this. + * So this event might be for the next TD. + */ + if (td->error_mid_td && + !list_is_last(&td->td_list, &ep_ring->td_list)) { + struct xhci_td *td_next = list_next_entry(td, td_list); + + ep_seg = trb_in_td(xhci, td_next->start_seg, td_next->first_trb, + td_next->last_trb, ep_trb_dma, false); + if (ep_seg) { + /* give back previous TD, start handling new */ + xhci_dbg(xhci, "Missing TD completion event after mid TD error\n"); + ep_ring->dequeue = td->last_trb; + ep_ring->deq_seg = td->last_trb_seg; + inc_deq(xhci, ep_ring); + xhci_td_cleanup(xhci, td, ep_ring, td->status); + td = td_next; } + } + + if (!ep_seg) { /* HC is busted, give up! */ xhci_err(xhci, "ERROR Transfer event TRB DMA ptr not " @@ -2830,9 +2879,6 @@ static int handle_tx_event(struct xhci_hcd *xhci, ep_trb_dma, true); return -ESHUTDOWN; } - - skip_isoc_td(xhci, td, ep, status); - goto cleanup; } if (trb_comp_code == COMP_SHORT_PACKET) ep_ring->last_td_was_short = true; diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index a5c72a634e6a..6f82d404883f 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1549,6 +1549,7 @@ struct xhci_td { struct xhci_segment *bounce_seg; /* actual_length of the URB has already been set */ bool urb_length_set; + bool error_mid_td; unsigned int num_trbs; };
The last TRB of a isoc TD might not trigger an event if there was an error event for a TRB mid TD. This is seen on a NEC Corporation uPD720200 USB 3.0 Host After an error mid a multi-TRB TD the xHC should according to xhci 4.9.1 generate events for passed TRBs with IOC flag set if it proceeds to the next TD. This event is either a copy of the original error, or a "success" transfer event. If that event is missing then the driver and xHC host get out of sync as the driver is still expecting a transfer event for that first TD, while xHC host is already sending events for the next TD in the list. This leads to "Transfer event TRB DMA ptr not part of current TD" messages. As a solution we tag the isoc TDs that get error events mid TD. If an event doesn't match the first TD, then check if the tag is set, and event points to the next TD. In that case give back the fist TD and process the next TD normally Make sure TD status and transferred length stay valid in both cases with and without final TD completion event. Reported-by: Michał Pecio <michal.pecio@gmail.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> --- v2 -> v3 - Rework and fix setting TD transfer lengt issue in v2 - Minor commit message tuning v1 -> v2: - Check for empty TD list, suggested by Michał - Don't overwrite error status, suggested by Michał - Add a debug print for missed TD completion event, suggested by Michał - Reword commit message and code comments - Set correct TD transfer length in isoc error case drivers/usb/host/xhci-ring.c | 74 +++++++++++++++++++++++++++++------- drivers/usb/host/xhci.h | 1 + 2 files changed, 61 insertions(+), 14 deletions(-)