mbox series

[v4,0/3] xhci: Fix Stop Endpoint problems

Message ID 20241104103200.533fe1fb@foxbook (mailing list archive)
Headers show
Series xhci: Fix Stop Endpoint problems | expand

Message

Michał Pecio Nov. 4, 2024, 9:32 a.m. UTC
v4 of the attempt to sort out Stop Endpoint retries and related issues.

General design:

1/3 guarantee no infinite retries under any failure conditions, ever.
2/3 fix bugs and make Set TR Dequeue handler call invalidate/giveback.
3/3 eliminate cases of pointless Stop EP commands, depends on 2/3.

Changes since v3:

2/3 no longer tries to add code cleanups (and bugs, as it turned out)
    to a bugfix patch, and became much shorter as a result.
3/3 dropped a desperate attempt at detecting a halt/unlink race, which
    I was unable to reproduce because it seems impossible - EP_HALTED
    or SET_DEQ_PENDING remains set until after the halted URB is given
    back to USB core, so we would never queue a pointless Stop command.

Also some commit message and comment cleanups in all three.
And I documented another bug which 2/3 appears to (partly) fix.


The state of other chips:

VIA VL805 and Etron EJ168A don't seem to have the "latent restart" bug
so they don't care about this code, unless it causes pointless retries
on an already stopped endpoint for 100ms, if it's enabled for them.

ASM1042 and ASM3142 have the bug, although not as common as on NEC. It
seems to be particularly triggered by multiple endpoints cancelling at
the same time, so for example, both loops below simultaneously:

while true; do timeout .3 yavta -c /dev/video0; done
while true; do timeout .1 cat /dev/ttyUSB0 >/dev/null; done

On ASM3142 this triggers the bug about once per minute, and prints:

WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.

The "mitigation" patch seems to be helping, because reverting it adds:

ERROR Transfer event TRB DMA ptr not part of current TD ep_index 6 comp_code 13
Looking for event-dma 0000000003715ec0 trb-start 0000000003715ed0 trb-end 0000000003715ed0 seg-start 0000000003715000 seg-end 0000000003715ff0

Enabling the NEC workaround for this chip appears to fix the problem
entirely and so far it only took one retry each time:

Stop ep completion ctx error, ctx_state 3

I have left this running overninght and nothing unexpected was logged
in dmesg. IOW, no undesirable side effects known so far. But I haven't
done much tests with bad cables yet, and ASMedia chips appear to be
quite buggy and prone to strange internal race conditions. Examples:

- the ugly ifconfig up/down lockup issue on 3142 (not tried 1042 yet)
- a pair of "stopped" events out of order preceding the above lockup
- on 1042 Set Deq on streams seems to be no-op (streams are disabled)
- Soft Retry preceding a Set Deq seems to undo the Set Deq's effect
- Stop Endpoint *success* and then a spontaneous restart (1042 only)
- spontaneous restart causes a halt, but transfer event is missing
- TRB 15 (Stop EP) on the event ring (seen only once on 1042)

ASM1042 still not tested for compatilibity with the workaround at all.


Michal Pecio (3):
  usb: xhci: Limit Stop Endpoint retries
  usb: xhci: Fix TD invalidation under pending Set TR Dequeue
  usb: xhci: Avoid queuing redundant Stop Endpoint commands

 drivers/usb/host/xhci-ring.c | 59 ++++++++++++++++++++++++++++++------
 drivers/usb/host/xhci.c      | 21 ++++++++++---
 drivers/usb/host/xhci.h      |  2 ++
 3 files changed, 69 insertions(+), 13 deletions(-)

Comments

Mathias Nyman Nov. 5, 2024, 11:05 a.m. UTC | #1
On 4.11.2024 11.32, Michal Pecio wrote:
> v4 of the attempt to sort out Stop Endpoint retries and related issues.
> 
> General design:
> 
> 1/3 guarantee no infinite retries under any failure conditions, ever.
> 2/3 fix bugs and make Set TR Dequeue handler call invalidate/giveback.
> 3/3 eliminate cases of pointless Stop EP commands, depends on 2/3.
> 
> Changes since v3:
> 
> 2/3 no longer tries to add code cleanups (and bugs, as it turned out)
>      to a bugfix patch, and became much shorter as a result.
> 3/3 dropped a desperate attempt at detecting a halt/unlink race, which
>      I was unable to reproduce because it seems impossible - EP_HALTED
>      or SET_DEQ_PENDING remains set until after the halted URB is given
>      back to USB core, so we would never queue a pointless Stop command.
> 
> Also some commit message and comment cleanups in all three.
> And I documented another bug which 2/3 appears to (partly) fix.
> 
> 
> The state of other chips:
> 
> VIA VL805 and Etron EJ168A don't seem to have the "latent restart" bug
> so they don't care about this code, unless it causes pointless retries
> on an already stopped endpoint for 100ms, if it's enabled for them.
> 
> ASM1042 and ASM3142 have the bug, although not as common as on NEC. It
> seems to be particularly triggered by multiple endpoints cancelling at
> the same time, so for example, both loops below simultaneously:
> 
> while true; do timeout .3 yavta -c /dev/video0; done
> while true; do timeout .1 cat /dev/ttyUSB0 >/dev/null; done
> 
> On ASM3142 this triggers the bug about once per minute, and prints:
> 
> WARN Set TR Deq Ptr cmd failed due to incorrect slot or ep state.
> 
> The "mitigation" patch seems to be helping, because reverting it adds:
> 
> ERROR Transfer event TRB DMA ptr not part of current TD ep_index 6 comp_code 13
> Looking for event-dma 0000000003715ec0 trb-start 0000000003715ed0 trb-end 0000000003715ed0 seg-start 0000000003715000 seg-end 0000000003715ff0
> 
> Enabling the NEC workaround for this chip appears to fix the problem
> entirely and so far it only took one retry each time:
> 
> Stop ep completion ctx error, ctx_state 3
> 
> I have left this running overninght and nothing unexpected was logged
> in dmesg. IOW, no undesirable side effects known so far. But I haven't
> done much tests with bad cables yet, and ASMedia chips appear to be
> quite buggy and prone to strange internal race conditions. Examples:
> 
> - the ugly ifconfig up/down lockup issue on 3142 (not tried 1042 yet)
> - a pair of "stopped" events out of order preceding the above lockup
> - on 1042 Set Deq on streams seems to be no-op (streams are disabled)
> - Soft Retry preceding a Set Deq seems to undo the Set Deq's effect
> - Stop Endpoint *success* and then a spontaneous restart (1042 only)
> - spontaneous restart causes a halt, but transfer event is missing
> - TRB 15 (Stop EP) on the event ring (seen only once on 1042)
> 
> ASM1042 still not tested for compatilibity with the workaround at all.

Thanks for all the work you put into this series.
I added all patches to my for-usb-next branch and will send them to Greg shortly.

I think this code won't be NEC specific for very long as other vendors
have these similar issues.

Out of curiosity, what was the longest needed 'stop endpoint' retry time
you saw which still had a successful outcome?

Thanks
Mathias
Michał Pecio Nov. 5, 2024, 12:18 p.m. UTC | #2
On Tue, 5 Nov 2024 13:05:19 +0200, Mathias Nyman wrote:
> Out of curiosity, what was the longest needed 'stop endpoint' retry
> time you saw which still had a successful outcome?

On NEC, periodic endpoints appear to take up to one ESIT. It's one or
two retries on fast isochronous EPs, but interrupt may take a while.

The longest I have seen was 24ms on a SuperSpeed EP with bInterval 11,
so it looks like this may be the chip's limit (bInterval 11 = 128ms),
but admittedly, I haven't tested this thoroughly with many devices.

No problems ever seen with bulk or control on NEC.

On ASM3142 it seems that one retry tends to be enough, and IIRC all
endpoint types can misbehave.


I attached a patch, based on the earlier "redundant" patch, which tries
to detect these bugs by looping until it sees EP Context state change.
Looping under spinlock sure is dodgy, but with these timeout values it
doesn't seem to completely break the driver and it finds these HC bugs.

And yes, I have seen the "absolutely fubar" case on ASM1042 with bad
cable (repeated halts, resets and stops). Some time later it "died".

Regards,
Michal