mbox series

[0/2] Fix the NEC stop bug workaround

Message ID 20241014210840.5941d336@foxbook (mailing list archive)
Headers show
Series Fix the NEC stop bug workaround | expand

Message

Michał Pecio Oct. 14, 2024, 7:08 p.m. UTC
Hi,

I found an unfortunate problem with my workaround for this hardware bug.

To recap, Stop Endpoint sometimes fails, the Endpoint Context says the
EP is Stopped, but cancelled TRBs are still executed. I found this bug
earlier this year and submitted a workaround, which retries the command
(sometimes a few times) and all is good.

This works fine for common cases, but what if the endpoint is really
stopped? Then Stop Endpoint is supposed to fail and fail it does. The
workaround code doesn't know that it happened and retries infinitely.

I have never seen it in normal use, but I devised a reliable repro.
The effect isn't pretty - no URBs can be cancelled, device gets stuck,
if unplugged it locks up connections/disconnections on the whole bus.

With some experimentation I found that the bug is a variant of the old
"stop after restart" issue - the doorbell ring is internally reordered
after the subsequent command. By busy-waiting I confirmed that EP state
which is initially seen as Stopped becomes Running some time later.

I came up with a few ways to deal with it, this patch implements #3
which I think is the lowest risk. But for completeness:

0. Do nothing, revert the old patch. Not great, we are back to those
races and DMA-after-free. Seems particularly dangereous on Int IN EPs,
which may take a few ms to start - plenty of time to reuse URB buffers.

1. Ring the doorbell before queuing another Stop Endpoint. This ensures
that the EP will restart even if it wasn't meant to yet. Then a retry
succeeds and we are done. Super-simple and seems to work 100% reliably,
but what if there are still more gotchas in this hardware?

2. Set a flag on doorbell ring, clear on Stop EP or Halt. Look at the
flag to decide if it's the bug or a legitimately stopped EP. But I
didn't like adding overhead to DB ring, even if almost unmeasurable.

3. Set a flag when we know that the command will fail. This appears to
actually be doable. Then we just look at the flag and retry only if it
wasn't supposed to fail.

And some further ideas, considered but not implemented yet:

4. If we know that the command will fail, don't queue it at all. Other
commands are pending in those cases, so modify their handlers to do
our work for us. A little more invasive than the simple fixes 1-3.

5. Maybe it would make sense to ensure that the command can't ever
retry infinitely. Just give up and call hc_died() after 5 seconds.

Regards,
Michal

Comments

Mathias Nyman Oct. 15, 2024, 12:23 p.m. UTC | #1
On 14.10.2024 22.08, Michal Pecio wrote:
> Hi,
> 
> I found an unfortunate problem with my workaround for this hardware bug.
> 
> To recap, Stop Endpoint sometimes fails, the Endpoint Context says the
> EP is Stopped, but cancelled TRBs are still executed. I found this bug
> earlier this year and submitted a workaround, which retries the command
> (sometimes a few times) and all is good.
> 
> This works fine for common cases, but what if the endpoint is really
> stopped? Then Stop Endpoint is supposed to fail and fail it does. The
> workaround code doesn't know that it happened and retries infinitely.
> 
> I have never seen it in normal use, but I devised a reliable repro.
> The effect isn't pretty - no URBs can be cancelled, device gets stuck,
> if unplugged it locks up connections/disconnections on the whole bus.
> 
> With some experimentation I found that the bug is a variant of the old
> "stop after restart" issue - the doorbell ring is internally reordered
> after the subsequent command. By busy-waiting I confirmed that EP state
> which is initially seen as Stopped becomes Running some time later.
> 

Seems host controllers aren't designed to stop, move dequeue, and restart
an endpoint in quick succession.

In addition to fixing this NEC case we could think about avoiding these
cases, some could be avoided by adding a new ".flush_endpoint()" callback to
the USB host side API. Usb core itself has a usb_hcd_flush_endpoint() function
that calls .urb_dequeue() in a loop for each queued URB, causing host to
issue the stop, move deq and ring doorbell for every URB.

If usbcore knows all URBs will be cancelled it could let host do it in one go.
i.e. stop endpoint once.

Thanks
Mathias
Alan Stern Oct. 15, 2024, 2:51 p.m. UTC | #2
On Tue, Oct 15, 2024 at 03:23:23PM +0300, Mathias Nyman wrote:
> In addition to fixing this NEC case we could think about avoiding these
> cases, some could be avoided by adding a new ".flush_endpoint()" callback to
> the USB host side API. Usb core itself has a usb_hcd_flush_endpoint() function
> that calls .urb_dequeue() in a loop for each queued URB, causing host to
> issue the stop, move deq and ring doorbell for every URB.
> 
> If usbcore knows all URBs will be cancelled it could let host do it in one go.
> i.e. stop endpoint once.

Indeed, this makes a lot of sense, and I have long thought that the
API should have been designed this way from the beginning.  At least
for non-Control transfers, unlinking a single URB somewhere inside a
sequence of URBs seems pointless.  I doubt that it ever happens in the
kernel.

(On the other hand, it _is_ reasonable to do this for Control
transfers, because they can come from several different sources, not
just from the device's driver.  The source for a Control URB might
want to unlink it while not affecting the URBs from other sources.)

Furthermore, I suspect this is what Windows does and what the USBIF
originally had in mind for URB management.  (It's harder to tell what
they thought about Control transfers, though.)

Alan Stern
Michał Pecio Oct. 16, 2024, 5:47 a.m. UTC | #3
> > With some experimentation I found that the bug is a variant of the
> > old "stop after restart" issue - the doorbell ring is internally
> > reordered after the subsequent command. By busy-waiting I confirmed
> > that EP state which is initially seen as Stopped becomes Running
> > some time later. 
> 
> Seems host controllers aren't designed to stop, move dequeue, and
> restart an endpoint in quick succession.

As it was you who added the Running case handling, do you know hardware
other than NEC which triggers this? Or could it be just a single vendor
who screwed up once 15 years ago and caused all the chaos?

NEC sometimes triggers the Running case too and it is obvious why. I'm
not sure how I missed it back in January and assumed it's some sort of
random failure for no reason.

BTW, the NEC problem appears to be limited to periodic endpoints. I am
unable to reproduce it on bulk. I thought that I reproduced it on bulk
back then, but on second thought it may have been interrupt, which that
device also has. Unfortunatel I wasn't printing endpoint numbers then.

Regards,
Michal