[0/2] xhci: Fix the NEC stop bug workaround

Message-ID: <20241025121806.628e32c0@foxbook>
Date: Fri, 25 Oct 2024 12:18:06 +0200
From: Michal Pecio <michal.pecio@gmail.com>
To: Mathias Nyman <mathias.nyman@intel.com>, linux-usb@vger.kernel.org
Subject: [PATCH 0/2] xhci: Fix the NEC stop bug workaround

Series:
[0/2] xhci: Fix the NEC stop bug workaround
[v2,1/2] usb: xhci: Fix the NEC stop bug workaround
[v2,2/2,RFC] usb: xhci: Don't queue redundant Stop Endpoint commands

Michał Pecio Oct. 25, 2024, 10:18 a.m. UTC

Hi,


This is the promised v2 of this bugfix. It took longer than expected
because I got sidetracked by two (related) issues:

1. looking for similar bugs in other chips
2. simplifying this to avoid adding the STOP_CMD_REDUNDANT flag

Changes in v2:

1. Dropped the warning patch, because dealing with other chips is a
   separate issue from fixing the bug in existing code.
2. Added CC:stable to make the patch bot happy.
3. Some comment updates/clarifications to address questions asked by
   reviewers. Comments made vendor-agnostic, no longer mention NEC in
   preparation for other buggy vendors.
4. Added an RFC patch to simplify things and avoid queuing redundant
   commands instead of trying to handle them. Still a little dodgy in
   one place, problem described in a C99 comment. But it works for me.

The simplification is a separate patch because that's how the code
evolved and because it could enable the more straightforward and lower
risk patch 1/2 to be used in stable without patch 2/2, if desired.

Or otherwise, I could squash the patches together, of course.


Regarding other chips, the following was found:
1. NEC discovered this bug and fixed it in a silicon or FW revision.
   Some chips have the bug, but I have one which doesn't.
2. I couldn't reproduce this bug on VIA VL805 and Etron EJ168A.
3. Two ASMedia chips tested, both have the bug. ASM3142 aka ASM2142
   is quite subtle, because it only seems to happen when multiple EPs
   are used at the same time. I suspect it's a matter of the command
   ring fetching commands asynchronously before we ring the command
   doorbell, or simply increased xHC load triggers some internal bug.

ASMedia presents an additional challenge because it sometimes gets
stuck: Stop Endpoint fails in Stopped state even though our ep_state
says it should be running, and it never starts. I need to investigate
what exactly goes wrong and if our ep_state is bad or the chip.

This is dangerous, because the naive workaround would simply retry
the command forever. I suppose it may be a very good idea to add some
timeout. Say, if 100ms passes and the commands are still failing, just
assume that it is stopped for good and go ahead.
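
As an illustration only, the timeout idea can be modelled in plain C outside the kernel. Everything named here (`struct retry_model`, `stop_ep_should_retry()`, the millisecond clock argument) is hypothetical and invented for this sketch; none of it comes from xhci-ring.c. The 100ms figure is the one suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Figure proposed in the mail, not a kernel constant. */
#define STOP_RETRY_TIMEOUT_MS 100

/* Hypothetical per-endpoint bookkeeping, for this sketch only. */
struct retry_model {
	bool retrying;		/* a failed Stop Endpoint is being retried */
	uint64_t first_fail_ms;	/* time of the first failure in this streak */
};

/*
 * Called when Stop Endpoint fails while the endpoint reports Stopped.
 * Returns true if the command should be queued again, false if the
 * caller should assume the endpoint is stopped for good and go ahead.
 */
bool stop_ep_should_retry(struct retry_model *ep, uint64_t now_ms)
{
	assert(ep != NULL);

	if (!ep->retrying) {
		/* first failure in this streak: start the clock */
		ep->retrying = true;
		ep->first_fail_ms = now_ms;
		return true;
	}

	if (now_ms - ep->first_fail_ms >= STOP_RETRY_TIMEOUT_MS) {
		/* deadline passed: give up, reset for the next cancellation */
		ep->retrying = false;
		return false;
	}

	return true;
}
```

A real implementation would also have to clear the retry state when a Stop Endpoint command finally succeeds; the sketch leaves that out.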


Regards,
Michal

Michał Pecio Oct. 28, 2024, 7:33 a.m. UTC | #1

Hi Mathias,

I would be grateful if you could take a look at patch 2/2 and tell if
there is anything obviously wrong with this approach. As far as I see,
it should be OK to just call those invalidation and giveback functions
directly from xhci_urb_dequeue(), and it works for me in practice.

Regarding the problem with xhci_invalidate_cancelled_tds() being called
while Set TR Dequeue is already pending, I found that it is much easier
to handle it by looking at SET_DEQ_PENDING and simply setting all TDs
to TD_CLEARING_CACHE_DEFERRED if that's the case. So this is solved.


However, these patches still don't solve the issue of infinite retries
completely. There is one more annoying case caused by halts. It is very
poorly defined what happens when a halted EP is hard-reset. Usually Set
TR Dequeue executes afterwards and it restarts the EP when done. But if
it doesn't, the EP stays stopped until a new URB is submitted, if ever.

There is the EP_HARD_CLEAR_TOGGLE flag which is set until the class
driver calls usb_clear_halt(), but it's neither the case that the EP is
guaranteed to be stopped until usb_clear_halt() is called nor that it
is guaranteed to restart afterwards.

Actually, I think it might be a bug? What if Set TR Dequeue restarts an
EP before the class driver clears the device side of the halt?


I'm starting to think that it may not be realistic to quickly solve all
those (and possibly other not known yet) problems now. Maybe just slap
a 100ms timeout on those retries, add quirks for ASMedia/Intel and call
it a day for now?

Regards,
Michal

Mathias Nyman Oct. 28, 2024, 9:54 a.m. UTC | #2

On 28.10.2024 9.33, Michal Pecio wrote:
> Hi Mathias,
> 
> I would be grateful if you could take a look at patch 2/2 and tell if
> there is anything obviously wrong with this approach. As far as I see,
> it should be OK to just call those invalidation and giveback functions
> directly from xhci_urb_dequeue(), and it works for me in practice.

Adding an EP_HALTED case to xhci_urb_dequeue() should work fine; we
will both invalidate and give back the invalidated TDs in the completion
handler.

> 
> Regarding the problem with xhci_invalidate_cancelled_tds() being called
> while Set TR Dequeue is already pending, I found that it is much easier
> to handle it by looking at SET_DEQ_PENDING and simply setting all TDs
> to TD_CLEARING_CACHE_DEFERRED if that's the case. So this is solved.
>

The SET_DEQ_PENDING case is trickier. We would read the dequeue pointer
from hardware while we know hardware is processing a command to move the
dequeue pointer. The result may be the old dequeue, the new dequeue, or
possibly something unknown.

We are making an already difficult issue even more complex.

> 
> However, these patches still don't solve the issue of infinite retries
> completely. There is one more annoying case caused by halts. It is very
> poorly defined what happens when a halted EP is hard-reset. Usually Set
> TR Dequeue executes afterwards and it restarts the EP when done. But if
> it doesn't, the EP stays stopped until a new URB is submitted, if ever.
> 
> There is the EP_HARD_CLEAR_TOGGLE flag which is set until the class
> driver calls usb_clear_halt(), but it's neither the case that the EP is
> guaranteed to be stopped until usb_clear_halt() is called nor that it
> is guaranteed to restart afterwards.
> 
> Actually, I think it might be a bug? What if Set TR Dequeue restarts an
> EP before the class driver clears the device side of the halt?

Ok, I need to take some time to dig into this.

> 
> 
> I'm starting to think that it may not be realistic to quickly solve all
> those (and possibly other not known yet) problems now. Maybe just slap
> a 100ms timeout on those retries, add quirks for ASMedia/Intel and call
> it a day for now?

There are some mitigations that could be done.

Many of these issues are related to a slow endpoint start, which causes
the next stop endpoint command to complete with a context error because
the endpoint is still stopped.

We could ring the doorbell before giving back the invalidated tds in
xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq().
This gives hardware a bit more time to start the endpoint.

We could also track pending ring starts.
Set an EP_START_PENDING flag when the doorbell is rung on a stopped endpoint.
Clear the flag when the first transfer event is received on that endpoint.

If a stop endpoint command fails with a context error because the endpoint
is still stopped, queue a new stop endpoint command, but only if the flag was set:

xhci_handle_cmd_stop_ep()
	if (comp_code == COMP_CONTEXT_STATE_ERROR) {
		switch (GET_EP_CTX_STATE(ep_ctx)) {
		case EP_STATE_STOPPED:
			/* retry only if the flag was set */
			if (!(ep->ep_state & EP_START_PENDING))
				break;
			ep->ep_state &= ~EP_START_PENDING;
			xhci_queue_stop_endpoint();
			break;
		}
	}

Thanks
Mathias

Michał Pecio Oct. 29, 2024, 8:28 a.m. UTC | #3

On Mon, 28 Oct 2024 11:54:39 +0200, Mathias Nyman wrote:
> The SET_DEQ_PENDING case is trickier. We would read the dequeue
> pointer from hardware while we know hardware is processing a command
> to move the dequeue pointer. The result may be the old dequeue, the
> new dequeue, or possibly something unknown.

Damn, I looked at various things but I haven't thought about it. Yes,
it's dodgy and not really a great idea. Although I suspect it wouldn't
be *very* harmful, because the truly bad case (failing to queue a Set
TR Deq when it's necessary) is triggered by Set TR Deq already pending
on the same stream, so the stream's cache will be cleared anyway.

But it could easily queue a bunch of pointless commands, for example.

By the way, I think this race is already possible today, without my
patches. There is no testing for SET_DEQ_PENDING in xhci_urb_dequeue()
and no testing in handle_cmd_stop_ep(). If this happens in the middle
of a Set TR Deq chain on a streams endpoint, nothing seems to stop the
Stop EP handler from attempting invalidation under SET_DEQ_PENDING.

Maybe invalidate_cancelled_tds() should bail out if SET_DEQ_PENDING
and later Set Deq completion handler should unconditionally call the
invalidate/giveback combo before it exits.
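
A minimal userspace model of that idea, under loud assumptions: `struct deq_model`, the `model_*` functions and the flag value are hypothetical, and invalidation and giveback are collapsed into a single counter update. The real invalidation step may also queue a new Set TR Deq command, which this sketch does not model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag bit for this model only; the value is arbitrary. */
#define MODEL_SET_DEQ_PENDING (1u << 0)

struct deq_model {
	unsigned int ep_state;
	int cancelled_tds;	/* TDs waiting for invalidation */
	int given_back;		/* TDs handed back so far */
};

/* Bails out without touching anything while Set TR Deq is in flight. */
bool model_invalidate_cancelled_tds(struct deq_model *ep)
{
	assert(ep != NULL);

	if (ep->ep_state & MODEL_SET_DEQ_PENDING)
		return false;	/* dequeue pointer is in flux, try later */

	/* invalidate + giveback, collapsed into one step for the model */
	ep->given_back += ep->cancelled_tds;
	ep->cancelled_tds = 0;
	return true;
}

/* Set TR Deq completion: clear the flag, then always run the combo. */
void model_handle_cmd_set_deq(struct deq_model *ep)
{
	assert(ep != NULL);

	ep->ep_state &= ~MODEL_SET_DEQ_PENDING;
	model_invalidate_cancelled_tds(ep);
}
```

The point of the design is that a cancellation arriving mid-command is never lost: it is simply picked up when the pending command completes.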

> We could ring the doorbell before giving back the invalidated tds in
> xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq().
> This gives hardware a bit more time to start the endpoint.

This unfortunately doesn't buy much time, because giveback is a very
cheap operation - it just adds the URBs to a queue and wakes a worker
which runs all those callbacks. It finishes under 1us on my system.

> We could also track pending ring starts.
> Set an EP_START_PENDING flag when the doorbell is rung on a stopped
> endpoint. Clear the flag when the first transfer event is received on
> that endpoint.

Yes, that was the second thing I tried. But I abandoned it:

Problem 1: URBs on "idle" devices are cancelled before generating
any event, so we need to clear the flag on Stop EP and Reset EP.
Not all of them use the default completion handler. Maybe it could
be handled reliably by tapping into handle_cmd_completion(). But...

Problem 2: hardware bugs. On ASMedia 3142, I can trigger (from
userspace) a condition when EP 0 doorbell is rung (even twice)
and its state is still Stopped a few seconds later, and remains
so after repeated Stop EP / doorbell rings.

It looks like a timeout is the only way to be really sure.

Regards,
Michal

Mathias Nyman Oct. 29, 2024, 9:16 a.m. UTC | #4

On 29.10.2024 10.28, Michał Pecio wrote:
> 
> By the way, I think this race is already possible today, without my
> patches. There is no testing for SET_DEQ_PENDING in xhci_urb_dequeue()
> and no testing in handle_cmd_stop_ep(). If this happens in the middle
> of a Set TR Deq chain on a streams endpoint, nothing seems to stop the
> Stop EP handler from attempting invalidation under SET_DEQ_PENDING.
> 
> Maybe invalidate_cancelled_tds() should bail out if SET_DEQ_PENDING
> and later Set Deq completion handler should unconditionally call the
> invalidate/giveback combo before it exits.
> 

I think you are on to something.
If we add invalidate/giveback to the Set TR Deq completion handler, allowing
it to possibly queue new Set TR Deq commands, then we can bail out in
xhci_urb_dequeue() if SET_DEQ_PENDING is set.

xhci_urb_dequeue() would not queue an extra stop endpoint command, only
set td->cancel_status to TD_DIRTY, and the Set TR Deq handler would
not ring the doorbell unnecessarily.

Sounds like a plan to me.

>> We could ring the doorbell before giving back the invalidated tds in
>> xhci_handle_cmd_stop_ep(), and possibly xhci_handle_cmd_set_deq().
>> This gives hardware a bit more time to start the endpoint.
> 
> This unfortunately doesn't buy much time, because giveback is a very
> cheap operation - it just adds the URBs to a queue and wakes a worker
> which runs all those callbacks. It finishes under 1us on my system.

Probably true, but change is simple and free so might be worth it.

Thanks
Mathias

Mathias Nyman Oct. 30, 2024, 8:29 a.m. UTC | #5

On 29.10.2024 11.16, Mathias Nyman wrote:
> On 29.10.2024 10.28, Michał Pecio wrote:
>>
>> By the way, I think this race is already possible today, without my
>> patches. There is no testing for SET_DEQ_PENDING in xhci_urb_dequeue()
>> and no testing in handle_cmd_stop_ep(). If this happens in the middle
>> of a Set TR Deq chain on a streams endpoint, nothing seems to stop the
>> Stop EP handler from attempting invalidation under SET_DEQ_PENDING.
>>
>> Maybe invalidate_cancelled_tds() should bail out if SET_DEQ_PENDING
>> and later Set Deq completion handler should unconditionally call the
>> invalidate/giveback combo before it exits.
>>
> 
> I think you are on to something.
> If we add invalidate/giveback to the Set TR Deq completion handler, allowing
> it to possibly queue new Set TR Deq commands, then we can bail out in
> xhci_urb_dequeue() if SET_DEQ_PENDING is set.
> 
> xhci_urb_dequeue() would not queue an extra stop endpoint command, only
> set td->cancel_status to TD_DIRTY, and the Set TR Deq handler would
> not ring the doorbell unnecessarily.
> 
> Sounds like a plan to me.

I wrote a test series for this.

1st patch avoids stopping endpoint in urb cancel if Set TR Deq is pending
2nd patch handles Set TR Deq command ctx error due to running ep.
3rd patch tracks doorbell ring with a flag. It's for now only used to prevent
     infinite stop ep retries. Flag is not properly cleared in other cases.

Series can be found in my tree in a fix_stop_ep_race branch:

https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_stop_ep_race
git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git fix_stop_ep_race branch

Do these help in your NEC host case?

I'll see if I can set up some system to trigger this myself

Thanks
Mathias

Michał Pecio Oct. 31, 2024, 8:13 a.m. UTC | #6

Hi,

I will send out v3 of my own patches soon.

I also plan to research a radically different solution, which is simply
to prevent failed Stop Endpoint in the first place. General idea:

1. If commands pending, "outsource" the work to their handlers.
2. If EP stopped for other reason, invalidate/giveback immediately.
3. If Context State != Stopped, queue Stop Endpoint.
4. If Context State == Stopped but shouldn't, set up a delayed work.
5. The work looks at Context State and goes back to step 3.
6. The work may choose to give up retrying after some time.
7. The work could even act as watchdog and call hc_died() if Set Deq
   or Reset EP get stuck in retry or abort loops (not seen so far).
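
Steps 1 to 6 above can be read as two small decision functions. The following is a hedged sketch in plain C, not driver code: the enums, `struct cancel_input`, `decide_cancel_action()` and `delayed_work_step()` are names made up for illustration, and the watchdog of step 7 is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum cancel_action {
	DEFER_TO_CMD_HANDLER,	/* step 1 */
	GIVEBACK_NOW,		/* step 2 */
	QUEUE_STOP_EP,		/* step 3 */
	SCHEDULE_DELAYED_WORK,	/* step 4 */
};

/* Hypothetical snapshot of what the cancel path can observe. */
struct cancel_input {
	bool cmd_pending;	/* a command on this EP is already in flight */
	bool stopped_other;	/* driver state: EP stopped for another reason */
	bool ctx_stopped;	/* Context State == Stopped */
};

/* Steps 1-4: what to do when an URB is cancelled. */
enum cancel_action decide_cancel_action(const struct cancel_input *in)
{
	assert(in != NULL);

	if (in->cmd_pending)
		return DEFER_TO_CMD_HANDLER;
	if (in->stopped_other)
		return GIVEBACK_NOW;
	if (!in->ctx_stopped)
		return QUEUE_STOP_EP;
	/* Stopped although the driver expects it to run: wait and look again */
	return SCHEDULE_DELAYED_WORK;
}

enum work_action { WORK_QUEUE_STOP_EP, WORK_RESCHEDULE, WORK_GIVE_UP };

/* Steps 5-6: one run of the delayed work. */
enum work_action delayed_work_step(bool ctx_stopped, unsigned int elapsed_ms,
				   unsigned int limit_ms)
{
	if (!ctx_stopped)
		return WORK_QUEUE_STOP_EP;	/* back to step 3 */
	if (elapsed_ms >= limit_ms)
		return WORK_GIVE_UP;		/* step 6 */
	return WORK_RESCHEDULE;			/* still Stopped, try later */
}
```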

On Wed, 30 Oct 2024 10:29:12 +0200, Mathias Nyman wrote:
> 1st patch avoids stopping endpoint in urb cancel if Set TR Deq is
> pending
> 2nd patch handles Set TR Deq command ctx error due to running ep.
> 3rd patch tracks doorbell ring with a flag. It's for now only
> used to prevent infinite stop ep retries. Flag is not properly
> cleared in other cases.

Quick comments:

1. To be specific, my suggestion was to make the function work even
   if called under SET_DEQ_PENDING rather than try to avoid this. It
   ends up simpler IMO and solves any risk of calls from other places.

   Then the change in xhci_urb_dequeue() becomes an optimization only,
   which is not required for correct operation, may be combined with
   other similar optimizations, or even reverted if necessary without
   breaking multiple stream cancellations again.

2. Mixed feelings. Adds complexity, obviously. Shouldn't be necessary
   if the retries prove to work on other chips (I have never had a Set
   TR Deq error on NEC with the workaround). Could help otherwise.

3. No need to pollute handle_tx_event() at all, because the flag only
   needs to be guaranteed false when the EP is Stopped, and this can
   only happen by *successful* execution of Stop EP or Reset EP. Easy
   to detect this in handle_cmd_completion().

   But I'm not sure if a new flag is needed at all. Its value will
   simply be negation of the condition in xhci_ring_ep_doorbell(),
   except for cases when we "forget" to ring EP doorbell, which are
   probably bugs that should be fixed.
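
To make the "negation of the doorbell condition" remark concrete, here is a small sketch. It assumes the doorbell ring is suppressed whenever any of a set of blocking ep_state flags is set; the `MODEL_*` names, their values and the exact set of blocking flags are illustrative assumptions, not a copy of xhci_ring_ep_doorbell().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag bits for this model only; names and values are illustrative. */
#define MODEL_EP_HALTED			(1u << 0)
#define MODEL_EP_STOP_CMD_PENDING	(1u << 1)
#define MODEL_SET_DEQ_PENDING		(1u << 2)

#define MODEL_DOORBELL_BLOCKERS \
	(MODEL_EP_HALTED | MODEL_EP_STOP_CMD_PENDING | MODEL_SET_DEQ_PENDING)

struct doorbell_model {
	unsigned int ep_state;
	int pending_urbs;	/* URBs queued on the endpoint */
};

/* True if a doorbell ring would be suppressed in this state. */
bool model_doorbell_blocked(const struct doorbell_model *ep)
{
	assert(ep != NULL);
	return (ep->ep_state & MODEL_DOORBELL_BLOCKERS) != 0;
}

/*
 * Derived "start pending" value: no stored flag, just the negation of
 * the doorbell condition, meaningful only while URBs are queued.
 */
bool model_start_pending(const struct doorbell_model *ep)
{
	assert(ep != NULL);
	return ep->pending_urbs > 0 && !model_doorbell_blocked(ep);
}
```

The derived value is only as good as the invariant behind it: it goes wrong exactly in the cases where the driver "forgets" to ring the doorbell after clearing a blocking flag.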

Bugs need to be handled some other way, because hardware has them too
and it's impossible to predict when they bite. See below.

> Series can be found in my tree in a fix_stop_ep_race branch:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=fix_stop_ep_race
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
> fix_stop_ep_race branch
> 
> Do these help in your NEC host case?

This looks like it should work on semi-well-behaved HC like NEC, but it
doesn't account for hardware not restarting an EP "just because".

while true; do ifconfig eth0 up; ifconfig eth0 down; done

locks up on ASM3142 with AX88179 adapter as expected, and when the NIC
is disconnected I get those 'ep ctx error, ep still running' every few
seconds. It seems lots of network code got locked up and I can't even
ssh into the box anymore to copy exact dmesg output.

> I'll see if I can set up some system to trigger this myself

Good idea, lots of fun ;)

Michal

Michał Pecio Oct. 31, 2024, 10:49 a.m. UTC | #7

On Thu, 31 Oct 2024 09:13:47 +0100, Michał Pecio wrote:
> This looks like it should work on semi-well-behaved HC like NEC, but
> it doesn't account for hardware not restarting an EP "just because".
> 
> 
> while true; do ifconfig eth0 up; ifconfig eth0 down; done
> 
> locks up on ASM3142 with AX88179 adapter as expected, and when the NIC
> is disconnected I get those 'ep ctx error, ep still running' every few
> seconds. It seems lots of network code got locked up and I can't even
> ssh into the box anymore to copy exact dmesg output.

I apologize for rushed testing and providing bad information.

Your patch doesn't trigger infinite retries because it gives up after
just one retry. The infinite retries every few seconds I observed were
all caused by separate cancellations. The class driver timed out on its
control transfer, cancelled one URB, tried another one, timed out, ...

It takes a few minutes before it gives up, and only if I kill the
ifconfig loop, otherwise it seems to be forever.

Your patch prints one dev_dbg() each time, mine spams many of them for
100ms each time. I will remove this one retry limit from your patch to
see if it starts spinning infinitely, but I strongly suspect it will.


One retry is not enough. This is what I got on the first try with a
random UVC webcam:

[ 7297.326596] usb 10-2: new high-speed USB device number 2 using xhci_hcd
[ 7297.465252] usb 10-2: New USB device found, idVendor=1e4e, idProduct=0103, bcdDevice= 0.02
[ 7297.465259] usb 10-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 7297.465261] usb 10-2: Product: USB2.0 Camera
[ 7297.465263] usb 10-2: Manufacturer: Etron Technology, Inc.
[ 7297.468898] usb 10-2: Found UVC 1.00 device USB2.0 Camera (1e4e:0103)
[ 7297.475995] usb 10-2: UVC non compliance - GET_DEF(PROBE) not supported. Enabling workaround.
[ 7297.476928] input: USB2.0 Camera: USB2.0 Camera as /devices/pci0000:00/0000:00:05.0/0000:03:00.0/usb10/10-2/10-2:1.0/input/input25
[ 7297.492153] usb 10-2: Warning! Unlikely big volume range (=6144), cval->res is probably wrong.
[ 7297.492159] usb 10-2: [3] FU [Mic Capture Volume] ch = 1, val = 5120/11264/1
[ 7299.301892] usb 10-2: Device requested 3060 B/frame bandwidth
[ 7299.301907] usb 10-2: Selecting alternate setting 12 (3060 B/frame bandwidth)
[ 7299.304772] usb 10-2: Allocated 5 URB buffers of 32x3060 bytes each
[ 7300.339246] xhci_hcd 0000:03:00.0: Stop ep ctx error, already stopped with pending start
[ 7300.339252] xhci_hcd 0000:03:00.0: Stop ep completion ctx error, ep is running
[ 7300.339283] xhci_hcd 0000:03:00.0: Context Error unhandled, ctx_state 3
[ 7300.339324] xhci_hcd 0000:03:00.0: Stop ep ctx error, already stopped with pending start
[ 7300.339326] xhci_hcd 0000:03:00.0: Stop ep completion ctx error, ep is running
[ 7300.339365] xhci_hcd 0000:03:00.0: Stop ep completion ctx error, ep is running
[ 7300.339492] xhci_hcd 0000:03:00.0: Stop ep ctx error, already stopped with pending start
[ 7300.339494] xhci_hcd 0000:03:00.0: Stop ep completion ctx error, ep is running
[ 7300.339533] xhci_hcd 0000:03:00.0: Context Error unhandled, ctx_state 3
[ 7300.339593] xhci_hcd 0000:03:00.0: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
[ 7300.339594] xhci_hcd 0000:03:00.0: ep deq seg = ffff88810b006000, deq ptr = ffff88810ae0e780
[ 7300.339634] xhci_hcd 0000:03:00.0: Stop ep completion ctx error, ep is running

Not sure what caused this mismatch, it might be a new bug of yours
because I don't recall seeing such things before.

This was with your three patches on v6.12-rc4 plus the following:

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 7d036fda354c..7325729beac8 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1182,6 +1182,8 @@ static void xhci_handle_cmd_stop_ep(struct xhci_hcd *xhci, int slot_id,
 		default:
 			break;
 		}
+		xhci_err(xhci, "Context Error unhandled, ctx_state %d\n",
+				GET_EP_CTX_STATE(ep_ctx));
 	}
 
 	/* will queue a set TR deq if stopped on a cancelled, uncleared TD */

Michał Pecio Oct. 31, 2024, 11:17 a.m. UTC | #8

Update:

> Your patch prints one dev_dbg() each time, mine spams many of them for
> 100ms each time. I will remove this one retry limit from your patch to
> see if it starts spinning infinitely, but I strongly suspect it will.

Yes, that's exactly what happens.

This time I have killed the ifconfig loop, unplugged the NIC and
started 'rmmod xhci_pci', which is still hanging 10 minutes later.

So business as usual when these things go wrong.

> One retry is not enough. This is what I got on the first try with a
> random UVC webcam:
> [...]

The Set TR Deq mismatch errors went away when I fixed your patch not
to give up on first try. Maybe I remembered wrong and they have always
been around.

Regards,
Michal

Mathias Nyman Oct. 31, 2024, 2:22 p.m. UTC | #9

On 31.10.2024 13.17, Michał Pecio wrote:
> Update:
> 
>> Your patch prints one dev_dbg() each time, mine spams many of them for
>> 100ms each time. I will remove this one retry limit from your patch to
>> see if it starts spinning infinitely, but I strongly suspect it will.
> 
> Yes, that's exactly what happens.
> 
> This time I have killed the ifconfig loop, unplugged the NIC and
> started 'rmmod xhci_pci', which is still hanging 10 minutes later.
> 
> So business as usual when these things go wrong.
> 
>> One retry is not enough. This is what I got on the first try with a
>> random UVC webcam:
>> [...]

Ok, good to know, then using a flag is not enough.

Using a retry timeout for failed stop endpoint commands also sounds good
to me.
It has a slight downside of a possible 100ms aggressive 'Stop Endpoint'
retry loop in cases where endpoint was stopped earlier for some other reason.

Not sure if that is a problem. If it is, then we can add the flag and only
retry for 100ms if the flag is set (only clear the flag in handle_tx_event()).

Another reason for the flag is the additional note in xhci 4.8.3 [1], we might
need to track the state better in software.

[1] xhci 4.8.3 Endpoint Context state

"There are several cases where the EP State field in the Output Endpoint Context
may not reflect the current state of an endpoint. The xHC should attempt to
keep EP State as current as possible, however it may defer these updates to
perform higher priority references to memory, e.g. Isoch data transfers, etc.
Software should maintain an internal variable that tracks the state of an
endpoint and not depend on EP State to represent the instantaneous state of
an endpoint.
For example, when a Command that affects EP State is issued, the value of EP
State may be updated anytime between when software rings the Command
Ring doorbell for a command and when the associated Command Completion
Event is placed on the Event Ring by the xHC. The update of EP State may also
be delayed relative to a Doorbell ring or error condition (e.g. TRB Error, STALL,
or USB Transaction Error) that causes an EP State change not generated by a
command.

Software should maintain an accurate value for EP State, by tracking it with an
internal variable that is driven by Events and Doorbell accesses associated with
an endpoint using the following method:

• When a command is issued to an endpoint that affects its state, software
should use the Command Completion Event to update its image of EP State
to the appropriate state.
• When a Transfer Event reports a TRB Error, software should update its image
of EP State to Error.
• When a Transfer Event reports a Stall Error or USB Transaction Error,
software should update its image of EP State to Halted.
• When software rings the Doorbell of an endpoint to transition it from the
Stopped to Running state, it should update its image of EP State to Running."
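
The tracking method quoted above amounts to a small state machine. The sketch below models only the spec text; `enum ep_image`, `enum ep_input` and `ep_image_update()` are hypothetical names, only two commands are covered, and nothing here claims to describe how real controllers behave.

```c
#include <assert.h>
#include <stddef.h>

/* Software image of EP State (subset used by the quoted rules). */
enum ep_image {
	EP_IMG_RUNNING,
	EP_IMG_HALTED,
	EP_IMG_STOPPED,
	EP_IMG_ERROR,
};

enum ep_input {
	IN_STOP_EP_CMD_SUCCESS,		/* Command Completion Event */
	IN_RESET_EP_CMD_SUCCESS,	/* Command Completion Event */
	IN_TRB_ERROR,			/* Transfer Event */
	IN_STALL_OR_TRANSACTION_ERROR,	/* Transfer Event */
	IN_DOORBELL_RING,		/* software rang the EP doorbell */
};

/* Apply one event or doorbell access to the software image. */
void ep_image_update(enum ep_image *img, enum ep_input in)
{
	assert(img != NULL);

	switch (in) {
	case IN_STOP_EP_CMD_SUCCESS:
	case IN_RESET_EP_CMD_SUCCESS:
		/* both commands leave the endpoint Stopped on success */
		*img = EP_IMG_STOPPED;
		break;
	case IN_TRB_ERROR:
		*img = EP_IMG_ERROR;
		break;
	case IN_STALL_OR_TRANSACTION_ERROR:
		*img = EP_IMG_HALTED;
		break;
	case IN_DOORBELL_RING:
		/* the quoted rule covers only the Stopped -> Running case */
		if (*img == EP_IMG_STOPPED)
			*img = EP_IMG_RUNNING;
		break;
	}
}
```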

Thanks
-Mathias

Michał Pecio Nov. 1, 2024, 9:10 a.m. UTC | #10

On Thu, 31 Oct 2024 16:22:14 +0200, Mathias Nyman wrote:
> Ok, good to know, then using a flag is not enough.
> 
> Using a retry timeout for failed stop endpoint commands also sounds
> good to me.
> It has a slight downside of a possible 100ms aggressive 'Stop
> Endpoint' retry loop in cases where endpoint was stopped earlier for
> some other reason.

Waiting 100ms is rare. It happens in cases when the original bad
workaround would retry infinitely, and this bug still hasn't been
reported by users for half a year. But I triggered it with patched
drivers or by stopping a device with flaky cable, so it can happen.

EP_RESTARTING flag could be used to avoid wasting 100ms in those
cases, but I found that I can predict its value in advance while
queuing the command, without adding noise to unrelated code. That
was the STOP_CMD_REDUNDANT patch, and now I'm trying to just avoid
queuing such redundant commands at all. And timeout if that fails.

I found this "redundant" prediction very accurate and pointless
commands are practically eliminated. Correctness is guaranteed by
ring_ep_doorbell() always starting an endpoint with pending URBs
if a few ep_state flags aren't set, and always being called when
any of those flags are cleared. Only two exceptions are known:

1. The "transferless tx events", which may trigger a hard reset
   without Set Deq. In this case ring_ep_doorbell() isn't called.
2. The bizarre ASM3142 ifconfig up/down issue, which crashes the
   whole bus and really looks like a hardware bug.

Generally, all cases of failing to restart an active endpoint are
very user-visible and problematic in their own right, so a bit of
extra delay wouldn't be the worst problem at this point, I hope.

> • When software rings the Doorbell of an endpoint to transition it
> from the Stopped to Running state, it should update its image of EP
> State to Running.

Making this assumption is exactly why we have problems, because the
start/stop race is tricky for hardware and some chips clearly don't
behave in such a simple manner as suggested.

Regards,
Michal
