mbox series

[0/9] Various xhci fixes and improvements

Message ID 20240827194625.61be5733@foxbook (mailing list archive)
Headers show
Series Various xhci fixes and improvements | expand

Message

Michał Pecio Aug. 27, 2024, 5:46 p.m. UTC
Hi,

Here's a handful of bug fixes, cleanups and improvements for the xHCI
driver.

The first is a trivial fix, the second also fixes a bug albeit a less
trivial one. These two maybe deserve to go into usb-linus, the latter
perhaps to stable as well, but the patch doesn't apply as-is.

The rest are new functionality or code cleanups with low anticipated
user impact.

All patches were applied and tested on v6.11-rc5, all compiled cleanly
and worked without regressions with some HID, storage, audio, video.
Functional changes received additional functional testing described in
their respective changelogs.


Notably missing is a solution to the "Underrun after Missed Service"
problem. To recap, Underrun/Overrun typically has a NULL TRB pointer
and ep_ring->td_list contains some missed TDs and possibly a few that
have been added after the underrun occured but before we got the IRQ.
No way to tell them apart, at least by the spec, as far as I see.

On USB 3.1+ hosts we can get out of it with "expedited skipping" - it
ensures that the ring is never left with stale TDs in the first place.

On USB 3.0 hosts the underrun handler *will* sometimes face skipping
and it needs to make a decision.

Currently, it skips everything. This may cause DMA-after-free - not
great on IN endpoints - and "TRB DMA ptr not part of current TD" or
"WARN Event TRB with no TDs queued" later.

The obvious alterantive is to never skip on empty ring events, but it
can deadlock a driver which waits for its URBs to be given back when
all of them were missed and we can't get a valid dequeue from the HC.


I wonder if it would make sense to always skip the first queued TD
when we get an MSE with NULL pointer? I think it's legal, and likely
sufficient to avoid the stupid deadlock.

Just a last minute idea. I will think about it, but now I'd like to
flush this patch queue before it grows to infinity ;)

Thanks,
Michal

Comments

fps Aug. 28, 2024, 10:21 a.m. UTC | #1
On 8/27/24 7:46 PM, Michal Pecio wrote:
> Hi,
>
> Here's a handful of bug fixes, cleanups and improvements for the xHCI
> driver.

[...]

> All patches were applied and tested on v6.11-rc5, all compiled cleanly
> and worked without regressions with some HID, storage, audio, video.
> Functional changes received additional functional testing described in
> their respective changelogs.

Hi!

I have a 6.11-rc5 tree here (tar.gz from kernel.org) and trying to apply this series I get:

$ for n in ~/Downloads/michal\ patches/20240827-\[PATCH\ *; do echo "$n"; patch --dry-run -p 1 < "$n"; done
/home/fps/Downloads/michal patches/20240827-[PATCH 1_9] usb_ xhci_ Fix double newline in a deb-148577.txt
checking file drivers/usb/host/xhci-ring.c
/home/fps/Downloads/michal patches/20240827-[PATCH 2_9] usb_ xhci_ Fix handling errors mid TD -148571.txt
checking file drivers/usb/host/xhci-ring.c
/home/fps/Downloads/michal patches/20240827-[PATCH 3_9] usb_ xhci_ Clean up the TD skipping lo-148572.txt
checking file drivers/usb/host/xhci-ring.c
Hunk #1 FAILED at 2829.
1 out of 1 hunk FAILED
/home/fps/Downloads/michal patches/20240827-[PATCH 4_9] usb_ xhci_ Expedite processing missed -148579.txt
checking file drivers/usb/host/xhci-ring.c
/home/fps/Downloads/michal patches/20240827-[PATCH 5_9] usb_ xhci_ Simplify error_mid_td marki-148573.txt
checking file drivers/usb/host/xhci-ring.c
Hunk #1 FAILED at 2410.
1 out of 1 hunk FAILED
/home/fps/Downloads/michal patches/20240827-[PATCH 6_9] usb_ xhci_ Sanity check _spurious succ-148574.txt
checking file drivers/usb/host/xhci-ring.c
Hunk #1 succeeded at 2778 (offset -7 lines).
Hunk #2 FAILED at 2879.
1 out of 2 hunks FAILED
/home/fps/Downloads/michal patches/20240827-[PATCH 7_9] usb_ xhci_ Don't lie about trb_comp_co-148575.txt
checking file drivers/usb/host/xhci-ring.c
Hunk #1 succeeded at 2610 (offset -3 lines).
Hunk #2 succeeded at 2643 (offset -3 lines).
Hunk #3 FAILED at 2898.
1 out of 3 hunks FAILED
/home/fps/Downloads/michal patches/20240827-[PATCH 8_9] usb_ xhci_ Print completion code in th-148576.txt
checking file drivers/usb/host/xhci-ring.c
Hunk #1 FAILED at 2789.
1 out of 1 hunk FAILED
/home/fps/Downloads/michal patches/20240827-[PATCH 9_9] usb_ xhci_ Clean up and rename bad str-148580.txt
checking file drivers/usb/host/xhci-ring.c
Hunk #1 succeeded at 2563 (offset -3 lines).
Hunk #2 succeeded at 2582 (offset -3 lines).
Hunk #3 succeeded at 2635 (offset -4 lines).

Are you sure these are against 6.11-rc5? Or did I screw up on my side?

Kind regards,
FPS
Mathias Nyman Aug. 28, 2024, 11:14 a.m. UTC | #2
On 27.8.2024 20.46, Michal Pecio wrote:
> Hi,
> 
> Here's a handful of bug fixes, cleanups and improvements for the xHCI
> driver.
> 
> The first is a trivial fix, the second also fixes a bug albeit a less
> trivial one. These two maybe deserve to go into usb-linus, the latter
> perhaps to stable as well, but the patch doesn't apply as-is.
> 
> The rest are new functionality or code cleanups with low anticipated
> user impact.
> 
> All patches were applied and tested on v6.11-rc5, all compiled cleanly
> and worked without regressions with some HID, storage, audio, video.
> Functional changes received additional functional testing described in
> their respective changelogs.
> 
> 
> Notably missing is a solution to the "Underrun after Missed Service"
> problem. To recap, Underrun/Overrun typically has a NULL TRB pointer
> and ep_ring->td_list contains some missed TDs and possibly a few that
> have been added after the underrun occured but before we got the IRQ.
> No way to tell them apart, at least by the spec, as far as I see.
> 
> On USB 3.1+ hosts we can get out of it with "expedited skipping" - it
> ensures that the ring is never left with stale TDs in the first place.
> 
> On USB 3.0 hosts the underrun handler *will* sometimes face skipping
> and it needs to make a decision.
> 
> Currently, it skips everything. This may cause DMA-after-free - not
> great on IN endpoints - and "TRB DMA ptr not part of current TD" or
> "WARN Event TRB with no TDs queued" later.
> 
> The obvious alterantive is to never skip on empty ring events, but it
> can deadlock a driver which waits for its URBs to be given back when
> all of them were missed and we can't get a valid dequeue from the HC.
> 
> 
> I wonder if it would make sense to always skip the first queued TD
> when we get an MSE with NULL pointer? I think it's legal, and likely
> sufficient to avoid the stupid deadlock.
> 
> Just a last minute idea. I will think about it, but now I'd like to
> flush this patch queue before it grows to infinity ;)
> 

Thanks Michal for this series.

I need to go through the patches individually and pick fixes.

We are in the middle of reworking transfer event handling with Niklas as well.

I'd be grateful if you could take a quick look at our solution and give
your opinion on it as you have expertise in this area.

Main idea is simple.
If a transfer event points to a TRB between dequeue and enqueue we give
back all queued TDs passed up to this TRB.
And if the event points to a valid TD then handle that. (normal case)

Code is much simpler, we get rid of skip flag, and could probably merge
error_mid_td and urb_length_set flags into one flag.

The code isn't complete or tested yet. It doesn't handle under/overruns,
but that could probably be fixed by just turning "return 0" for those
cases into "break"

Code on top of 6.11-rc4 can be found in my feature_transfer_event_rework branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git feature_transfer_event_rework
https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=feature_transfer_event_rework

Thanks
Mathias
Michał Pecio Aug. 28, 2024, 1:24 p.m. UTC | #3
> Are you sure these are against 6.11-rc5? Or did I screw up on my side?
Absolutely for 6.11-rc5 (but -rc4 works too and -rc6 likely will).

Please try again without --dry-run, some of them will only apply on top
of earlier ones because they touch the same lines. If that doesn't work
then I may have screwed up mailing them.

BTW, #3 is logically dependent on #2, but the others could be reordered
at will IIRC.

As for your audio case, #4 is probably the only potentially interesting
one, but it didn't help on my sluggish ASMedia host. It should, however, 
at least eliminate the "WARN Event TRB with no TDs queued" on USB 3.1
and newer hosts.

But the root cause of those audio issues is likely elsewhere.

Regards,
Michal
fps Aug. 28, 2024, 1:48 p.m. UTC | #4
On August 28, 2024 3:24:58 PM GMT+02:00, "Michał Pecio" <michal.pecio@gmail.com> wrote:
>> Are you sure these are against 6.11-rc5? Or did I screw up on my side?
>Absolutely for 6.11-rc5 (but -rc4 works too and -rc6 likely will).
>
>Please try again without --dry-run, some of them will only apply on top
>of earlier ones because they touch the same lines. If that doesn't work
>then I may have screwed up mailing them.

Oh, of course! My bad. Sorry for the noise!


>
>BTW, #3 is logically dependent on #2, but the others could be reordered
>at will IIRC.
>
>As for your audio case, #4 is probably the only potentially interesting
>one, but it didn't help on my sluggish ASMedia host. It should, however, 
>at least eliminate the "WARN Event TRB with no TDs queued" on USB 3.1
>and newer hosts.
>
>But the root cause of those audio issues is likely elsewhere.

Yeah, just giving it a shot :)

Kind regards,
FPS
Michał Pecio Aug. 28, 2024, 3:02 p.m. UTC | #5
> We are in the middle of reworking transfer event handling with Niklas
> as well.
> 
> I'd be grateful if you could take a quick look at our solution and
> give your opinion on it as you have expertise in this area.
Beware that I may only have enough of that to be dangerous ;)

But I also have a few of those PCIe cards with various chips and all
three public revisions of the spec, so I'm trying to stay in spec and
verify things on hardware.

BTW, I only really touched isochronous transfers so far, but it looks
like some things are not entirely right in others too. The handling of
short packets for instance - it looks like TRB_ISP is set on various
transfers, but process_xxx_td() functions don't care if the event is
mid-TD and give back the URB immediately.

I though about fixing this (I have a naive hope that some xHC crashes
that I see are "undefined behavior" caused by driver bugs, even though
in reality the hardware is probably simply FUBAR) but I'd need to read
more spec chapters to get there.

> Main idea is simple.
> If a transfer event points to a TRB between dequeue and enqueue we
> give back all queued TDs passed up to this TRB.
> And if the event points to a valid TD then handle that. (normal case)
> 
> Code is much simpler, we get rid of skip flag, and could probably
> merge error_mid_td and urb_length_set flags into one flag.
Yes, lots of things could be better if the code walked the ring from
the last known dequeue to the current one and analyzed what's there.
We could know if the event is for something that we have already freed
(a bug?), or for any known pending TD and which one exactly, or for a
non-transfer TD (like Forced Stop Event).

Many of those thing are currently detected with heuristics which only
work about right if everything goes as planned (no bugs in HW or SW).

And first find the TD, then worry how to handle it. I like that Niklas
managed to shorten this giant loop already, and my patches #2 and #3
shorten it further and move TD selection logic up (#2) and all else
down (#3). This way we can work on selection and handling separately,
with less worrying about how they influence each other.

> The code isn't complete or tested yet. It doesn't handle
> under/overruns, but that could probably be fixed by just turning
> "return 0" for those cases into "break"
Problem is that not all events give valid dequeue pointers. These two,
for example, are not guaranteed to. Then we have no idea where on the
ring the HC currently is. The EP doesn't stall and it may have already
been resumed by a simple doorbell ring.

Regards,
Michal
Michał Pecio Aug. 30, 2024, 11:53 a.m. UTC | #6
Hi Mathias,

> Code on top of 6.11-rc4 can be found in my
> feature_transfer_event_rework branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git
Thanks for pointing me to this.

It does look like a very promising approach. This is the principle of
"matching before handling" taken to the extreme, as nothing is changed
during the first pass through the TD list. It seems to work well.

I feel like the counting of distances from ring->dequeue is a little,
I don't know, unexpected maybe, but it does indeed produce surprisingly
compact and simple code thanks to those segment numbers.

That being said, I've always been a strong case of NIH syndrome, so
I implemented a similar search for passed TDs which relies solely on
walking the ring in order from dequeue to the event DMA pointer.

So I guess we will have two implementations to compare and choose from.

I will try to develop this approach into a complete implementation that
actually works and can be tested. I want to maximally preserve original
behavior, except for obvious bugs (I found an obscure one already) to
simplify validation and review.

So, for example, no removing the skip flag for now. I'm not sure if
it's a good idea at all, as it frankly is sweeping bugs under the rug.
I have one HC which sometimes silently misses TDs for no known reason
and it misses so many of them that auto-skipping wouldn't help anyway.
OTOH, when I get the TRB errors, at least I know the HC is screwed.

Regards,
Michal