[2/7] usb: xhci: Check endpoint is valid before dereferencing it

Message ID 20230116142216.1141605-3-mathias.nyman@linux.intel.com (mailing list archive)
State Accepted
Commit e8fb5bc76eb86437ab87002d4a36d6da02165654
Series usb and xhci fixes for usb-linus

Commit Message

Mathias Nyman Jan. 16, 2023, 2:22 p.m. UTC
From: Jimmy Hu <hhhuuu@google.com>

When the host controller is not responding, all URBs queued to all
endpoints need to be killed. This can cause a kernel panic if we
dereference an invalid endpoint.

Fix this by using xhci_get_virt_ep() helper to find the endpoint and
checking if the endpoint is valid before dereferencing it.

[233311.853271] xhci-hcd xhci-hcd.1.auto: xHCI host controller not responding, assume dead
[233311.853393] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e8

[233311.853964] pc : xhci_hc_died+0x10c/0x270
[233311.853971] lr : xhci_hc_died+0x1ac/0x270

[233311.854077] Call trace:
[233311.854085]  xhci_hc_died+0x10c/0x270
[233311.854093]  xhci_stop_endpoint_command_watchdog+0x100/0x1a4
[233311.854105]  call_timer_fn+0x50/0x2d4
[233311.854112]  expire_timers+0xac/0x2e4
[233311.854118]  run_timer_softirq+0x300/0xabc
[233311.854127]  __do_softirq+0x148/0x528
[233311.854135]  irq_exit+0x194/0x1a8
[233311.854143]  __handle_domain_irq+0x164/0x1d0
[233311.854149]  gic_handle_irq.22273+0x10c/0x188
[233311.854156]  el1_irq+0xfc/0x1a8
[233311.854175]  lpm_cpuidle_enter+0x25c/0x418 [msm_pm]
[233311.854185]  cpuidle_enter_state+0x1f0/0x764
[233311.854194]  do_idle+0x594/0x6ac
[233311.854201]  cpu_startup_entry+0x7c/0x80
[233311.854209]  secondary_start_kernel+0x170/0x198

Fixes: 50e8725e7c42 ("xhci: Refactor command watchdog and fix split string.")
Cc: stable@vger.kernel.org
Signed-off-by: Jimmy Hu <hhhuuu@google.com>
Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
---
 drivers/usb/host/xhci-ring.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
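
The oops above reports the fault at virtual address 0xe8 rather than at 0. One plausible reading is that the base pointer (the per-slot device) was NULL and the faulting read was of a member some distance into the structure. The userspace sketch below only illustrates that arithmetic; the mock_ structures and their layouts are assumptions made up for this example, so the offsets they produce will not match the real driver.

```c
#include <stddef.h>

/* Hypothetical, simplified layouts; NOT the real xhci structures. */
struct mock_virt_ep {
	void *ring;             /* placeholder member */
	unsigned int ep_state;  /* the member the caller reads first */
};

struct mock_virt_device {
	int slot_id;                  /* placeholder member */
	void *udev;                   /* placeholder member */
	struct mock_virt_ep eps[31];  /* assumed endpoints per device */
};

/*
 * Offset, relative to the device pointer, of eps[ep_index].ep_state.
 * If the device pointer is NULL, this offset is the address a read of
 * dev->eps[ep_index].ep_state would touch: small and non-zero.
 */
static size_t mock_fault_offset(size_t ep_index)
{
	return offsetof(struct mock_virt_device, eps)
	     + ep_index * sizeof(struct mock_virt_ep)
	     + offsetof(struct mock_virt_ep, ep_state);
}
```

Taking the address of the endpoint does not fault by itself; the fault only happens on the later member read, which is why a NULL check has to come before the first use of the endpoint pointer.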

Comments

Ladislav Michl Jan. 16, 2023, 4:59 p.m. UTC | #1
Hi Mathias,

On Mon, Jan 16, 2023 at 04:22:11PM +0200, Mathias Nyman wrote:
> From: Jimmy Hu <hhhuuu@google.com>
> 
> When the host controller is not responding, all URBs queued to all
> endpoints need to be killed. This can cause a kernel panic if we
> dereference an invalid endpoint.
> 
> Fix this by using xhci_get_virt_ep() helper to find the endpoint and
> checking if the endpoint is valid before dereferencing it.

I'm a bit confused this goes in and even to stable. Let me quote your
own analysis from
Message-ID: <0fe978ed-8269-9774-1c40-f8a98c17e838@linux.intel.com>
On Thu, Dec 22, 2022 at 03:18:53PM +0200, Mathias Nyman wrote:
> I think root cause is that freeing xhci->devs[i] and including rings isn't
> protected by the lock, this happens in xhci_free_virt_device() called by
> xhci_free_dev(), which in turn may be called by usbcore at any time
> 
> So xhci->devs[i] might just suddenly disappear
> 
> Patch just checks more often if xhci->devs[i] is valid, between every endpoint.
> So the race between xhci_free_virt_device() and xhci_kill_endpoint_urbs()
> doesn't trigger null pointer deref as easily.

I believe the above is correct, and even Jimmy was unable to verify your
later patch (3rd in this series), which raises the question of how this
patch could have been tested. It just buries the bug a bit deeper, and I do
not think it is the right approach.

	ladis

> [233311.853271] xhci-hcd xhci-hcd.1.auto: xHCI host controller not responding, assume dead
> [233311.853393] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000e8
> 
> [233311.853964] pc : xhci_hc_died+0x10c/0x270
> [233311.853971] lr : xhci_hc_died+0x1ac/0x270
> 
> [233311.854077] Call trace:
> [233311.854085]  xhci_hc_died+0x10c/0x270
> [233311.854093]  xhci_stop_endpoint_command_watchdog+0x100/0x1a4
> [233311.854105]  call_timer_fn+0x50/0x2d4
> [233311.854112]  expire_timers+0xac/0x2e4
> [233311.854118]  run_timer_softirq+0x300/0xabc
> [233311.854127]  __do_softirq+0x148/0x528
> [233311.854135]  irq_exit+0x194/0x1a8
> [233311.854143]  __handle_domain_irq+0x164/0x1d0
> [233311.854149]  gic_handle_irq.22273+0x10c/0x188
> [233311.854156]  el1_irq+0xfc/0x1a8
> [233311.854175]  lpm_cpuidle_enter+0x25c/0x418 [msm_pm]
> [233311.854185]  cpuidle_enter_state+0x1f0/0x764
> [233311.854194]  do_idle+0x594/0x6ac
> [233311.854201]  cpu_startup_entry+0x7c/0x80
> [233311.854209]  secondary_start_kernel+0x170/0x198
> 
> Fixes: 50e8725e7c42 ("xhci: Refactor command watchdog and fix split string.")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jimmy Hu <hhhuuu@google.com>
> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com>
> ---
>  drivers/usb/host/xhci-ring.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index ddc30037f9ce..f5b0e1ce22af 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1169,7 +1169,10 @@ static void xhci_kill_endpoint_urbs(struct xhci_hcd *xhci,
>  	struct xhci_virt_ep *ep;
>  	struct xhci_ring *ring;
>  
> -	ep = &xhci->devs[slot_id]->eps[ep_index];
> +	ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
> +	if (!ep)
> +		return;
> +
>  	if ((ep->ep_state & EP_HAS_STREAMS) ||
>  			(ep->ep_state & EP_GETTING_NO_STREAMS)) {
>  		int stream_id;
> -- 
> 2.25.1

Mathias Nyman Jan. 17, 2023, 10:02 a.m. UTC | #2
On 16.1.2023 18.59, Ladislav Michl wrote:
> Hi Mathias,
> 
> On Mon, Jan 16, 2023 at 04:22:11PM +0200, Mathias Nyman wrote:
>> From: Jimmy Hu <hhhuuu@google.com>
>>
>> When the host controller is not responding, all URBs queued to all
>> endpoints need to be killed. This can cause a kernel panic if we
>> dereference an invalid endpoint.
>>
>> Fix this by using xhci_get_virt_ep() helper to find the endpoint and
>> checking if the endpoint is valid before dereferencing it.
> 
> I'm a bit confused this goes in and even to stable. Let me quote your
> own analysis from
> Message-ID: <0fe978ed-8269-9774-1c40-f8a98c17e838@linux.intel.com>
> On Thu, Dec 22, 2022 at 03:18:53PM +0200, Mathias Nyman wrote:
>> I think root cause is that freeing xhci->devs[i] and including rings isn't
>> protected by the lock, this happens in xhci_free_virt_device() called by
>> xhci_free_dev(), which in turn may be called by usbcore at any time
>>
>> So xhci->devs[i] might just suddenly disappear
>>
>> Patch just checks more often if xhci->devs[i] is valid, between every endpoint.
>> So the race between xhci_free_virt_device() and xhci_kill_endpoint_urbs()
>> doesn't trigger null pointer deref as easily.
> 
> I believe the above is correct, and even Jimmy was unable to verify your
> later patch (3rd in this series), which raises the question of how this
> patch could have been tested. It just buries the bug a bit deeper, and I do
> not think it is the right approach.

As I said in a direct response to the original patch I think this is a valid fix
for older kernels where we used to unlock xhci->lock while giving back
URBs. Together with PATCH 3/7 the issue should be completely resolved.
For later kernels PATCH 3/7 should be enough by itself, but no harm in keeping this.

See Message-ID: <379b395f-b65c-96fe-7ecc-f18e3740b990@linux.intel.com>

"Older kernels" here means all kernels before v5.5, which lack commit
36dc01657b49 ("usb: host: xhci: Support running urb giveback in tasklet context").
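
The situation described for older kernels (the lock is dropped while URBs are given back, so the device can disappear between endpoints) can be modelled in a small single-threaded userspace sketch. Everything prefixed mock_ is hypothetical, and the "other context" that frees the device is simulated inline by a test knob. It shows only why re-validating the endpoint on every iteration avoids the NULL dereference. It does not model a free that races between the check and the use, which is the concern raised earlier in this thread.

```c
#include <stddef.h>
#include <stdlib.h>

#define MOCK_MAX_SLOTS   4  /* assumed, for the sketch only */
#define MOCK_EPS_PER_DEV 3  /* assumed, for the sketch only */

struct mock_virt_ep { int pending_urbs; };
struct mock_virt_device { struct mock_virt_ep eps[MOCK_EPS_PER_DEV]; };
struct mock_hcd {
	struct mock_virt_device *devs[MOCK_MAX_SLOTS];
	int free_after_giveback;  /* knob: simulate the device being freed */
};

/* Return the endpoint, or NULL if the indices or the slot are invalid. */
static struct mock_virt_ep *mock_get_virt_ep(struct mock_hcd *hcd,
					     unsigned int slot_id,
					     unsigned int ep_index)
{
	if (slot_id == 0 || slot_id >= MOCK_MAX_SLOTS)
		return NULL;
	if (ep_index >= MOCK_EPS_PER_DEV)
		return NULL;
	if (!hcd->devs[slot_id])
		return NULL;
	return &hcd->devs[slot_id]->eps[ep_index];
}

/*
 * Stand-in for giving back URBs with the lock dropped. While unlocked,
 * another context may free the device; here that happens inline.
 */
static void mock_giveback(struct mock_hcd *hcd, unsigned int slot_id,
			  struct mock_virt_ep *ep)
{
	ep->pending_urbs = 0;
	if (hcd->free_after_giveback) {
		free(hcd->devs[slot_id]);
		hcd->devs[slot_id] = NULL;
	}
}

/* Returns 1 if the endpoint was handled, 0 if it was no longer valid. */
static int mock_kill_endpoint_urbs(struct mock_hcd *hcd,
				   unsigned int slot_id,
				   unsigned int ep_index)
{
	struct mock_virt_ep *ep = mock_get_virt_ep(hcd, slot_id, ep_index);

	if (!ep)
		return 0;
	mock_giveback(hcd, slot_id, ep);
	return 1;
}

/* Walk all endpoints of a slot, re-validating on every iteration. */
static int mock_kill_slot_urbs(struct mock_hcd *hcd, unsigned int slot_id)
{
	int handled = 0;

	for (unsigned int i = 0; i < MOCK_EPS_PER_DEV; i++)
		handled += mock_kill_endpoint_urbs(hcd, slot_id, i);
	return handled;
}
```

With the knob off, all endpoints of the slot are handled. With it on, the first endpoint is handled, the device vanishes, and the remaining iterations return early instead of dereferencing a stale slot.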

I haven't been able to trigger this issue myself, but based on the report and the findings in
the code I still think this is the right approach. The internal testing this has been
through could only prove that these patches (2/7 and 3/7) don't cause any additional issues.

If you think the analysis or solution is incorrect let me know, and we can come up with a
better one.

Thanks
Mathias

youling 257 Feb. 23, 2023, 4:26 p.m. UTC | #3
I use a Type-C 20Gbps USB 3.2 Gen 2x2 PCIe expansion card; maybe this patch causes the card to stop working.

[    0.285088] xhci_hcd 0000:09:00.0: hcc params 0x0200ef80 hci version 0x110 quirks 0x0000000000800010
[    0.285334] usb usb7: We don't know the algorithms for LPM for this host, disabling LPM.
[    0.285347] xhci_hcd 0000:09:00.0: xHCI Host Controller
[    0.285407] hub 7-0:1.0: USB hub found
[    0.285415] hub 7-0:1.0: 4 ports detected
[    0.285783] xhci_hcd 0000:09:00.0: new USB bus registered, assigned bus number 8
[    0.285787] xhci_hcd 0000:09:00.0: Host supports USB 3.2 Enhanced SuperSpeed
[    0.285889] hub 4-0:1.0: USB hub found
[    0.285901] hub 4-0:1.0: 1 port detected
[    0.285988] usb usb8: We don't know the algorithms for LPM for this host, disabling LPM.
[ 3277.156054] xhci_hcd 0000:09:00.0: Abort failed to stop command ring: -110
[ 3277.156091] xhci_hcd 0000:09:00.0: xHCI host controller not responding, assume dead
[ 3277.156103] xhci_hcd 0000:09:00.0: HC died; cleaning up

Maybe this patch causes the "xhci_hcd 0000:09:00.0: HC died; cleaning up" problem.

Mathias Nyman Feb. 24, 2023, 10:29 a.m. UTC | #4
On 23.2.2023 18.26, youling257 wrote:
> I use a Type-C 20Gbps USB 3.2 Gen 2x2 PCIe expansion card; maybe this patch causes the card to stop working.
> 
> [    0.285088] xhci_hcd 0000:09:00.0: hcc params 0x0200ef80 hci version 0x110 quirks 0x0000000000800010
> [    0.285334] usb usb7: We don't know the algorithms for LPM for this host, disabling LPM.
> [    0.285347] xhci_hcd 0000:09:00.0: xHCI Host Controller
> [    0.285407] hub 7-0:1.0: USB hub found
> [    0.285415] hub 7-0:1.0: 4 ports detected
> [    0.285783] xhci_hcd 0000:09:00.0: new USB bus registered, assigned bus number 8
> [    0.285787] xhci_hcd 0000:09:00.0: Host supports USB 3.2 Enhanced SuperSpeed
> [    0.285889] hub 4-0:1.0: USB hub found
> [    0.285901] hub 4-0:1.0: 1 port detected
> [    0.285988] usb usb8: We don't know the algorithms for LPM for this host, disabling LPM.
> [ 3277.156054] xhci_hcd 0000:09:00.0: Abort failed to stop command ring: -110
> [ 3277.156091] xhci_hcd 0000:09:00.0: xHCI host controller not responding, assume dead
> [ 3277.156103] xhci_hcd 0000:09:00.0: HC died; cleaning up
> 
> Maybe this patch causes the "xhci_hcd 0000:09:00.0: HC died; cleaning up" problem.

Unlikely, this patch only touches code called after HC already died.

Does reverting this patch fix the issue?

Thanks
Mathias

youling 257 Feb. 24, 2023, 3:58 p.m. UTC | #5
On February 17, while running Linux 6.2-rc8, I hit "Abort failed to stop
command ring: -110". My Google search history for February 17 shows
searches for "Abort failed to stop command ring: -110" and "Usbreset No
such device found".

Date: Fri, 17 Feb 2023 23:59:29 +0800
Subject: [PATCH] Revert "usb: xhci: Check endpoint is valid before
dereferencing it"
This reverts commit e8fb5bc76eb86437ab87002d4a36d6da02165654.

In the week since then I have not seen USB stop working.
Maybe reverting it fixed my problem.

youling 257 Feb. 24, 2023, 4:03 p.m. UTC | #6
By the way, I have been using this patch on my Linux kernel for a year:
https://lore.kernel.org/all/6908aa69-469b-8f92-8e19-60685f524f9c@synopsys.com/

Patch

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index ddc30037f9ce..f5b0e1ce22af 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1169,7 +1169,10 @@  static void xhci_kill_endpoint_urbs(struct xhci_hcd *xhci,
 	struct xhci_virt_ep *ep;
 	struct xhci_ring *ring;
 
-	ep = &xhci->devs[slot_id]->eps[ep_index];
+	ep = xhci_get_virt_ep(xhci, slot_id, ep_index);
+	if (!ep)
+		return;
+
 	if ((ep->ep_state & EP_HAS_STREAMS) ||
 			(ep->ep_state & EP_GETTING_NO_STREAMS)) {
 		int stream_id;
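
The check this hunk introduces can be modelled in userspace roughly as follows. This is a minimal sketch, not the driver code: the mock_ types, the table sizes and the integer return value are assumptions made for illustration (the real xhci_kill_endpoint_urbs() returns void), and the helper is written the way the commit message describes xhci_get_virt_ep(), namely validating the indices and the slot before handing back an endpoint pointer.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
#define MOCK_MAX_SLOTS   256  /* assumed slot-table size */
#define MOCK_EPS_PER_DEV 31   /* assumed endpoints per device */

struct mock_virt_ep { unsigned int ep_state; };
struct mock_virt_device { struct mock_virt_ep eps[MOCK_EPS_PER_DEV]; };
struct mock_hcd { struct mock_virt_device *devs[MOCK_MAX_SLOTS]; };

/*
 * Return the endpoint, or NULL if an index is out of range or the slot
 * currently has no device attached.
 */
static struct mock_virt_ep *mock_get_virt_ep(struct mock_hcd *hcd,
					     unsigned int slot_id,
					     unsigned int ep_index)
{
	if (slot_id == 0 || slot_id >= MOCK_MAX_SLOTS)
		return NULL;
	if (ep_index >= MOCK_EPS_PER_DEV)
		return NULL;
	if (!hcd->devs[slot_id])
		return NULL;
	return &hcd->devs[slot_id]->eps[ep_index];
}

/* Caller pattern from the patch: bail out instead of dereferencing. */
static int mock_kill_endpoint_urbs(struct mock_hcd *hcd,
				   unsigned int slot_id,
				   unsigned int ep_index)
{
	struct mock_virt_ep *ep = mock_get_virt_ep(hcd, slot_id, ep_index);

	if (!ep)
		return -1;     /* nothing to clean up for this endpoint */
	ep->ep_state = 0;      /* stand-in for the real URB cleanup */
	return 0;
}
```

Compared with the removed line, which indexed the slot table and endpoint array unconditionally, the lookup and the validity check now live in one place, so the caller has a single NULL test to perform.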