diff mbox series

[-next] usb: xhci: do not free an empty cmd ring

Message ID 20230320042223.676505-1-xiehongyu1@kylinos.cn (mailing list archive)
State Superseded
Headers show
Series [-next] usb: xhci: do not free an empty cmd ring | expand

Commit Message

Hongyu Xie March 20, 2023, 4:22 a.m. UTC
It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
controller during stress cycle test(stress-ng, glmark2, x11perf, S4...).

phase one:
[26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
[26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
...
phase two:
[28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
...
[28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
[28650.583557] [0000000000000028] pgd=0000000000000000
...
[28650.583587] pc : xhci_suspend+0x154/0x5b0
[28650.583618] lr : xhci_suspend+0x148/0x5b0
[28650.583618] sp : ffffffc01c7ebbd0
[28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
[28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
[28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
[28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
[28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
[28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
[28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
[28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
[28650.583648] x13: 00000000f0000000 x12: 0000000000000001
[28650.583648] x11: 0000000000000000 x10: 0000000000000a80
[28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
[28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
[28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
[28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
[28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000

gdb:
(gdb) l *(xhci_suspend+0x154)
0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
849	{
850		struct xhci_ring *ring;
851		struct xhci_segment *seg;
852
853		ring = xhci->cmd_ring;
854		seg = ring->deq_seg;
(gdb) disassemble 0xffffffc010b6cd44
...
0xffffffc010b6cd40 <+336>:	ldr	x22, [x19, #160]
0xffffffc010b6cd44 <+340>:	ldr	x20, [x22, #40]
0xffffffc010b6cd48 <+344>:	mov	w1, #0x0                   	// #0

During phase one, platform_pm_thaw called xhci_plat_resume which called
xhci_resume. The rest possible calling routine might be
xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleaned in
xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.

During phase two, systemd was tring to hibernate again and called
xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
which was already NULL.

So if xhci->cmd_ring is NULL, xhci_clear_command_ring just return.

Co-developed-by: sunke <sunke@kylinos.cn>
Signed-off-by: sunke <sunke@kylinos.cn>
Signed-off-by: Hongyu Xie <xiehongyu1@kylinos.cn>
---
 drivers/usb/host/xhci.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Greg KH March 22, 2023, 7:29 p.m. UTC | #1
On Mon, Mar 20, 2023 at 12:22:23PM +0800, Hongyu Xie wrote:
> It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
> controller during stress cycle test(stress-ng, glmark2, x11perf, S4...).
> 
> phase one:
> [26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
> [26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
> ...
> phase two:
> [28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
> ...
> [28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
> [28650.583557] [0000000000000028] pgd=0000000000000000
> ...
> [28650.583587] pc : xhci_suspend+0x154/0x5b0
> [28650.583618] lr : xhci_suspend+0x148/0x5b0
> [28650.583618] sp : ffffffc01c7ebbd0
> [28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
> [28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
> [28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
> [28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
> [28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
> [28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
> [28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
> [28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
> [28650.583648] x13: 00000000f0000000 x12: 0000000000000001
> [28650.583648] x11: 0000000000000000 x10: 0000000000000a80
> [28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
> [28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
> [28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
> [28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
> [28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000
> 
> gdb:
> (gdb) l *(xhci_suspend+0x154)
> 0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
> 849	{
> 850		struct xhci_ring *ring;
> 851		struct xhci_segment *seg;
> 852
> 853		ring = xhci->cmd_ring;
> 854		seg = ring->deq_seg;
> (gdb) disassemble 0xffffffc010b6cd44
> ...
> 0xffffffc010b6cd40 <+336>:	ldr	x22, [x19, #160]
> 0xffffffc010b6cd44 <+340>:	ldr	x20, [x22, #40]
> 0xffffffc010b6cd48 <+344>:	mov	w1, #0x0                   	// #0
> 
> During phase one, platform_pm_thaw called xhci_plat_resume which called
> xhci_resume. The rest possible calling routine might be
> xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleaned in
> xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.
> 
> During phase two, systemd was tring to hibernate again and called
> xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
> which was already NULL.
> 
> So if xhci->cmd_ring is NULL, xhci_clear_command_ring just return.
> 
> Co-developed-by: sunke <sunke@kylinos.cn>
> Signed-off-by: sunke <sunke@kylinos.cn>
> Signed-off-by: Hongyu Xie <xiehongyu1@kylinos.cn>
> ---
>  drivers/usb/host/xhci.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
> index 6183ce8574b1..8b79ad2955e5 100644
> --- a/drivers/usb/host/xhci.c
> +++ b/drivers/usb/host/xhci.c
> @@ -919,6 +919,10 @@ static void xhci_clear_command_ring(struct xhci_hcd *xhci)
>  {
>  	struct xhci_ring *ring;
>  	struct xhci_segment *seg;
> +	if (!xhci->cmd_ring) {
> +		xhci_err(xhci, "Empty cmd ring");

Why is this being told to userspace?  What can it do about it?

> +		return;

Not returning an error?

> +	}
>  

Please always use checkpatch.pl when sending patches out so that
maintainers have to ask you to use checkpatch.pl.

Also, what commit does this fix?  Does this need to go to stable trees?

But wait, this feels wrong, what can keep this variable from being set
to NULL right after you check it?  Where it the lock involved?

thanks,

greg k-h
Hongyu Xie March 23, 2023, 6:19 a.m. UTC | #2
Hi greg,

在 2023/3/23 03:29, Greg KH 写道:
> On Mon, Mar 20, 2023 at 12:22:23PM +0800, Hongyu Xie wrote:
>> It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
>> controller during stress cycle test(stress-ng, glmark2, x11perf, S4...).
>>
>> phase one:
>> [26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
>> [26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
>> ...
>> phase two:
>> [28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
>> ...
>> [28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
>> [28650.583557] [0000000000000028] pgd=0000000000000000
>> ...
>> [28650.583587] pc : xhci_suspend+0x154/0x5b0
>> [28650.583618] lr : xhci_suspend+0x148/0x5b0
>> [28650.583618] sp : ffffffc01c7ebbd0
>> [28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
>> [28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
>> [28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
>> [28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
>> [28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
>> [28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
>> [28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
>> [28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
>> [28650.583648] x13: 00000000f0000000 x12: 0000000000000001
>> [28650.583648] x11: 0000000000000000 x10: 0000000000000a80
>> [28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
>> [28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
>> [28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
>> [28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
>> [28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000
>>
>> gdb:
>> (gdb) l *(xhci_suspend+0x154)
>> 0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
>> 849	{
>> 850		struct xhci_ring *ring;
>> 851		struct xhci_segment *seg;
>> 852
>> 853		ring = xhci->cmd_ring;
>> 854		seg = ring->deq_seg;
>> (gdb) disassemble 0xffffffc010b6cd44
>> ...
>> 0xffffffc010b6cd40 <+336>:	ldr	x22, [x19, #160]
>> 0xffffffc010b6cd44 <+340>:	ldr	x20, [x22, #40]
>> 0xffffffc010b6cd48 <+344>:	mov	w1, #0x0                   	// #0
>>
>> During phase one, platform_pm_thaw called xhci_plat_resume which called
>> xhci_resume. The rest possible calling routine might be
>> xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleaned in
>> xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.
>>
>> During phase two, systemd was tring to hibernate again and called
>> xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
>> which was already NULL.
>>
>> So if xhci->cmd_ring is NULL, xhci_clear_command_ring just return.
>>
>> Co-developed-by: sunke <sunke@kylinos.cn>
>> Signed-off-by: sunke <sunke@kylinos.cn>
>> Signed-off-by: Hongyu Xie <xiehongyu1@kylinos.cn>
>> ---
>>   drivers/usb/host/xhci.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
>> index 6183ce8574b1..8b79ad2955e5 100644
>> --- a/drivers/usb/host/xhci.c
>> +++ b/drivers/usb/host/xhci.c
>> @@ -919,6 +919,10 @@ static void xhci_clear_command_ring(struct xhci_hcd *xhci)
>>   {
>>   	struct xhci_ring *ring;
>>   	struct xhci_segment *seg;
>> +	if (!xhci->cmd_ring) {
>> +		xhci_err(xhci, "Empty cmd ring");
> 
> Why is this being told to userspace?  What can it do about it?
Can do nothing about it. But it's always good to know what was wrong.
> 
>> +		return;
> 
> Not returning an error?
Return -EINVAL to skip the rest of xhci_suspend? It might break kabi.
> 
>> +	}
>>   
> 
> Please always use checkpatch.pl when sending patches out so that
> maintainers have to ask you to use checkpatch.pl.
I did, no warning and no error.
> 
> Also, what commit does this fix?  Does this need to go to stable trees?
It would be 898213200cba. Yes. I'll fix it in v2.
> 
> But wait, this feels wrong, what can keep this variable from being set
> to NULL right after you check it?  Where it the lock involved?
xhci_clear_command_ring only exits in xhci_suspend. cmd_ring only sets
to NULL in xhci_mem_cleanup, which called by xhci_mem_init, xhci_stop,
xhci_resume.
> 
> thanks,
> 
> greg k-h
thanks,

Hongyu Xie
Mathias Nyman March 23, 2023, 11:01 a.m. UTC | #3
On 20.3.2023 6.22, Hongyu Xie wrote:
> It was first found on HUAWEI Kirin 9006C platform with a builtin xhci
> controller during stress cycle test(stress-ng, glmark2, x11perf, S4...).
> 
> phase one:
> [26788.706878] PM: dpm_run_callback(): platform_pm_thaw+0x0/0x68 returns -12
> [26788.706878] PM: Device xhci-hcd.1.auto failed to thaw async: error -12
> ...
> phase two:
> [28650.583496] [2023:01:19 04:43:29]Unable to handle kernel NULL pointer dereference at virtual address 0000000000000028
> ...
> [28650.583526] user pgtable: 4k pages, 39-bit VAs, pgdp=000000027862a000
> [28650.583557] [0000000000000028] pgd=0000000000000000
> ...
> [28650.583587] pc : xhci_suspend+0x154/0x5b0
> [28650.583618] lr : xhci_suspend+0x148/0x5b0
> [28650.583618] sp : ffffffc01c7ebbd0
> [28650.583618] x29: ffffffc01c7ebbd0 x28: ffffffec834d0000
> [28650.583618] x27: ffffffc0106a3cc8 x26: ffffffb2c540c848
> [28650.583618] x25: 0000000000000000 x24: ffffffec82ee30b0
> [28650.583618] x23: ffffffb43b31c2f8 x22: 0000000000000000
> [28650.583618] x21: 0000000000000000 x20: ffffffb43b31c000
> [28650.583648] x19: ffffffb43b31c2a8 x18: 0000000000000001
> [28650.583648] x17: 0000000000000803 x16: 00000000fffffffe
> [28650.583648] x15: 0000000000001000 x14: ffffffb150b67e00
> [28650.583648] x13: 00000000f0000000 x12: 0000000000000001
> [28650.583648] x11: 0000000000000000 x10: 0000000000000a80
> [28650.583648] x9 : ffffffc01c7eba00 x8 : ffffffb43ad10ae0
> [28650.583648] x7 : ffffffb84cd98dc0 x6 : 0000000cceb6a101
> [28650.583679] x5 : 00ffffffffffffff x4 : 0000000000000001
> [28650.583679] x3 : 0000000000000011 x2 : 0000000000e2cfa8
> [28650.583679] x1 : 00000000823535e1 x0 : 0000000000000000
> 
> gdb:
> (gdb) l *(xhci_suspend+0x154)
> 0xffffffc010b6cd44 is in xhci_suspend (/.../drivers/usb/host/xhci.c:854).
> 849	{
> 850		struct xhci_ring *ring;
> 851		struct xhci_segment *seg;
> 852
> 853		ring = xhci->cmd_ring;
> 854		seg = ring->deq_seg;
> (gdb) disassemble 0xffffffc010b6cd44
> ...
> 0xffffffc010b6cd40 <+336>:	ldr	x22, [x19, #160]
> 0xffffffc010b6cd44 <+340>:	ldr	x20, [x22, #40]
> 0xffffffc010b6cd48 <+344>:	mov	w1, #0x0                   	// #0
> 
> During phase one, platform_pm_thaw called xhci_plat_resume which called
> xhci_resume. The rest possible calling routine might be
> xhci_resume->xhci_init->xhci_mem_init, and xhci->cmd_ring was cleaned in
> xhci_mem_cleanup before xhci_mem_init returned -ENOMEM.
> 

Thanks for reporting this.

xhci_mem_init() failing with -ENOMEM doesn't sound good.

Are we really running out of memory? does kmemleak say anything?

Any chance you could look into where exactly xhci_mem_init() fails as
xhci_mem_init() always returns -ENOMEM on failure?

> During phase two, systemd was tring to hibernate again and called
> xhci_suspend, then xhci_clear_command_ring dereferenced xhci->cmd_ring
> which was already NULL.
> 
> So if xhci->cmd_ring is NULL, xhci_clear_command_ring just return.

If xhci_mem_init() failed then xhci driver is completely unusable.
it shouldn't be used at all after this.

Thanks
Mathias
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 6183ce8574b1..8b79ad2955e5 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -919,6 +919,10 @@  static void xhci_clear_command_ring(struct xhci_hcd *xhci)
 {
 	struct xhci_ring *ring;
 	struct xhci_segment *seg;
+	if (!xhci->cmd_ring) {
+		xhci_err(xhci, "Empty cmd ring");
+		return;
+	}
 
 	ring = xhci->cmd_ring;
 	seg = ring->deq_seg;