diff mbox

Regression 4.17-rc1: SSD doesn properly resume causing system hang (NULL pointer dereference)

Message ID 134955874b401e6764077393c75ab2d4549b940a.camel@wdc.com (mailing list archive)
State New, archived
Headers show

Commit Message

Bart Van Assche April 24, 2018, 6:27 p.m. UTC
On Tue, 2018-04-24 at 19:37 +0200, Paul Menzel wrote:
> On 04/24/18 19:31, Bart Van Assche wrote:
> Here it is, pasted as citation, as otherwise Thunderbird would wrap the 
> line.
> 
> > (gdb) disas blk_set_runtime_active
> > Dump of assembler code for function blk_set_runtime_active:
> >    0xc1518610 <+0>:	call   0xc106ac9c <__fentry__>
> >    0xc1518615 <+5>:	push   %ebp
> >    0xc1518616 <+6>:	mov    %esp,%ebp
> >    0xc1518618 <+8>:	sub    $0x14,%esp
> >    0xc151861b <+11>:	mov    %ebx,-0xc(%ebp)
> >    0xc151861e <+14>:	mov    %eax,%ebx
> >    0xc1518620 <+16>:	mov    %gs:0x14,%eax
> >    0xc1518626 <+22>:	mov    %eax,-0x10(%ebp)
> >    0xc1518629 <+25>:	xor    %eax,%eax
> >    0xc151862b <+27>:	test   %ebx,%ebx
> >    0xc151862d <+29>:	mov    %esi,-0x8(%ebp)
> >    0xc1518630 <+32>:	mov    %edi,-0x4(%ebp)
> >    0xc1518633 <+35>:	je     0xc15186b3 <blk_set_runtime_active+163>
> >    0xc1518635 <+37>:	mov    0xfc(%ebx),%eax
> >    0xc151863b <+43>:	call   0xc1a4b920 <_raw_spin_lock_irq>
> >    0xc1518640 <+48>:	mov    0x150(%ebx),%esi
> >    0xc1518646 <+54>:	xor    %eax,%eax
> >    0xc1518648 <+56>:	mov    0xc1ca7d20,%edi
> >    0xc151864e <+62>:	mov    %eax,0x154(%ebx)
> >    0xc1518654 <+68>:	cmp    $0xffffff0c,%esi
> >    0xc151865a <+74>:	mov    %edi,-0x14(%ebp)
> >    0xc151865d <+77>:	je     0xc15186a5 <blk_set_runtime_active+149>
> >    0xc151865f <+79>:	mov    %edi,0xf4(%esi)

The e-mail at the start of this e-mail thread shows that %esi == NULL at
the time of the crash and also that the crash occurred at offset 79 (0x4f)
in this function. I think that means that the crash occurred in the following
code: pm_request_autosuspend(q->dev) and also that this means that q->dev ==
NULL. Can you test the (untested) patch below?


Thanks,

Bart.

Comments

Bart Van Assche April 24, 2018, 9:17 p.m. UTC | #1
On Tue, 2018-04-24 at 23:04 +0200, Paul Menzel wrote:
> I applied your change, and rebuilt the Linux kernel. Unfortunately, it 

> looks like, it didn’t make a difference.


In that case I don't know what is causing the failure. Can you run a bisect
to determine which commit introduced this regression?

Thanks,

Bart.
Paul Menzel April 25, 2018, 5:37 a.m. UTC | #2
Dear Bart,


Am 24.04.2018 um 23:17 schrieb Bart Van Assche:
> On Tue, 2018-04-24 at 23:04 +0200, Paul Menzel wrote:
>> I applied your change, and rebuilt the Linux kernel. Unfortunately, it
>> looks like, it didn’t make a difference.
> 
> In that case I don't know what is causing the failure. Can you run a bisect
> to determine which commit introduced this regression?

With `scsi_mod.use_blk_mq=n` the system resumes fine, so for to me 
unknown reasons, that Kconfig option get selected in my Linux kernel 
configuration. I remember having similar issues when this was enabled by 
default in Linux 4.13-rc?, so it was just a configuration problem and 
not a regression. Unfortunately, the Linux configuration files are not 
under version control, so I cannot check, but it is probably my fault.

Sorry for the noise, and please tell me, what I can do to get the option 
working on this old device.


Kind regards,

Paul
Bart Van Assche April 25, 2018, 12:26 p.m. UTC | #3
On Wed, 2018-04-25 at 07:37 +0200, Paul Menzel wrote:
> Am 24.04.2018 um 23:17 schrieb Bart Van Assche:

> > On Tue, 2018-04-24 at 23:04 +0200, Paul Menzel wrote:

> > > I applied your change, and rebuilt the Linux kernel. Unfortunately, it

> > > looks like, it didn’t make a difference.

> > 

> > In that case I don't know what is causing the failure. Can you run a bisect

> > to determine which commit introduced this regression?

> 

> With `scsi_mod.use_blk_mq=n` the system resumes fine, so for to me 

> unknown reasons, that Kconfig option get selected in my Linux kernel 

> configuration. I remember having similar issues when this was enabled by 

> default in Linux 4.13-rc?, so it was just a configuration problem and 

> not a regression. Unfortunately, the Linux configuration files are not 

> under version control, so I cannot check, but it is probably my fault.

> 

> Sorry for the noise, and please tell me, what I can do to get the option 

> working on this old device.


Hello Paul,

Did the same system boot fine with a previous kernel with scsi-mq enabled?
Anyway, we would like to know what is the root cause such that this NULL
pointer dereference can be fixed. There are namely plans to remove the
legacy block layer in the not too distant future.

Thanks,

Bart.
Paul Menzel April 25, 2018, 12:34 p.m. UTC | #4
Dear Bart,


On 04/25/18 14:26, Bart Van Assche wrote:
> On Wed, 2018-04-25 at 07:37 +0200, Paul Menzel wrote:
>> Am 24.04.2018 um 23:17 schrieb Bart Van Assche:
>>> On Tue, 2018-04-24 at 23:04 +0200, Paul Menzel wrote:
>>>> I applied your change, and rebuilt the Linux kernel. Unfortunately, it
>>>> looks like, it didn’t make a difference.
>>>
>>> In that case I don't know what is causing the failure. Can you run a bisect
>>> to determine which commit introduced this regression?
>>
>> With `scsi_mod.use_blk_mq=n` the system resumes fine, so for to me
>> unknown reasons, that Kconfig option get selected in my Linux kernel
>> configuration. I remember having similar issues when this was enabled by
>> default in Linux 4.13-rc?, so it was just a configuration problem and
>> not a regression. Unfortunately, the Linux configuration files are not
>> under version control, so I cannot check, but it is probably my fault.
>>
>> Sorry for the noise, and please tell me, what I can do to get the option
>> working on this old device.

> Did the same system boot fine with a previous kernel with scsi-mq enabled?

No, as far as I know it never worked, see thread *[Regression 4.13-rc1] 
Resume does not work on Lenovo X60t* [1].

> Anyway, we would like to know what is the root cause such that this NULL
> pointer dereference can be fixed. There are namely plans to remove the
> legacy block layer in the not too distant future.

I’ll be happy to test proposed changes.


Kind regards,

Paul


PS: Your mailer also changed *doesn’t* to *doesn* in the subject line.


[1] https://www.spinics.net/lists/linux-scsi/msg111457.html
diff mbox

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 57cae47ab1c2..b029a94a1e66 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3272,7 +3272,6 @@  static void sd_probe_async(struct work_struct *work)
 		gd->events |= DISK_EVENT_MEDIA_CHANGE;
 	}
 
-	blk_pm_runtime_init(sdp->request_queue, dev);
 	device_add_disk(dev, gd);
 	if (sdkp->capacity)
 		sd_dif_config_host(sdkp);
@@ -3390,6 +3389,8 @@  static int sd_probe(struct device *dev)
 	get_device(dev);
 	dev_set_drvdata(dev, sdkp);
 
+	blk_pm_runtime_init(sdp->request_queue, dev);
+
 	get_device(&sdkp->dev);	/* prevent release before sd_probe_async() */
 	WARN_ON_ONCE(!queue_work(system_unbound_wq, &sdkp->probe_work));