scsi: megaraid_sas: fix kdump kernel boot hung caused by JBOD

Message ID 1590651115-9619-1-git-send-email-newtongao@tencent.com (mailing list archive)
State Rejected

Commit Message

Kaixu Xia May 28, 2020, 7:31 a.m. UTC
From: Xiaoming Gao <newtongao@tencent.com>

When the kernel crashes and kexecs into the kdump kernel, megaraid_sas hangs
and prints the following error logs:

24.1485901 sd 0:0:6:0: [sda] tag#809 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
24.1867171 sd 0:0:6:0: [sda] tag#861 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
24.2054191 sd 0:0:6:0: [sda] tag#861 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
24.2549711 blk_update_request: I/O error, dev sda, sector 937782912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class
21.2752791 buffer_io_error: 2 callbacks suppressed
21.2752731 Buffer I/O error on dev sda, logical block 117212064, async page read

This bug is caused by commit 59db5a931bbe73f ("scsi: megaraid_sas: Handle
sequence JBOD map failure at driver level") and can be fixed by not taking
the sequence-number JBOD path when reset_devices is set.

Signed-off-by: Xiaoming Gao <newtongao@tencent.com>
---
 drivers/scsi/megaraid/megaraid_sas_fusion.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
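
The fix keys off reset_devices, which the kernel sets when the boot parameter
of the same name is present; kdump setups typically append it to the capture
kernel's command line. As a minimal userspace sketch (not driver code), the
helper below checks whether a command-line string, such as the contents of
/proc/cmdline, carries that token. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the whitespace-separated kernel command line contains
 * the bare token "reset_devices". Illustration only: inside the kernel
 * the same information is available as the global `reset_devices`.
 */
bool cmdline_has_reset_devices(const char *cmdline)
{
	static const char token[] = "reset_devices";
	const size_t len = sizeof(token) - 1;
	const char *p = cmdline;

	assert(cmdline != NULL);

	while ((p = strstr(p, token)) != NULL) {
		/* The match must be a whole token, not part of a longer word. */
		bool starts = (p == cmdline) || p[-1] == ' ' || p[-1] == '\t';
		bool ends = p[len] == '\0' || p[len] == ' ' ||
			    p[len] == '\t' || p[len] == '\n';

		if (starts && ends)
			return true;
		p += len;
	}
	return false;
}
```

Passing the text read from /proc/cmdline in the capture kernel should show
whether the patched condition would be in effect for that boot.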

Comments

Kai Liu June 2, 2020, 12:12 p.m. UTC | #1
On 2020/05/28 Thu 15:31, xiakaixu1987@gmail.com wrote:
>From: Xiaoming Gao <newtongao@tencent.com>
>
>When the kernel crashes and kexecs into the kdump kernel, megaraid_sas hangs
>and prints the following error logs:
>
>24.1485901 sd 0:0:6:0: [sda] tag#809 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
>24.1867171 sd 0:0:6:0: [sda] tag#861 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
>24.2054191 sd 0:0:6:0: [sda] tag#861 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>24.2549711 blk_update_request: I/O error, dev sda, sector 937782912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class
>21.2752791 buffer_io_error: 2 callbacks suppressed
>21.2752731 Buffer I/O error on dev sda, logical block 117212064, async page read
>
>This bug is caused by commit 59db5a931bbe73f ("scsi: megaraid_sas: Handle
>sequence JBOD map failure at driver level") and can be fixed by not taking
>the sequence-number JBOD path when reset_devices is set.

I've recently run into this exact issue on an arm64 machine with an Avago
3408 controller. This patch fixed the issue. Thank you.

Tested-by: Kai Liu <kai.liu@suse.com>

Best regards,
Kai

>
>Signed-off-by: Xiaoming Gao <newtongao@tencent.com>
>---
> drivers/scsi/megaraid/megaraid_sas_fusion.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>index b2ad965..24e7f1b 100644
>--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
>+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
>@@ -3127,7 +3127,7 @@ static void megasas_build_ld_nonrw_fusion(struct megasas_instance *instance,
> 		<< MR_RAID_CTX_RAID_FLAGS_IO_SUB_TYPE_SHIFT;
>
> 	/* If FW supports PD sequence number */
>-	if (instance->support_seqnum_jbod_fp) {
>+	if (!reset_devices && instance->support_seqnum_jbod_fp) {
> 		if (instance->use_seqnum_jbod_fp &&
> 			instance->pd_list[pd_index].driveType == TYPE_DISK) {
>
>-- 
>1.8.3.1
>
Martin K. Petersen June 3, 2020, 1:31 a.m. UTC | #2
> When the kernel crashes and kexecs into the kdump kernel, megaraid_sas hangs
> and prints the following error logs:
>
> 24.1485901 sd 0:0:6:0: [sda] tag#809 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
> 24.1867171 sd 0:0:6:0: [sda] tag#861 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
> 24.2054191 sd 0:0:6:0: [sda] tag#861 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> 24.2549711 blk_update_request: I/O error, dev sda, sector 937782912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class
> 21.2752791 buffer_io_error: 2 callbacks suppressed
> 21.2752731 Buffer I/O error on dev sda, logical block 117212064, async page read
>
> This bug is caused by commit 59db5a931bbe73f ("scsi: megaraid_sas: Handle
> sequence JBOD map failure at driver level") and can be fixed by not taking
> the sequence-number JBOD path when reset_devices is set.

Broadcom: Please review.

Thanks!
Chandrakanth Patil June 3, 2020, 1:34 p.m. UTC | #3
>Subject: Re: [PATCH] scsi: megaraid_sas: fix kdump kernel boot hung caused by JBOD
>
>
>> When the kernel crashes and kexecs into the kdump kernel, megaraid_sas hangs
>> and prints the following error logs:
>>
>> 24.1485901 sd 0:0:6:0: [sda] tag#809 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
>> 24.1867171 sd 0:0:6:0: [sda] tag#861 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
>> 24.2054191 sd 0:0:6:0: [sda] tag#861 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> 24.2549711 blk_update_request: I/O error, dev sda, sector 937782912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class
>> 21.2752791 buffer_io_error: 2 callbacks suppressed
>> 21.2752731 Buffer I/O error on dev sda, logical block 117212064, async page read
>>
>> This bug is caused by commit 59db5a931bbe73f ("scsi: megaraid_sas: Handle
>> sequence JBOD map failure at driver level") and can be fixed by not taking
>> the sequence-number JBOD path when reset_devices is set.
>
>Broadcom: Please review.
>
>Thanks!
>
>--
>Martin K. Petersen	Oracle Linux Engineering

We are working on it and will update you at the earliest.

Thanks,
Chandrakanth Patil
Chandrakanth Patil June 4, 2020, 11:09 a.m. UTC | #4
>Subject: RE: [PATCH] scsi: megaraid_sas: fix kdump kernel boot hung caused by JBOD
>
>>Subject: Re: [PATCH] scsi: megaraid_sas: fix kdump kernel boot hung
>>caused by JBOD
>>
>>
>>> When the kernel crashes and kexecs into the kdump kernel, megaraid_sas hangs
>>> and prints the following error logs:
>>>
>>> 24.1485901 sd 0:0:6:0: [sda] tag#809 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
>>> 24.1867171 sd 0:0:6:0: [sda] tag#861 BRCM Debug mfi stat 0x2d, data len requested/completed 0x1000/0x0
>>> 24.2054191 sd 0:0:6:0: [sda] tag#861 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>>> 24.2549711 blk_update_request: I/O error, dev sda, sector 937782912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class
>>> 21.2752791 buffer_io_error: 2 callbacks suppressed
>>> 21.2752731 Buffer I/O error on dev sda, logical block 117212064, async page read
>>>
>>> This bug is caused by commit 59db5a931bbe73f ("scsi: megaraid_sas: Handle
>>> sequence JBOD map failure at driver level") and can be fixed by not taking
>>> the sequence-number JBOD path when reset_devices is set.
>>
>>Broadcom: Please review.
>>
>>Thanks!
>>
>>--
>>Martin K. Petersen	Oracle Linux Engineering
>
>We are working on it and will update you at the earliest.
>
>Thanks,
>Chandrakanth Patil

Hi Martin, Xiaoming Gao, Kai Liu,

It is a known firmware issue and has been fixed. Please update to the
latest firmware available in the Broadcom support website.
Please let me know if you need any further information.

Thanks,
Chandrakanth Patil
Kai Liu June 4, 2020, 3:50 p.m. UTC | #5
On 2020/06/04 Thu 16:39, Chandrakanth Patil wrote:
>
>Hi Martin, Xiaoming Gao, Kai Liu,
>
>It is a known firmware issue and has been fixed. Please update to the
>latest firmware available in the Broadcom support website.
>Please let me know if you need any further information.

Hi Chandrakanth,

Could you let me know which megaraid based controllers are affected by 
this issue? All or some models or some generations?

Best regards,
Kai Liu
Chandrakanth Patil June 4, 2020, 7:35 p.m. UTC | #6
>Subject: Re: [PATCH] scsi: megaraid_sas: fix kdump kernel boot hung caused
>by JBOD
>
>On 2020/06/04 Thu 16:39, Chandrakanth Patil wrote:
>>
>>Hi Martin, Xiaoming Gao, Kai Liu,
>>
>>It is a known firmware issue and has been fixed. Please update to the
>>latest firmware available in the Broadcom support website.
>>Please let me know if you need any further information.
>
>Hi Chandrakanth,
>
>Could you let me know which megaraid based controllers are affected by
>this issue? All or some models or some generations?
>
>Best regards,
>Kai Liu

Hi Kai Liu,

Gen3 (Invader) and Gen3.5 (Ventura/Aero) generations of controllers are
affected.

Thanks,
Chandrakanth Patil
Kai Liu June 5, 2020, 4:38 a.m. UTC | #7
On 2020/06/05 Fri 01:05, Chandrakanth Patil wrote:
>
>Hi Kai Liu,
>
>Gen3 (Invader) and Gen3.5 (Ventura/Aero) generations of controllers are
>affected.

Hi Chandrakanth,

My card is not one of these but it's also problematic:

# lspci -nn|grep 3408
02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID Tri-Mode SAS3408 [1000:0017] (rev 01)

According to megaraid_sas.h it's Tomcat:

#define PCI_DEVICE_ID_LSI_TOMCAT                    0x0017

According to product information on broadcom.com the card model is
9440-8i. So I tried to upgrade to the latest firmware version
51.13.0-3223 but I got this error:

# ./storcli64 /c0 download file=9440-8i_nopad.rom
Download Completed.
Flashing image to adapter...
CLI Version = 007.1316.0000.0000 Mar 12, 2020
Operating system = Linux 5.3.18-0.g6748ac9-default
Controller = 0
Status = Failure
Description = image corrupted

I tried a few more versions from the Broadcom website; they all failed with
the same "image corrupted" error.

Here is the controller information:

# ./storcli64 /c0 show
Generating detailed summary of the adapter, it may take a while to complete.

CLI Version = 007.1316.0000.0000 Mar 12, 2020
Operating system = Linux 5.3.18-0.g6748ac9-default
Controller = 0
Status = Success
Description = None

Product Name = SAS3408
Serial Number = 033FAT10K8000236
SAS Address =  57c1cf15516f4000
PCI Address = 00:02:00:00
System Time = 06/05/2020 12:36:59
Mfg. Date = 00/00/00
Controller Time = 06/05/2020 04:36:58
FW Package Build = 50.6.3-0109
BIOS Version = 7.06.02.2_0x07060502
FW Version = 5.060.01-2262
Driver Name = megaraid_sas
Driver Version = 07.713.01.00-rc1
Vendor Id = 0x1000
Device Id = 0x17
SubVendor Id = 0x19E5
SubDevice Id = 0xD213
Host Interface = PCI-E
Device Interface = SAS-12G
Bus Number = 2
Device Number = 0
Function Number = 0
Domain ID = 0
Drive Groups = 3


Thanks,
Kai Liu
Chandrakanth Patil June 5, 2020, 3:30 p.m. UTC | #8
>Subject: Re: [PATCH] scsi: megaraid_sas: fix kdump kernel boot hung caused
>by JBOD
>
>On 2020/06/05 Fri 01:05, Chandrakanth Patil wrote:
>>
>>Hi Kai Liu,
>>
>>Gen3 (Invader) and Gen3.5 (Ventura/Aero) generations of controllers are
>>affected.
>
>Hi Chandrakanth,
>
>My card is not one of these but it's also problematic:
>
># lspci -nn|grep 3408
>02:00.0 RAID bus controller [0104]: Broadcom / LSI MegaRAID Tri-Mode SAS3408 [1000:0017] (rev 01)
>
>According to megaraid_sas.h it's Tomcat:
>
>#define PCI_DEVICE_ID_LSI_TOMCAT                    0x0017
>
>According to product information on broadcom.com the card model is
>9440-8i. So I tried to upgrade to the latest firmware version
>51.13.0-3223 but I got this error:
>
># ./storcli64 /c0 download file=9440-8i_nopad.rom
>Download Completed.
>Flashing image to adapter...
>CLI Version = 007.1316.0000.0000 Mar 12, 2020
>Operating system = Linux 5.3.18-0.g6748ac9-default
>Controller = 0
>Status = Failure
>Description = image corrupted
>
>I tried a few more versions from the Broadcom website; they all failed with
>the same "image corrupted" error.
>
>Here is the controller information:
>
># ./storcli64 /c0 show
>Generating detailed summary of the adapter, it may take a while to complete.
>
>CLI Version = 007.1316.0000.0000 Mar 12, 2020
>Operating system = Linux 5.3.18-0.g6748ac9-default
>Controller = 0
>Status = Success
>Description = None
>
>Product Name = SAS3408
>Serial Number = 033FAT10K8000236
>SAS Address =  57c1cf15516f4000
>PCI Address = 00:02:00:00
>System Time = 06/05/2020 12:36:59
>Mfg. Date = 00/00/00
>Controller Time = 06/05/2020 04:36:58
>FW Package Build = 50.6.3-0109
>BIOS Version = 7.06.02.2_0x07060502
>FW Version = 5.060.01-2262
>Driver Name = megaraid_sas
>Driver Version = 07.713.01.00-rc1
>Vendor Id = 0x1000
>Device Id = 0x17
>SubVendor Id = 0x19E5
>SubDevice Id = 0xD213
>Host Interface = PCI-E
>Device Interface = SAS-12G
>Bus Number = 2
>Device Number = 0
>Function Number = 0
>Domain ID = 0
>Drive Groups = 3
>
>
>Thanks,
>Kai Liu

Hi Kai Liu,

Tomcat (Device ID: 0017) belongs to the Gen3.5 controllers (the Ventura
family), so this issue is applicable.
As this is OEM-specific firmware, please contact the Broadcom support team in
order to get the correct firmware image.

-Chandrakanth Patil
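
The identification step discussed above (reading the [vendor:device] pair from
`lspci -nn` and comparing it with the Tomcat ID from megaraid_sas.h) can be
sketched in userspace as follows. This is only an illustration under stated
assumptions: the function names are hypothetical, and only the 0x0017 device
ID quoted in this thread is checked, not the full set of Gen3 and Gen3.5 IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PCI_VENDOR_ID_LSI_LOGIC  0x1000 /* "Vendor Id = 0x1000" in the storcli output */
#define PCI_DEVICE_ID_LSI_TOMCAT 0x0017 /* from megaraid_sas.h, as quoted in the thread */

/*
 * Extract the "[vvvv:dddd]" vendor/device pair from one line of
 * `lspci -nn` output. Bracketed fields without a colon, such as the
 * class code "[0104]", are skipped. Returns true on success.
 */
bool parse_lspci_ids(const char *line, unsigned int *vendor,
		     unsigned int *device)
{
	const char *p;

	assert(line != NULL && vendor != NULL && device != NULL);

	for (p = strchr(line, '['); p != NULL; p = strchr(p + 1, '[')) {
		unsigned int v, d;
		char close;

		if (sscanf(p, "[%4x:%4x%c", &v, &d, &close) == 3 &&
		    close == ']') {
			*vendor = v;
			*device = d;
			return true;
		}
	}
	return false;
}

/* True if the lspci line describes the Tomcat controller (1000:0017). */
bool lspci_line_is_tomcat(const char *line)
{
	unsigned int vendor, device;

	return parse_lspci_ids(line, &vendor, &device) &&
	       vendor == PCI_VENDOR_ID_LSI_LOGIC &&
	       device == PCI_DEVICE_ID_LSI_TOMCAT;
}
```

Fed the lspci line from message #7, lspci_line_is_tomcat() matches on the
1000:0017 pair and ignores the [0104] class code.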
Kai Liu June 6, 2020, 4:50 a.m. UTC | #9
On 2020/06/05 Fri 21:00, Chandrakanth Patil wrote:
>Hi Kai Liu,
>
>Tomcat (Device ID: 0017) belongs to the Gen3.5 controllers (the Ventura
>family), so this issue is applicable.
>As this is OEM-specific firmware, please contact the Broadcom support team in
>order to get the correct firmware image.

Thanks for your help, Chandrakanth.

Best regards,
Kai

Patch

diff --git a/drivers/scsi/megaraid/megaraid_sas_fusion.c b/drivers/scsi/megaraid/megaraid_sas_fusion.c
index b2ad965..24e7f1b 100644
--- a/drivers/scsi/megaraid/megaraid_sas_fusion.c
+++ b/drivers/scsi/megaraid/megaraid_sas_fusion.c
@@ -3127,7 +3127,7 @@  static void megasas_build_ld_nonrw_fusion(struct megasas_instance *instance,
 		<< MR_RAID_CTX_RAID_FLAGS_IO_SUB_TYPE_SHIFT;
 
 	/* If FW supports PD sequence number */
-	if (instance->support_seqnum_jbod_fp) {
+	if (!reset_devices && instance->support_seqnum_jbod_fp) {
 		if (instance->use_seqnum_jbod_fp &&
 			instance->pd_list[pd_index].driveType == TYPE_DISK) {