diff mbox series

scsi: libsas: stop discovering if oob mode is disconnected

Message ID 20191129032413.36092-1-yanaijie@huawei.com (mailing list archive)
State Superseded
Headers show
Series scsi: libsas: stop discovering if oob mode is disconnected | expand

Commit Message

Jason Yan Nov. 29, 2019, 3:24 a.m. UTC
The discovering of sas port is driving by workqueue in libsas. When
libsas is processing port events or phy events in workqueue, new evnets
may rise up and change the state of some structures such as asd_sas_phy.
This may cause some problems such as follows:

==>thread 1                       ==>thread 2

                                  ==>phy up
                                  ==>phy_up_v3_hw()
                                    ==>oob_mode = SATA_OOB_MODE;
                                  ==>phy donw quickly
                                  ==>hisi_sas_phy_down()
                                    ==>sas_ha->notify_phy_event()
                                    ==>sas_phy_disconnected()
                                      ==>oob_mode = OOB_NOT_CONNECTED
==>workqueue wakeup
==>sas_form_port()
  ==>sas_discover_domain()
    ==>sas_get_port_device()
      ==>oob_mode is OOB_NOT_CONNECTED and device
         is wrongly taken as expander

This at last lead to the panic when libsas trying to issue a command to
discover the device.

[183047.614035] Unable to handle kernel NULL pointer dereference at
virtual address 0000000000000058
[183047.622896] Mem abort info:
[183047.625762]   ESR = 0x96000004
[183047.628893]   Exception class = DABT (current EL), IL = 32 bits
[183047.634888]   SET = 0, FnV = 0
[183047.638015]   EA = 0, S1PTW = 0
[183047.641232] Data abort info:
[183047.644189]   ISV = 0, ISS = 0x00000004
[183047.648100]   CM = 0, WnR = 0
[183047.651145] user pgtable: 4k pages, 48-bit VAs, pgdp =
00000000b7df67be
[183047.657834] [0000000000000058] pgd=0000000000000000
[183047.662789] Internal error: Oops: 96000004 [#1] SMP
[183047.667740] Process kworker/u16:2 (pid: 31291, stack limit =
0x00000000417c4974)
[183047.675208] CPU: 0 PID: 3291 Comm: kworker/u16:2 Tainted: G
W  OE 4.19.36-vhulk1907.1.0.h410.eulerosv2r8.aarch64 #1
[183047.687015] Hardware name: N/A N/A/Kunpeng Desktop Board D920S10,
BIOS 0.15 10/22/2019
[183047.695007] Workqueue: 0000:74:02.0_disco_q sas_discover_domain
[183047.700999] pstate: 20c00009 (nzCv daif +PAN +UAO)
[183047.705864] pc : prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
[183047.711510] lr : prep_ata_v3_hw+0xb0/0x230 [hisi_sas_v3_hw]
[183047.717153] sp : ffff00000f28ba60
[183047.720541] x29: ffff00000f28ba60 x28: ffff8026852d7228
[183047.725925] x27: ffff8027dba3e0a8 x26: ffff8027c05fc200
[183047.731310] x25: 0000000000000000 x24: ffff8026bafa8dc0
[183047.736695] x23: ffff8027c05fc218 x22: ffff8026852d7228
[183047.742079] x21: ffff80007c2f2940 x20: ffff8027c05fc200
[183047.747464] x19: 0000000000f80800 x18: 0000000000000010
[183047.752848] x17: 0000000000000000 x16: 0000000000000000
[183047.758232] x15: ffff000089a5a4ff x14: 0000000000000005
[183047.763617] x13: ffff000009a5a50e x12: ffff8026bafa1e20
[183047.769001] x11: ffff0000087453b8 x10: ffff00000f28b870
[183047.774385] x9 : 0000000000000000 x8 : ffff80007e58f9b0
[183047.779770] x7 : 0000000000000000 x6 : 000000000000003f
[183047.785154] x5 : 0000000000000040 x4 : ffffffffffffffe0
[183047.790538] x3 : 00000000000000f8 x2 : 0000000002000007
[183047.795922] x1 : 0000000000000008 x0 : 0000000000000000
[183047.801307] Call trace:
[183047.803827]  prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
[183047.809127]  hisi_sas_task_prep+0x750/0x888 [hisi_sas_main]
[183047.814773]  hisi_sas_task_exec.isra.7+0x88/0x1f0 [hisi_sas_main]
[183047.820939]  hisi_sas_queue_command+0x28/0x38 [hisi_sas_main]
[183047.826757]  smp_execute_task_sg+0xec/0x218
[183047.831013]  smp_execute_task+0x74/0xa0
[183047.834921]  sas_discover_expander.part.7+0x9c/0x5f8
[183047.839959]  sas_discover_root_expander+0x90/0x160
[183047.844822]  sas_discover_domain+0x1b8/0x1e8
[183047.849164]  process_one_work+0x1b4/0x3f8
[183047.853246]  worker_thread+0x54/0x470
[183047.856981]  kthread+0x134/0x138
[183047.860283]  ret_from_fork+0x10/0x18
[183047.863931] Code: f9407a80 528000e2 39409281 72a04002 (b9405800)
[183047.870097] kernel fault(0x1) notification starting on CPU 0
[183047.875828] kernel fault(0x1) notification finished on CPU 0
[183047.881559] Modules linked in: unibsp(OE) hns3(OE) hclge(OE)
hnae3(OE) mem_drv(OE) hisi_sas_v3_hw(OE) hisi_sas_main(OE)
[183047.892418] ---[ end trace 4cc26083fc11b783  ]---
[183047.897107] Kernel panic - not syncing: Fatal exception
[183047.902403] kernel fault(0x5) notification starting on CPU 0
[183047.908134] kernel fault(0x5) notification finished on CPU 0
[183047.913865] SMP: stopping secondary CPUs
[183047.917861] Kernel Offset: disabled
[183047.921422] CPU features: 0x2,a2a00a38
[183047.925243] Memory Limit: none
[183047.928372] kernel reboot(0x2) notification starting on CPU 0
[183047.934190] kernel reboot(0x2) notification finished on CPU 0
[183047.940008] ---[ end Kernel panic - not syncing: Fatal exception
]---

Fixes: 2908d778ab3e ("[SCSI] aic94xx: new driver")
Reported-by: Gao Chuan <gaochuan4@huawei.com>
Signed-off-by: Jason Yan <yanaijie@huawei.com>
---
 drivers/scsi/libsas/sas_discover.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Comments

John Garry Nov. 29, 2019, 1:58 p.m. UTC | #1
On 29/11/2019 03:24, yanaijie wrote:
> The discovering of sas port is driving by workqueue in libsas. When
> libsas is processing port events or phy events in workqueue, new evnets

/s/evnets/events/

> may rise up and change the state of some structures such as asd_sas_phy.
> This may cause some problems such as follows:
> 
> ==>thread 1                       ==>thread 2
> 
>                                    ==>phy up
>                                    ==>phy_up_v3_hw()
>                                      ==>oob_mode = SATA_OOB_MODE;
>                                    ==>phy donw quickly

down

>                                    ==>hisi_sas_phy_down()
>                                      ==>sas_ha->notify_phy_event()
>                                      ==>sas_phy_disconnected()
>                                        ==>oob_mode = OOB_NOT_CONNECTED
> ==>workqueue wakeup
> ==>sas_form_port()
>    ==>sas_discover_domain()
>      ==>sas_get_port_device()
>        ==>oob_mode is OOB_NOT_CONNECTED and device
>           is wrongly taken as expander
> 
> This at last lead to the panic when libsas trying to issue a command to
> discover the device.
> 
> [183047.614035] Unable to handle kernel NULL pointer dereference at
> virtual address 0000000000000058
> [183047.622896] Mem abort info:
> [183047.625762]   ESR = 0x96000004
> [183047.628893]   Exception class = DABT (current EL), IL = 32 bits
> [183047.634888]   SET = 0, FnV = 0
> [183047.638015]   EA = 0, S1PTW = 0
> [183047.641232] Data abort info:
> [183047.644189]   ISV = 0, ISS = 0x00000004
> [183047.648100]   CM = 0, WnR = 0
> [183047.651145] user pgtable: 4k pages, 48-bit VAs, pgdp =
> 00000000b7df67be
> [183047.657834] [0000000000000058] pgd=0000000000000000
> [183047.662789] Internal error: Oops: 96000004 [#1] SMP
> [183047.667740] Process kworker/u16:2 (pid: 31291, stack limit =
> 0x00000000417c4974)
> [183047.675208] CPU: 0 PID: 3291 Comm: kworker/u16:2 Tainted: G
> W  OE 4.19.36-vhulk1907.1.0.h410.eulerosv2r8.aarch64 #1
> [183047.687015] Hardware name: N/A N/A/Kunpeng Desktop Board D920S10,
> BIOS 0.15 10/22/2019
> [183047.695007] Workqueue: 0000:74:02.0_disco_q sas_discover_domain
> [183047.700999] pstate: 20c00009 (nzCv daif +PAN +UAO)
> [183047.705864] pc : prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]

Can you explain how we discover an expander device yet try to send an 
ATA command?

> [183047.711510] lr : prep_ata_v3_hw+0xb0/0x230 [hisi_sas_v3_hw]
> [183047.717153] sp : ffff00000f28ba60
> [183047.720541] x29: ffff00000f28ba60 x28: ffff8026852d7228
> [183047.725925] x27: ffff8027dba3e0a8 x26: ffff8027c05fc200
> [183047.731310] x25: 0000000000000000 x24: ffff8026bafa8dc0
> [183047.736695] x23: ffff8027c05fc218 x22: ffff8026852d7228
> [183047.742079] x21: ffff80007c2f2940 x20: ffff8027c05fc200
> [183047.747464] x19: 0000000000f80800 x18: 0000000000000010
> [183047.752848] x17: 0000000000000000 x16: 0000000000000000
> [183047.758232] x15: ffff000089a5a4ff x14: 0000000000000005
> [183047.763617] x13: ffff000009a5a50e x12: ffff8026bafa1e20
> [183047.769001] x11: ffff0000087453b8 x10: ffff00000f28b870
> [183047.774385] x9 : 0000000000000000 x8 : ffff80007e58f9b0
> [183047.779770] x7 : 0000000000000000 x6 : 000000000000003f
> [183047.785154] x5 : 0000000000000040 x4 : ffffffffffffffe0
> [183047.790538] x3 : 00000000000000f8 x2 : 0000000002000007
> [183047.795922] x1 : 0000000000000008 x0 : 0000000000000000
> [183047.801307] Call trace:
> [183047.803827]  prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
> [183047.809127]  hisi_sas_task_prep+0x750/0x888 [hisi_sas_main]
> [183047.814773]  hisi_sas_task_exec.isra.7+0x88/0x1f0 [hisi_sas_main]
> [183047.820939]  hisi_sas_queue_command+0x28/0x38 [hisi_sas_main]
> [183047.826757]  smp_execute_task_sg+0xec/0x218
> [183047.831013]  smp_execute_task+0x74/0xa0
> [183047.834921]  sas_discover_expander.part.7+0x9c/0x5f8
> [183047.839959]  sas_discover_root_expander+0x90/0x160
> [183047.844822]  sas_discover_domain+0x1b8/0x1e8
> [183047.849164]  process_one_work+0x1b4/0x3f8
> [183047.853246]  worker_thread+0x54/0x470
> [183047.856981]  kthread+0x134/0x138
> [183047.860283]  ret_from_fork+0x10/0x18
> [183047.863931] Code: f9407a80 528000e2 39409281 72a04002 (b9405800)
> [183047.870097] kernel fault(0x1) notification starting on CPU 0
> [183047.875828] kernel fault(0x1) notification finished on CPU 0
> [183047.881559] Modules linked in: unibsp(OE) hns3(OE) hclge(OE)
> hnae3(OE) mem_drv(OE) hisi_sas_v3_hw(OE) hisi_sas_main(OE)
> [183047.892418] ---[ end trace 4cc26083fc11b783  ]---
> [183047.897107] Kernel panic - not syncing: Fatal exception
> [183047.902403] kernel fault(0x5) notification starting on CPU 0
> [183047.908134] kernel fault(0x5) notification finished on CPU 0
> [183047.913865] SMP: stopping secondary CPUs
> [183047.917861] Kernel Offset: disabled
> [183047.921422] CPU features: 0x2,a2a00a38
> [183047.925243] Memory Limit: none
> [183047.928372] kernel reboot(0x2) notification starting on CPU 0
> [183047.934190] kernel reboot(0x2) notification finished on CPU 0
> [183047.940008] ---[ end Kernel panic - not syncing: Fatal exception
> ]---
> 
> Fixes: 2908d778ab3e ("[SCSI] aic94xx: new driver")

I am not sure if it is appropriate to identify this as what we're 
fixing, since we changed significantly the event processing in libsas 
since 2908d778ab3e?

> Reported-by: Gao Chuan <gaochuan4@huawei.com>
> Signed-off-by: Jason Yan <yanaijie@huawei.com>
> ---
>   drivers/scsi/libsas/sas_discover.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
> index f47b4b281b14..23fdbc8fa05a 100644
> --- a/drivers/scsi/libsas/sas_discover.c
> +++ b/drivers/scsi/libsas/sas_discover.c
> @@ -81,12 +81,18 @@ static int sas_get_port_device(struct asd_sas_port *port)
>   		else
>   			dev->dev_type = SAS_SATA_DEV;
>   		dev->tproto = SAS_PROTOCOL_SATA;
> -	} else {
> +	} else if (port->oob_mode == SAS_OOB_MODE) {

This just looks racy. We're sending an event and then checking some 
volatile shared memory in the event processing :(

>   		struct sas_identify_frame *id =
>   			(struct sas_identify_frame *) dev->frame_rcvd;
>   		dev->dev_type = id->dev_type;
>   		dev->iproto = id->initiator_bits;
>   		dev->tproto = id->target_bits;
> +	} else {
> +		/* If the oob mode is OOB_NOT_CONNECTED, the port is
> +		 * disconnected due to race with PHY down. We cannot
> +		 * continue to discover this port */
> +		sas_put_device(dev);
> +		return rc;
>   	}
>   
>   	sas_init_dev(dev);
>
Jason Yan Nov. 30, 2019, 2:18 a.m. UTC | #2
在 2019/11/29 21:58, John Garry 写道:
> On 29/11/2019 03:24, yanaijie wrote:
>> The discovering of sas port is driving by workqueue in libsas. When
>> libsas is processing port events or phy events in workqueue, new evnets
> 
> /s/evnets/events/
> 
>> may rise up and change the state of some structures such as asd_sas_phy.
>> This may cause some problems such as follows:
>>
>> ==>thread 1                       ==>thread 2
>>
>>                                    ==>phy up
>>                                    ==>phy_up_v3_hw()
>>                                      ==>oob_mode = SATA_OOB_MODE;
>>                                    ==>phy donw quickly
> 
> down
> 
>>                                    ==>hisi_sas_phy_down()
>>                                      ==>sas_ha->notify_phy_event()
>>                                      ==>sas_phy_disconnected()
>>                                        ==>oob_mode = OOB_NOT_CONNECTED
>> ==>workqueue wakeup
>> ==>sas_form_port()
>>    ==>sas_discover_domain()
>>      ==>sas_get_port_device()
>>        ==>oob_mode is OOB_NOT_CONNECTED and device
>>           is wrongly taken as expander
>>
>> This at last lead to the panic when libsas trying to issue a command to
>> discover the device.
>>
>> [183047.614035] Unable to handle kernel NULL pointer dereference at
>> virtual address 0000000000000058
>> [183047.622896] Mem abort info:
>> [183047.625762]   ESR = 0x96000004
>> [183047.628893]   Exception class = DABT (current EL), IL = 32 bits
>> [183047.634888]   SET = 0, FnV = 0
>> [183047.638015]   EA = 0, S1PTW = 0
>> [183047.641232] Data abort info:
>> [183047.644189]   ISV = 0, ISS = 0x00000004
>> [183047.648100]   CM = 0, WnR = 0
>> [183047.651145] user pgtable: 4k pages, 48-bit VAs, pgdp =
>> 00000000b7df67be
>> [183047.657834] [0000000000000058] pgd=0000000000000000
>> [183047.662789] Internal error: Oops: 96000004 [#1] SMP
>> [183047.667740] Process kworker/u16:2 (pid: 31291, stack limit =
>> 0x00000000417c4974)
>> [183047.675208] CPU: 0 PID: 3291 Comm: kworker/u16:2 Tainted: G
>> W  OE 4.19.36-vhulk1907.1.0.h410.eulerosv2r8.aarch64 #1
>> [183047.687015] Hardware name: N/A N/A/Kunpeng Desktop Board D920S10,
>> BIOS 0.15 10/22/2019
>> [183047.695007] Workqueue: 0000:74:02.0_disco_q sas_discover_domain
>> [183047.700999] pstate: 20c00009 (nzCv daif +PAN +UAO)
>> [183047.705864] pc : prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
> 
> Can you explain how we discover an expander device yet try to send an 
> ATA command?

In sas_get_port_device() when oob mode is not SATA_OOB_MODE, we get into 
the branch that use the sas_identify_frame instead of dev_to_host_fis to 
determine the device type and protocal, which is a totally wrong 
structure. And the dev_type is expander, but the tproto is 
SAS_PROTOCOL_SATA or SAS_PROTOCOL_STP in this case:



	if (dev->frame_rcvd[0] == 0x34 && port->oob_mode == SATA_OOB_MODE) {
		struct dev_to_host_fis *fis =
			(struct dev_to_host_fis *) dev->frame_rcvd;
		if (fis->interrupt_reason == 1 && fis->lbal == 1 &&
		    fis->byte_count_low==0x69 && fis->byte_count_high == 0x96
		    && (fis->device & ~0x10) == 0)
			dev->dev_type = SAS_SATA_PM;
		else
			dev->dev_type = SAS_SATA_DEV;
		dev->tproto = SAS_PROTOCOL_SATA;
	} else {
		struct sas_identify_frame *id =
			(struct sas_identify_frame *) dev->frame_rcvd;
		dev->dev_type = id->dev_type;
		dev->iproto = id->initiator_bits;
		dev->tproto = id->target_bits;
	}


> 
>> [183047.711510] lr : prep_ata_v3_hw+0xb0/0x230 [hisi_sas_v3_hw]
>> [183047.717153] sp : ffff00000f28ba60
>> [183047.720541] x29: ffff00000f28ba60 x28: ffff8026852d7228
>> [183047.725925] x27: ffff8027dba3e0a8 x26: ffff8027c05fc200
>> [183047.731310] x25: 0000000000000000 x24: ffff8026bafa8dc0
>> [183047.736695] x23: ffff8027c05fc218 x22: ffff8026852d7228
>> [183047.742079] x21: ffff80007c2f2940 x20: ffff8027c05fc200
>> [183047.747464] x19: 0000000000f80800 x18: 0000000000000010
>> [183047.752848] x17: 0000000000000000 x16: 0000000000000000
>> [183047.758232] x15: ffff000089a5a4ff x14: 0000000000000005
>> [183047.763617] x13: ffff000009a5a50e x12: ffff8026bafa1e20
>> [183047.769001] x11: ffff0000087453b8 x10: ffff00000f28b870
>> [183047.774385] x9 : 0000000000000000 x8 : ffff80007e58f9b0
>> [183047.779770] x7 : 0000000000000000 x6 : 000000000000003f
>> [183047.785154] x5 : 0000000000000040 x4 : ffffffffffffffe0
>> [183047.790538] x3 : 00000000000000f8 x2 : 0000000002000007
>> [183047.795922] x1 : 0000000000000008 x0 : 0000000000000000
>> [183047.801307] Call trace:
>> [183047.803827]  prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
>> [183047.809127]  hisi_sas_task_prep+0x750/0x888 [hisi_sas_main]
>> [183047.814773]  hisi_sas_task_exec.isra.7+0x88/0x1f0 [hisi_sas_main]
>> [183047.820939]  hisi_sas_queue_command+0x28/0x38 [hisi_sas_main]
>> [183047.826757]  smp_execute_task_sg+0xec/0x218
>> [183047.831013]  smp_execute_task+0x74/0xa0
>> [183047.834921]  sas_discover_expander.part.7+0x9c/0x5f8
>> [183047.839959]  sas_discover_root_expander+0x90/0x160
>> [183047.844822]  sas_discover_domain+0x1b8/0x1e8
>> [183047.849164]  process_one_work+0x1b4/0x3f8
>> [183047.853246]  worker_thread+0x54/0x470
>> [183047.856981]  kthread+0x134/0x138
>> [183047.860283]  ret_from_fork+0x10/0x18
>> [183047.863931] Code: f9407a80 528000e2 39409281 72a04002 (b9405800)
>> [183047.870097] kernel fault(0x1) notification starting on CPU 0
>> [183047.875828] kernel fault(0x1) notification finished on CPU 0
>> [183047.881559] Modules linked in: unibsp(OE) hns3(OE) hclge(OE)
>> hnae3(OE) mem_drv(OE) hisi_sas_v3_hw(OE) hisi_sas_main(OE)
>> [183047.892418] ---[ end trace 4cc26083fc11b783  ]---
>> [183047.897107] Kernel panic - not syncing: Fatal exception
>> [183047.902403] kernel fault(0x5) notification starting on CPU 0
>> [183047.908134] kernel fault(0x5) notification finished on CPU 0
>> [183047.913865] SMP: stopping secondary CPUs
>> [183047.917861] Kernel Offset: disabled
>> [183047.921422] CPU features: 0x2,a2a00a38
>> [183047.925243] Memory Limit: none
>> [183047.928372] kernel reboot(0x2) notification starting on CPU 0
>> [183047.934190] kernel reboot(0x2) notification finished on CPU 0
>> [183047.940008] ---[ end Kernel panic - not syncing: Fatal exception
>> ]---
>>
>> Fixes: 2908d778ab3e ("[SCSI] aic94xx: new driver")
> 
> I am not sure if it is appropriate to identify this as what we're 
> fixing, since we changed significantly the event processing in libsas 
> since 2908d778ab3e?
> 

It's ok. Even before 2908d778ab3e we have this issue too. The workqueue 
process is always late and far after the interrupt.

>> Reported-by: Gao Chuan <gaochuan4@huawei.com>
>> Signed-off-by: Jason Yan <yanaijie@huawei.com>
>> ---
>>   drivers/scsi/libsas/sas_discover.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/libsas/sas_discover.c 
>> b/drivers/scsi/libsas/sas_discover.c
>> index f47b4b281b14..23fdbc8fa05a 100644
>> --- a/drivers/scsi/libsas/sas_discover.c
>> +++ b/drivers/scsi/libsas/sas_discover.c
>> @@ -81,12 +81,18 @@ static int sas_get_port_device(struct asd_sas_port 
>> *port)
>>           else
>>               dev->dev_type = SAS_SATA_DEV;
>>           dev->tproto = SAS_PROTOCOL_SATA;
>> -    } else {
>> +    } else if (port->oob_mode == SAS_OOB_MODE) {
> 
> This just looks racy. We're sending an event and then checking some 
> volatile shared memory in the event processing :(

It's not me doing this. I add this here because the branch above which 
already doing this.

And It's true that it is racy. libsas is doing this all the time, not 
only here. The LLDD and libsas share many structures such as 
asd_sas_port, asd_sas_phy and so on and they access these structures in 
deferent contexts. This is a wrong design.

Or we can delete the
"port->oob_mode == SATA_OOB_MODE" above to fix this issue?

Becuase "dev->frame_rcvd[0] == 0x34" is enough to determine that if this 
is a "dev_to_host_fis" or a "sas_identify_frame":


	if (dev->frame_rcvd[0] == 0x34) {
		struct dev_to_host_fis *fis =
			(struct dev_to_host_fis *) dev->frame_rcvd;
		if (fis->interrupt_reason == 1 && fis->lbal == 1 &&
		    fis->byte_count_low==0x69 && fis->byte_count_high == 0x96
		    && (fis->device & ~0x10) == 0)
			dev->dev_type = SAS_SATA_PM;
		else
			dev->dev_type = SAS_SATA_DEV;
		dev->tproto = SAS_PROTOCOL_SATA;
	} else {
		struct sas_identify_frame *id =
			(struct sas_identify_frame *) dev->frame_rcvd;
		dev->dev_type = id->dev_type;
		dev->iproto = id->initiator_bits;
		dev->tproto = id->target_bits;
	}

> 
>>           struct sas_identify_frame *id =
>>               (struct sas_identify_frame *) dev->frame_rcvd;
>>           dev->dev_type = id->dev_type;
>>           dev->iproto = id->initiator_bits;
>>           dev->tproto = id->target_bits;
>> +    } else {
>> +        /* If the oob mode is OOB_NOT_CONNECTED, the port is
>> +         * disconnected due to race with PHY down. We cannot
>> +         * continue to discover this port */
>> +        sas_put_device(dev);
>> +        return rc;
>>       }
>>       sas_init_dev(dev);
>>
> 
> 
> .
John Garry Dec. 2, 2019, 11:26 a.m. UTC | #3
>>> 0x00000000417c4974)
>>> [183047.675208] CPU: 0 PID: 3291 Comm: kworker/u16:2 Tainted: G
>>> W  OE 4.19.36-vhulk1907.1.0.h410.eulerosv2r8.aarch64 #1
>>> [183047.687015] Hardware name: N/A N/A/Kunpeng Desktop Board D920S10,
>>> BIOS 0.15 10/22/2019
>>> [183047.695007] Workqueue: 0000:74:02.0_disco_q sas_discover_domain
>>> [183047.700999] pstate: 20c00009 (nzCv daif +PAN +UAO)
>>> [183047.705864] pc : prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
>>
>> Can you explain how we discover an expander device yet try to send an
>> ATA command?
> 
> In sas_get_port_device() when oob mode is not SATA_OOB_MODE, we get into
> the branch that use the sas_identify_frame instead of dev_to_host_fis to
> determine the device type and protocal, which is a totally wrong
> structure. And the dev_type is expander, but the tproto is
> SAS_PROTOCOL_SATA or SAS_PROTOCOL_STP in this case:
> 
> 
> 
> 	if (dev->frame_rcvd[0] == 0x34 && port->oob_mode == SATA_OOB_MODE) {
> 		struct dev_to_host_fis *fis =
> 			(struct dev_to_host_fis *) dev->frame_rcvd;
> 		if (fis->interrupt_reason == 1 && fis->lbal == 1 &&
> 		    fis->byte_count_low==0x69 && fis->byte_count_high == 0x96
> 		    && (fis->device & ~0x10) == 0)
> 			dev->dev_type = SAS_SATA_PM;
> 		else
> 			dev->dev_type = SAS_SATA_DEV;
> 		dev->tproto = SAS_PROTOCOL_SATA;
> 	} else {
> 		struct sas_identify_frame *id =
> 			(struct sas_identify_frame *) dev->frame_rcvd;
> 		dev->dev_type = id->dev_type;
> 		dev->iproto = id->initiator_bits;
> 		dev->tproto = id->target_bits;
> 	}

Ah, yes, I get it.

> 
> 
>>
>>> [183047.711510] lr : prep_ata_v3_hw+0xb0/0x230 [hisi_sas_v3_hw]
>>> [183047.717153] sp : ffff00000f28ba60
>>> [183047.720541] x29: ffff00000f28ba60 x28: ffff8026852d7228
>>> [183047.725925] x27: ffff8027dba3e0a8 x26: ffff8027c05fc200
>>> [183047.731310] x25: 0000000000000000 x24: ffff8026bafa8dc0
>>> [183047.736695] x23: ffff8027c05fc218 x22: ffff8026852d7228
>>> [183047.742079] x21: ffff80007c2f2940 x20: ffff8027c05fc200
>>> [183047.747464] x19: 0000000000f80800 x18: 0000000000000010
>>> [183047.752848] x17: 0000000000000000 x16: 0000000000000000
>>> [183047.758232] x15: ffff000089a5a4ff x14: 0000000000000005
>>> [183047.763617] x13: ffff000009a5a50e x12: ffff8026bafa1e20
>>> [183047.769001] x11: ffff0000087453b8 x10: ffff00000f28b870
>>> [183047.774385] x9 : 0000000000000000 x8 : ffff80007e58f9b0
>>> [183047.779770] x7 : 0000000000000000 x6 : 000000000000003f
>>> [183047.785154] x5 : 0000000000000040 x4 : ffffffffffffffe0
>>> [183047.790538] x3 : 00000000000000f8 x2 : 0000000002000007
>>> [183047.795922] x1 : 0000000000000008 x0 : 0000000000000000
>>> [183047.801307] Call trace:

is there any guard which we should add in the LLDD for this also?

>>> [183047.803827]  prep_ata_v3_hw+0xf8/0x230 [hisi_sas_v3_hw]
>>> [183047.809127]  hisi_sas_task_prep+0x750/0x888 [hisi_sas_main]
>>> [183047.814773]  hisi_sas_task_exec.isra.7+0x88/0x1f0 [hisi_sas_main]
>>> [183047.820939]  hisi_sas_queue_command+0x28/0x38 [hisi_sas_main]
>>> [183047.826757]  smp_execute_task_sg+0xec/0x218
>>> [183047.831013]  smp_execute_task+0x74/0xa0
>>> [183047.834921]  sas_discover_expander.part.7+0x9c/0x5f8
>>> [183047.839959]  sas_discover_root_expander+0x90/0x160
>>> [183047.844822]  sas_discover_domain+0x1b8/0x1e8
>>> [183047.849164]  process_one_work+0x1b4/0x3f8
>>> [183047.853246]  worker_thread+0x54/0x470
>>> [183047.856981]  kthread+0x134/0x138
>>> [183047.860283]  ret_from_fork+0x10/0x18
>>> [183047.863931] Code: f9407a80 528000e2 39409281 72a04002 (b9405800)
>>> [183047.870097] kernel fault(0x1) notification starting on CPU 0
>>> [183047.875828] kernel fault(0x1) notification finished on CPU 0
>>> [183047.881559] Modules linked in: unibsp(OE) hns3(OE) hclge(OE)
>>> hnae3(OE) mem_drv(OE) hisi_sas_v3_hw(OE) hisi_sas_main(OE)
>>> [183047.892418] ---[ end trace 4cc26083fc11b783  ]---
>>> [183047.897107] Kernel panic - not syncing: Fatal exception
>>> [183047.902403] kernel fault(0x5) notification starting on CPU 0
>>> [183047.908134] kernel fault(0x5) notification finished on CPU 0
>>> [183047.913865] SMP: stopping secondary CPUs
>>> [183047.917861] Kernel Offset: disabled
>>> [183047.921422] CPU features: 0x2,a2a00a38
>>> [183047.925243] Memory Limit: none
>>> [183047.928372] kernel reboot(0x2) notification starting on CPU 0
>>> [183047.934190] kernel reboot(0x2) notification finished on CPU 0
>>> [183047.940008] ---[ end Kernel panic - not syncing: Fatal exception
>>> ]---
>>>
>>> Fixes: 2908d778ab3e ("[SCSI] aic94xx: new driver")
>>
>> I am not sure if it is appropriate to identify this as what we're
>> fixing, since we changed significantly the event processing in libsas
>> since 2908d778ab3e?
>>
> 
> It's ok. Even before 2908d778ab3e we have this issue too. The workqueue
> process is always late and far after the interrupt.
> 
>>> Reported-by: Gao Chuan <gaochuan4@huawei.com>
>>> Signed-off-by: Jason Yan <yanaijie@huawei.com>
>>> ---
>>>    drivers/scsi/libsas/sas_discover.c | 8 +++++++-
>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/scsi/libsas/sas_discover.c
>>> b/drivers/scsi/libsas/sas_discover.c
>>> index f47b4b281b14..23fdbc8fa05a 100644
>>> --- a/drivers/scsi/libsas/sas_discover.c
>>> +++ b/drivers/scsi/libsas/sas_discover.c
>>> @@ -81,12 +81,18 @@ static int sas_get_port_device(struct asd_sas_port
>>> *port)
>>>            else
>>>                dev->dev_type = SAS_SATA_DEV;
>>>            dev->tproto = SAS_PROTOCOL_SATA;
>>> -    } else {
>>> +    } else if (port->oob_mode == SAS_OOB_MODE) {
>>
>> This just looks racy. We're sending an event and then checking some
>> volatile shared memory in the event processing :(
> 
> It's not me doing this. I add this here because the branch above which
> already doing this.

Yes, right, I was just commenting on the current code.

> 
> And It's true that it is racy. libsas is doing this all the time, not
> only here. The LLDD and libsas share many structures such as
> asd_sas_port, asd_sas_phy and so on and they access these structures in
> deferent contexts. This is a wrong design.

But I didn't think any were as bad as this.

>  > Or we can delete the
> "port->oob_mode == SATA_OOB_MODE" above to fix this issue?
> 
> Becuase "dev->frame_rcvd[0] == 0x34" is enough to determine that if this
> is a "dev_to_host_fis" or a "sas_identify_frame":
> 
> 
> 	if (dev->frame_rcvd[0] == 0x34) {
> 		struct dev_to_host_fis *fis =
> 			(struct dev_to_host_fis *) dev->frame_rcvd;
> 		if (fis->interrupt_reason == 1 && fis->lbal == 1 &&
> 		    fis->byte_count_low==0x69 && fis->byte_count_high == 0x96
> 		    && (fis->device & ~0x10) == 0)
> 			dev->dev_type = SAS_SATA_PM;
> 		else
> 			dev->dev_type = SAS_SATA_DEV;
> 		dev->tproto = SAS_PROTOCOL_SATA;
> 	} else {
> 		struct sas_identify_frame *id =
> 			(struct sas_identify_frame *) dev->frame_rcvd;

nit: we could add a add newline. I know other code does not follow this 
rule, but I see no reason to continue doing it.


> 		dev->dev_type = id->dev_type;
> 		dev->iproto = id->initiator_bits;
> 		dev->tproto = id->target_bits;
> 	}
> 
>>
>>>            struct sas_identify_frame *id =
>>>                (struct sas_identify_frame *) dev->frame_rcvd;
>>>            dev->dev_type = id->dev_type;
>>>            dev->iproto = id->initiator_bits;
>>>            dev->tproto = id->target_bits;
>>> +    } else {
>>> +        /* If the oob mode is OOB_NOT_CONNECTED, the port is
>>> +         * disconnected due to race with PHY down. We cannot
>>> +         * continue to discover this port */
>>> +        sas_put_device(dev);
>>> +        return rc;

So your fix in the original patch looks ok. I would prefer if we could 
stop using rc or remove it, but that's just a slight preference.

Another thought is to add some print to warn/inform of this.

>>>        }
>>>        sas_init_dev(dev);
>>>
>>
>>
>> .
> 


thanks,
John
diff mbox series

Patch

diff --git a/drivers/scsi/libsas/sas_discover.c b/drivers/scsi/libsas/sas_discover.c
index f47b4b281b14..23fdbc8fa05a 100644
--- a/drivers/scsi/libsas/sas_discover.c
+++ b/drivers/scsi/libsas/sas_discover.c
@@ -81,12 +81,18 @@  static int sas_get_port_device(struct asd_sas_port *port)
 		else
 			dev->dev_type = SAS_SATA_DEV;
 		dev->tproto = SAS_PROTOCOL_SATA;
-	} else {
+	} else if (port->oob_mode == SAS_OOB_MODE) {
 		struct sas_identify_frame *id =
 			(struct sas_identify_frame *) dev->frame_rcvd;
 		dev->dev_type = id->dev_type;
 		dev->iproto = id->initiator_bits;
 		dev->tproto = id->target_bits;
+	} else {
+		/* If the oob mode is OOB_NOT_CONNECTED, the port is
+		 * disconnected due to race with PHY down. We cannot
+		 * continue to discover this port */
+		sas_put_device(dev);
+		return rc;
 	}
 
 	sas_init_dev(dev);