mbox series

[00/20] libsas and pm8001 fixes

Message ID 20220210114218.632725-1-damien.lemoal@opensource.wdc.com (mailing list archive)
Headers show
Series libsas and pm8001 fixes | expand

Message

Damien Le Moal Feb. 10, 2022, 11:41 a.m. UTC
The first 3 patches fix a problem with libsas handling of NCQ NON DATA
commands and simplify libsas code in a couple of places.
The remaining patches are fixes for the pm8001 driver:
* All sparse warnings are addressed, fixing along the way many le32
  handling bugs for big-endian architectures
* Fix handling of NCQ NON DATA commands
* Fix NCQ error recovery (abort all task execution) that was causing a
  crash
* Simplify the code in many places

With these fixes, libzbc test suite passes all test case. This test
suite was used with an SMR drive for testing because it generates many
NCQ NON DATA commands (for zone management commands) and also generates
many NCQ command errors to check ASC/ASCQ returned by the device. With
the test suite, the error recovery path was extensively exercised.

Note that without these patches, libzbc test suite result in the
controller hanging, or in kernel crashes.

Damien Le Moal (20):
  scsi: libsas: fix sas_ata_qc_issue() handling of NCQ NON DATA commands
  scsi: libsas: simplify sas_ata_qc_issue() detection of NCQ commands
  scsi: libsas: Remove unnecessary initialization in sas_ata_qc_issue()
  scsi: pm8001: fix __iomem pointer use in pm8001_phy_control()
  scsi: pm8001: Remove local variable in pm8001_pci_resume()
  scsi: pm8001: Fix pm8001_update_flash() local variable type
  scsi: pm8001: Fix command initialization in pm80XX_send_read_log()
  scsi: pm8001: Fix local variable declaration in pm80xx_pci_mem_copy()
  scsi: pm8001: Fix command initialization in pm8001_chip_ssp_tm_req()
  scsi: pm8001: fix payload initialization in
    pm80xx_set_thermal_config()
  scsi: pm8001: fix le32 values handling in
    pm80xx_set_sas_protocol_timer_config()
  scsi: pm8001: fix payload initialization in pm80xx_encrypt_update()
  scsi: pm8001: fix le32 values handling in pm80xx_chip_ssp_io_req()
  scsi: pm8001: fix le32 values handling in pm80xx_chip_sata_req()
  scsi: pm8001: fix use of struct set_phy_profile_req fields
  scsi: pm8001: simplify pm8001_get_ncq_tag()
  scsi: pm8001: fix NCQ NON DATA command task initialization
  scsi: pm8001: fix NCQ NON DATA command completion handling
  scsi: pm8001: cleanup pm8001_queue_command()
  scsi: pm8001: fix abort all task initialization

 drivers/scsi/libsas/sas_ata.c     |  12 +-
 drivers/scsi/pm8001/pm8001_ctl.c  |   5 +-
 drivers/scsi/pm8001/pm8001_hwi.c  |  23 +--
 drivers/scsi/pm8001/pm8001_init.c |   8 +-
 drivers/scsi/pm8001/pm8001_sas.c  |  48 +++----
 drivers/scsi/pm8001/pm80xx_hwi.c  | 226 ++++++++++++++++--------------
 drivers/scsi/pm8001/pm80xx_hwi.h  |   2 +-
 7 files changed, 169 insertions(+), 155 deletions(-)

Comments

John Garry Feb. 10, 2022, 3:35 p.m. UTC | #1
On 10/02/2022 11:41, Damien Le Moal wrote:
> The first 3 patches fix a problem with libsas handling of NCQ NON DATA
> commands and simplify libsas code in a couple of places.
> The remaining patches are fixes for the pm8001 driver:
> * All sparse warnings are addressed, fixing along the way many le32
>    handling bugs for big-endian architectures
> * Fix handling of NCQ NON DATA commands
> * Fix NCQ error recovery (abort all task execution) that was causing a
>    crash
> * Simplify the code in many places
> 
> With these fixes, libzbc test suite passes all test case. This test
> suite was used with an SMR drive for testing because it generates many
> NCQ NON DATA commands (for zone management commands) and also generates
> many NCQ command errors to check ASC/ASCQ returned by the device. With
> the test suite, the error recovery path was extensively exercised.
> 
> Note that without these patches, libzbc test suite result in the
> controller hanging, or in kernel crashes.

Unfortunately I still see the hang on my arm64 system with this series :(

Thanks,
John
Damien Le Moal Feb. 10, 2022, 10:44 p.m. UTC | #2
On 2/11/22 00:35, John Garry wrote:
> On 10/02/2022 11:41, Damien Le Moal wrote:
>> The first 3 patches fix a problem with libsas handling of NCQ NON DATA
>> commands and simplify libsas code in a couple of places.
>> The remaining patches are fixes for the pm8001 driver:
>> * All sparse warnings are addressed, fixing along the way many le32
>>    handling bugs for big-endian architectures
>> * Fix handling of NCQ NON DATA commands
>> * Fix NCQ error recovery (abort all task execution) that was causing a
>>    crash
>> * Simplify the code in many places
>>
>> With these fixes, libzbc test suite passes all test case. This test
>> suite was used with an SMR drive for testing because it generates many
>> NCQ NON DATA commands (for zone management commands) and also generates
>> many NCQ command errors to check ASC/ASCQ returned by the device. With
>> the test suite, the error recovery path was extensively exercised.
>>
>> Note that without these patches, libzbc test suite result in the
>> controller hanging, or in kernel crashes.
> 
> Unfortunately I still see the hang on my arm64 system with this series :(

That is unfortunate. Any particular command sequence triggering the hang
? Or is it random ? What workload are you running ?

> 
> Thanks,
> John
John Garry Feb. 11, 2022, 9:24 a.m. UTC | #3
On 10/02/2022 22:44, Damien Le Moal wrote:

Hi Damien,

>>> Note that without these patches, libzbc test suite result in the
>>> controller hanging, or in kernel crashes.
>> Unfortunately I still see the hang on my arm64 system with this series:(
> That is unfortunate. Any particular command sequence triggering the hang
> ? Or is it random ? What workload are you running ?
> 

mount/unmount fails mostly even after as few as one attempt, but then 
even fdisk -l fails sometimes:

root@(none)$ fdisk -l
[   97.924789] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
[   97.930652] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[   97.937149] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   97.943232] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   97.952020] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   97.958099] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   97.966881] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   97.972961] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   97.981737] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   97.991241] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.000752] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.010223] pm80xx0:: pm8001_abort_task  1345:rc= -5
[   98.015180] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[   98.021645] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.027725] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.036495] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.042574] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.051344] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.057423] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.066154] pm80xx: rc= -5
[   98.068854] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[   98.075331] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.081411] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.090192] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.096271] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.105053] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.111132] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.119909] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.129414] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.138919] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.148388] pm80xx0:: pm8001_abort_task  1345:rc= -5
[   98.153345] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[   98.159812] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.165892] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.174661] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.180741] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.189511] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.195590] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.204320] pm80xx: rc= -5
[   98.207019] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[   98.213497] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.219577] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.228359] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.234438] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.243220] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.249299] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.258075] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.267580] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.277085] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.286554] pm80xx0:: pm8001_abort_task  1345:rc= -5
[   98.291510] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[   98.297978] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.304059] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.312827] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.318906] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.327677] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.333756] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.342487] pm80xx: rc= -5
[   98.345186] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[   98.351664] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.357743] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.366525] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.372604] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.381386] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.387465] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.396230] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.405734] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.415239] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.424709] pm80xx0:: pm8001_abort_task  1345:rc= -5
[   98.429665] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[   98.436133] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.442213] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.450982] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.457061] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.465825] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.471904] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.480629] pm80xx: rc= -5
[   98.483327] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
[   98.489800] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.495880] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.504661] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.510740] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.519523] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.525602] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.534372] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.543877] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.553382] pm80xx0:: pm8001_mpi_task_abort_resp  3682:task abort 
failed status 0x6 ,tag = 0x2, scp= 0x0
[   98.562851] pm80xx0:: pm8001_abort_task  1345:rc= -5
[   98.567807] sas: sas_scsi_find_task: querying task 0x(____ptrval____)
[   98.574275] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.580355] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.589124] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.595203] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.603968] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
[   98.610047] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
Failure Drive:5000c500a7babc61
[   98.618778] pm80xx: rc= -5
[   98.621505] sas: --- Exit sas_scsi_recover_host: busy: 1 failed: 1 
tries: 1

Sometimes I get TMF timeouts, which is a bad situation. I guess it's a 
subtle driver bug, but where ....?

BTW, this following log needs removal/fixing at some stage by someone:

[   98.480629] pm80xx: rc= -5

It's from pm8001_query_task().

Thanks,
John
Damien Le Moal Feb. 11, 2022, 12:37 p.m. UTC | #4
On 2/11/22 18:24, John Garry wrote:
> On 10/02/2022 22:44, Damien Le Moal wrote:
> 
> Hi Damien,
> 
>>>> Note that without these patches, libzbc test suite result in the
>>>> controller hanging, or in kernel crashes.
>>> Unfortunately I still see the hang on my arm64 system with this series:(
>> That is unfortunate. Any particular command sequence triggering the hang
>> ? Or is it random ? What workload are you running ?
>>
> 
> mount/unmount fails mostly even after as few as one attempt, but then 
> even fdisk -l fails sometimes:

Try with patch 21 of my v2. It does fix a bug for scsi/sas case. That
problem would likely lead to a crash though, but never know...

> root@(none)$ fdisk -l
> [   97.924789] sas: Enter sas_scsi_recover_host busy: 1 failed: 1
> [   97.930652] sas: sas_scsi_find_task: aborting task 0x(____ptrval____)
> [   97.937149] pm80xx0:: mpi_ssp_completion  1937:sas IO status 0x3b
> [   97.943232] pm80xx0:: mpi_ssp_completion  1948:SAS Address of IO 
> Failure Drive:5000c500a7babc61
[...]
> 
> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a 
> subtle driver bug, but where ....?

What is the command failing ? Always the same ? Can you try adding scsi
trace to see the commands ?

If you are "lucky", it is always the same type of command like for the
NCQ NON DATA in my case. Though on mount, I would only expect a lot of
read commands and not much else. There may be some writes and a flush
too, so there will be "data" commands and "non data" commands. It may be
an issue with non-data commands too ?

> BTW, this following log needs removal/fixing at some stage by someone:
> 
> [   98.480629] pm80xx: rc= -5
> 
> It's from pm8001_query_task().
> 
> Thanks,
> John
John Garry Feb. 11, 2022, 1:08 p.m. UTC | #5
On 11/02/2022 12:37, Damien Le Moal wrote:

Hi Damien,

>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>> subtle driver bug, but where ....?
> What is the command failing ? Always the same ? Can you try adding scsi
> trace to see the commands ?

This is the same issue I have had since day #1.

Generally mount/unmount or even fdisk -l fails after booting into 
miniramfs. I wouldn't ever try to boot a distro.

> 
> If you are "lucky", it is always the same type of command like for the
> NCQ NON DATA in my case.

I'm just trying SAS disks to start with - so it's an SCSI READ command. 
SATA/STP support is generally never as robust for SAS HBAs (HW and LLD 
bugs are common - as this series is evidence) so I start on something 
more basic - however SATA/STP also has this issue.

The command is sent successfully but just never completes. Then 
sometimes the TMFs for error handling timeout and sometimes succeed. I 
don't have much to do on....

> Though on mount, I would only expect a lot of
> read commands and not much else.

Yes, and it is commonly the first SCSI read command which times out. It 
reliably breaks quite early. So I can think we can rule out issues like 
memory barriers/timing.

  There may be some writes and a flush
> too, so there will be "data" commands and "non data" commands. It may be
> an issue with non-data commands too ?
> 

Not sure on that. I guess it isn't.

Thanks,
John
Damien Le Moal Feb. 11, 2022, 1:14 p.m. UTC | #6
On 2/11/22 22:08, John Garry wrote:
> On 11/02/2022 12:37, Damien Le Moal wrote:
> 
> Hi Damien,
> 
>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>>> subtle driver bug, but where ....?
>> What is the command failing ? Always the same ? Can you try adding scsi
>> trace to see the commands ?
> 
> This is the same issue I have had since day #1.
> 
> Generally mount/unmount or even fdisk -l fails after booting into 
> miniramfs. I wouldn't ever try to boot a distro.

busybox ?

> 
>>
>> If you are "lucky", it is always the same type of command like for the
>> NCQ NON DATA in my case.
> 
> I'm just trying SAS disks to start with - so it's an SCSI READ command. 
> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD 
> bugs are common - as this series is evidence) so I start on something 
> more basic - however SATA/STP also has this issue.
> 
> The command is sent successfully but just never completes. Then 
> sometimes the TMFs for error handling timeout and sometimes succeed. I 
> don't have much to do on....

No SAS bus analyzer lying in a corner of the office ? :)
That could help...

I will go to the office Monday. So I will get a chance to add SAS drives
to my setup to see what I get. I have only tested with SATA until now.
My controller is not the same chip as yours though.

> 
>> Though on mount, I would only expect a lot of
>> read commands and not much else.
> 
> Yes, and it is commonly the first SCSI read command which times out. It 
> reliably breaks quite early. So I can think we can rule out issues like 
> memory barriers/timing.
> 
>   There may be some writes and a flush
>> too, so there will be "data" commands and "non data" commands. It may be
>> an issue with non-data commands too ?
>>
> 
> Not sure on that. I guess it isn't.

Anything special with the drives you are using ? Have you tried other
drives to see if you get lucky ?

> 
> Thanks,
> John
John Garry Feb. 11, 2022, 1:54 p.m. UTC | #7
Hi Damien,

>>
>>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>>>> subtle driver bug, but where ....?
>>> What is the command failing ? Always the same ? Can you try adding scsi
>>> trace to see the commands ?
>>
>> This is the same issue I have had since day #1.
>>
>> Generally mount/unmount or even fdisk -l fails after booting into
>> miniramfs. I wouldn't ever try to boot a distro.
> 
> busybox ?
> 

Yes

>>
>>>
>>> If you are "lucky", it is always the same type of command like for the
>>> NCQ NON DATA in my case.
>>
>> I'm just trying SAS disks to start with - so it's an SCSI READ command.
>> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD
>> bugs are common - as this series is evidence) so I start on something
>> more basic - however SATA/STP also has this issue.
>>
>> The command is sent successfully but just never completes. Then
>> sometimes the TMFs for error handling timeout and sometimes succeed. I
>> don't have much to do on....
> 
> No SAS bus analyzer lying in a corner of the office ? :)
> That could help...

None unfortunately

> 
> I will go to the office Monday. So I will get a chance to add SAS drives
> to my setup to see what I get. I have only tested with SATA until now.
> My controller is not the same chip as yours though.

jfyi, Ajish, cc'ed, from microchip says that they see the same issue on 
their arm64 system. Unfortunately fixing it is not a priority for them. 
So it is something which arm64 is exposing.

And I tried an old kernel - like 4.10 - on the same board and the pm8001 
driver was working somewhat reliably (no hangs). It just looks like a 
driver issue.

I'll have a look at the driver code again if I get a chance. It might be 
a DMA issue.

> 
>>
>>> Though on mount, I would only expect a lot of
>>> read commands and not much else.
>>
>> Yes, and it is commonly the first SCSI read command which times out. It
>> reliably breaks quite early. So I can think we can rule out issues like
>> memory barriers/timing.
>>
>>    There may be some writes and a flush
>>> too, so there will be "data" commands and "non data" commands. It may be
>>> an issue with non-data commands too ?
>>>
>>
>> Not sure on that. I guess it isn't.
> 
> Anything special with the drives you are using ? Have you tried other
> drives to see if you get lucky ?
> 

I think that the drives are ok. On the same board I originally had them 
plugged in the hisi_sas HBA and no issues.

thanks,
john
Damien Le Moal Feb. 12, 2022, 6:19 a.m. UTC | #8
On 2/11/22 22:54, John Garry wrote:
> Hi Damien,
> 
>>>
>>>>> Sometimes I get TMF timeouts, which is a bad situation. I guess it's a
>>>>> subtle driver bug, but where ....?
>>>> What is the command failing ? Always the same ? Can you try adding scsi
>>>> trace to see the commands ?
>>>
>>> This is the same issue I have had since day #1.
>>>
>>> Generally mount/unmount or even fdisk -l fails after booting into
>>> miniramfs. I wouldn't ever try to boot a distro.
>>
>> busybox ?
>>
> 
> Yes
> 
>>>
>>>>
>>>> If you are "lucky", it is always the same type of command like for the
>>>> NCQ NON DATA in my case.
>>>
>>> I'm just trying SAS disks to start with - so it's an SCSI READ command.
>>> SATA/STP support is generally never as robust for SAS HBAs (HW and LLD
>>> bugs are common - as this series is evidence) so I start on something
>>> more basic - however SATA/STP also has this issue.
>>>
>>> The command is sent successfully but just never completes. Then
>>> sometimes the TMFs for error handling timeout and sometimes succeed. I
>>> don't have much to do on....
>>
>> No SAS bus analyzer lying in a corner of the office ? :)
>> That could help...
> 
> None unfortunately
> 
>>
>> I will go to the office Monday. So I will get a chance to add SAS drives
>> to my setup to see what I get. I have only tested with SATA until now.
>> My controller is not the same chip as yours though.
> 
> jfyi, Ajish, cc'ed, from microchip says that they see the same issue on 
> their arm64 system. Unfortunately fixing it is not a priority for them. 
> So it is something which arm64 is exposing.
> 
> And I tried an old kernel - like 4.10 - on the same board and the pm8001 
> driver was working somewhat reliably (no hangs). It just looks like a 
> driver issue.
> 
> I'll have a look at the driver code again if I get a chance. It might be 
> a DMA issue.

There is one more thing that I find strange in the driver and that may
cause problems: tag 0 is a perfectly valid tag value that can be
returned by pm8001_tag_alloc() since find_first_zero_bit() will return 0
if the first bit is 0. And yet, there are many places in the driver that
treat !tag as an error. Extremely weird, if not outright broken...

I patched that and tested and everything seems OK... Could it be that
you are not seeing some completions because of that ?

I added the patch to my v3. Will send Monday.