[RFC,v3,00/19] scsi: scsi_error: Introduce new error handle mechanism

From: JiangJianJun <jiangjianjun3@huawei.com>
To: <jejb@linux.ibm.com>, <martin.petersen@oracle.com>, <linux-scsi@vger.kernel.org>
CC: <hare@suse.de>, <linux-kernel@vger.kernel.org>, <lixiaokeng@huawei.com>, <jiangjianjun3@huawei.com>, <hewenliang4@huawei.com>, <yangkunlin7@huawei.com>
Subject: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
Date: Fri, 14 Mar 2025 09:29:08 +0800
Message-ID: <20250314012927.150860-1-jiangjianjun3@huawei.com>

JiangJianJun March 14, 2025, 1:29 a.m. UTC

On systems where a large number of SCSI devices share HBAs, blocking
I/O to every device while error commands are handled is unacceptable.
We need a new error handling mechanism to address this issue.

I asked about this issue a year ago; the discussion link can be found in
the references. Hannes kindly explained why we have to block the SCSI
host before performing error recovery. I think blocking the SCSI host is
not necessary for all drivers, and a smaller-scope recovery (LUN based,
for example) can be tried first to avoid blocking the SCSI host.

The new error handling mechanism introduced in this patchset has been
developed and tested with our self-developed hardware for the past year;
now we would like this mechanism to be usable by more drivers.

Drivers can decide whether to use the new error handling mechanism, and
how to handle error commands, when a scsi_device is scanned. The new
mechanism makes SCSI error handling more flexible.

After blocking the host's I/O, the SCSI error recovery strategy mainly
consists of the following steps:

- LUN reset
- Target reset
- Bus reset
- Host reset

Some drivers do not implement a host reset callback; blocking the host's
I/O is unnecessary for these drivers. For example, smartpqi only registers
device reset. If device reset fails, falling back to target reset, bus
reset or host reset is pointless, because those steps would fail as
well.
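
To make the smartpqi argument concrete, here is a minimal user-space
sketch of the escalation ladder. It is not kernel code: eh_driver_model,
cb_state, eh_escalate() and eh_can_escalate() are hypothetical names
invented for this illustration, and each callback is reduced to "absent",
"fails" or "succeeds" for a single simulated run.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not kernel code: one entry per escalation step. */
enum eh_step {
	EH_DEVICE_RESET,
	EH_TARGET_RESET,
	EH_BUS_RESET,
	EH_HOST_RESET,
	EH_NR_STEPS
};

/* What the driver offers for a step: no callback at all, or a callback
 * that fails or succeeds in this simulated run. */
enum cb_state { CB_ABSENT, CB_FAILS, CB_SUCCEEDS };

struct eh_driver_model {
	enum cb_state step[EH_NR_STEPS];
};

/* Walk the ladder from the narrowest to the widest scope. Returns the
 * step that recovered the command, or EH_NR_STEPS if nothing did. */
enum eh_step eh_escalate(const struct eh_driver_model *drv)
{
	assert(drv != NULL);
	for (int i = EH_DEVICE_RESET; i < EH_NR_STEPS; i++) {
		if (drv->step[i] == CB_SUCCEEDS)
			return (enum eh_step)i;
		/* CB_ABSENT and CB_FAILS both move on to the next step. */
	}
	return EH_NR_STEPS;
}

/* True if any step wider than device reset has a callback, i.e. whether
 * escalating past a failed device reset can change anything at all. */
bool eh_can_escalate(const struct eh_driver_model *drv)
{
	assert(drv != NULL);
	for (int i = EH_TARGET_RESET; i < EH_NR_STEPS; i++)
		if (drv->step[i] != CB_ABSENT)
			return true;
	return false;
}
```

For a driver that only provides device reset, eh_can_escalate() is false
in this model, so a failed device reset leaves nothing further to try.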

Here are some of the drivers we are concerned with (there are too many
drivers to survey them all, so I only list those I am familiar with):

+--------------+--------------+--------------+-----------+------------+
| driver       | device_reset | target_reset | bus_reset | host_reset |
+--------------+--------------+--------------+-----------+------------+
| mpt3sas      |     Y        |     Y        |    N      |    Y       |
+--------------+--------------+--------------+-----------+------------+
| smartpqi     |     Y        |     N        |    N      |    N       |
+--------------+--------------+--------------+-----------+------------+
| megaraid_sas |     N        |     Y        |    N      |    Y       |
+--------------+--------------+--------------+-----------+------------+
| virtio_scsi  |     Y        |     N        |    N      |    N       |
+--------------+--------------+--------------+-----------+------------+
| iscsi_tcp    |     Y        |     Y        |    N      |    N       |
+--------------+--------------+--------------+-----------+------------+
| hisi_sas     |     Y        |     Y        |    N      |    N       |
+--------------+--------------+--------------+-----------+------------+

With LUN based error handling, when a SCSI command is classified as
failed, we block I/O to that SCSI device and try to recover the device.
If not all error commands can be recovered, handling may fall back to
target or host level recovery.

Target based error handling works the same way, except that it blocks
I/O to the SCSI target and then tries to recover the error commands of
that target.

The first patch defines the basic framework to support the LUN/target
based error handling mechanism. Three key operations are abstracted:
 - add an error command
 - wake up the error handler
 - block I/O while an error command is pending and recovery is in progress

Drivers can implement these three callbacks and register them with the
SCSI midlayer. I also add a general LUN/target based error handling
strategy which drivers can call directly to implement LUN/target based
error handling.
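
As a rough illustration of how the three operations fit together, here
is a user-space sketch. None of the names are taken from the patches:
eh_ops, eh_scope, generic_eh_ops, scope_submit() and scope_fail() are
hypothetical. The wakeup rule (start recovery once every outstanding
command in the scope has failed) is an assumption borrowed from how the
existing host-wide handler waits for the host to drain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct eh_scope;

/* Hypothetical callback table for the three abstracted operations. */
struct eh_ops {
	void (*add_cmnd)(struct eh_scope *s);      /* add an error command */
	bool (*wakeup)(struct eh_scope *s);        /* wake up the error handler */
	bool (*is_busy)(const struct eh_scope *s); /* must new I/O be blocked? */
};

/* One recovery scope: a LUN or a target in this model. */
struct eh_scope {
	const struct eh_ops *ops;
	unsigned int inflight;  /* commands issued and not yet finished */
	unsigned int failed;    /* commands queued for error handling */
	bool recovering;        /* the handler has been woken */
};

static void generic_add_cmnd(struct eh_scope *s)
{
	s->failed++;
}

/* Assumption: wait until every outstanding command in the scope has
 * failed before recovery starts. */
static bool generic_wakeup(struct eh_scope *s)
{
	if (s->failed > 0 && s->failed == s->inflight)
		s->recovering = true;
	return s->recovering;
}

static bool generic_is_busy(const struct eh_scope *s)
{
	return s->failed > 0 || s->recovering;
}

const struct eh_ops generic_eh_ops = {
	.add_cmnd = generic_add_cmnd,
	.wakeup   = generic_wakeup,
	.is_busy  = generic_is_busy,
};

/* Submission path: refuse new I/O only for the scope that is in trouble. */
bool scope_submit(struct eh_scope *s)
{
	assert(s != NULL && s->ops != NULL);
	if (s->ops->is_busy(s))
		return false;
	s->inflight++;
	return true;
}

/* Failure path: record the command, then check whether recovery can start. */
bool scope_fail(struct eh_scope *s)
{
	assert(s != NULL && s->ops != NULL);
	s->ops->add_cmnd(s);
	return s->ops->wakeup(s);
}
```

The point of the sketch is that the busy check is evaluated per scope, so
a second LUN with its own eh_scope keeps accepting I/O while the first
one is being recovered.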

The changes to the SCSI midlayer's error handling were tested with
scsi_debug, which supports single-LUN error injection; the scsi_debug
patches can be found in the references. The following scenarios were
tested.

Scenario 1: LUN based error handling is enabled:
+-----------+---------+-------------------------------------------------------+
| lun reset | TUR     | Desired result                                        |
+-----------+---------+-------------------------------------------------------+
| success   | success | retry or finish with EIO(may offline disk)            |
+-----------+---------+-------------------------------------------------------+
| success   | fail    | fall back to host recovery, retry or finish with      |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+
| fail      | NA      | fall back to host recovery, retry or finish with      |
|           |         | EIO(may offline disk)                                 |
+-----------+---------+-------------------------------------------------------+

Scenario 2: target based error handling is enabled:
+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | fall back to host recovery,  |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | fall back to host recovery,  |
|           |         |              |         | retry or finish with EIO(may |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+

Scenario 3: both LUN and target based error handling are enabled:
+-----------+---------+--------------+---------+------------------------------+
| lun reset | TUR     | target reset | TUR     | Desired result               |
+-----------+---------+--------------+---------+------------------------------+
| success   | success | NA           | NA      | retry or finish with         |
|           |         |              |         | EIO(may offline disk)        |
+-----------+---------+--------------+---------+------------------------------+
| success   | fail    | success      | success | LUN recovery falls back to   |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | success | LUN recovery falls back to   |
|           |         |              |         | target recovery, retry or    |
|           |         |              |         | finish with EIO(may offline  |
|           |         |              |         | disk)                        |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | success      | fail    | LUN recovery falls back to   |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
| fail      | NA      | fail         | NA      | LUN recovery falls back to   |
|           |         |              |         | target recovery, then fall   |
|           |         |              |         | back to host recovery, retry |
|           |         |              |         | or finish with EIO(may       |
|           |         |              |         | offline disk)                |
+-----------+---------+--------------+---------+------------------------------+
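
The three scenario tables can be read as one small decision function.
The sketch below is only my reading of the tables, not code from the
patches; EH_LUN_BASED, EH_TARGET_BASED, eh_probe and eh_resolve() are
hypothetical names. It reports the level at which recovery concludes;
the "retry or finish with EIO" part then happens at that level.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags and names; they only mirror the scenario tables. */
#define EH_LUN_BASED    (1u << 0)
#define EH_TARGET_BASED (1u << 1)

enum eh_level { EH_DONE_AT_LUN, EH_DONE_AT_TARGET, EH_FALLBACK_TO_HOST };

struct eh_probe {
	bool reset_ok;  /* did the reset itself succeed? */
	bool tur_ok;    /* did the follow-up TEST UNIT READY succeed? */
};

enum eh_level eh_resolve(unsigned int mode, struct eh_probe lun,
			 struct eh_probe target)
{
	/* The tables only cover cases where at least one mode is enabled. */
	assert(mode & (EH_LUN_BASED | EH_TARGET_BASED));

	/* In all three tables, a LUN reset followed by a good TUR ends
	 * recovery at the LUN level. */
	if (lun.reset_ok && lun.tur_ok)
		return EH_DONE_AT_LUN;

	/* A target reset is only tried in the tables where target based
	 * handling is enabled (scenarios 2 and 3). */
	if ((mode & EH_TARGET_BASED) && target.reset_ok && target.tur_ok)
		return EH_DONE_AT_TARGET;

	/* Everything else ends in the existing host level recovery. */
	return EH_FALLBACK_TO_HOST;
}
```

Scenarios 2 and 3 give the same result in this model; per the tables they
differ only in which handler performs the LUN reset before the target
reset is attempted.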

References: https://lore.kernel.org/linux-scsi/20230815122316.4129333-1-haowenchao2@huawei.com/
References: https://lore.kernel.org/linux-scsi/71e09bb4-ff0a-23fe-38b4-fe6425670efa@huawei.com/

Wenchao Hao (19):
  scsi: scsi_error: Define framework for LUN/target based error handle
  scsi: scsi_error: Move complete variable eh_action from shost to
    sdevice
  scsi: scsi_error: Check if to do reset in scsi_try_xxx_reset
  scsi: scsi_error: Add helper scsi_eh_sdev_stu to do START_UNIT
  scsi: scsi_error: Add helper scsi_eh_sdev_reset to do lun reset
  scsi: scsi_error: Add flags to mark error handle steps has done
  scsi: scsi_error: Add helper to handle scsi device's error command
    list
  scsi: scsi_error: Add a general LUN based error handler
  scsi: core: increase/decrease target_busy without check can_queue
  scsi: scsi_error: Add helper to handle scsi target's error command
    list
  scsi: scsi_error: Add a general target based error handler
  scsi: scsi_debug: Add param to control LUN bassed error handler
  scsi: scsi_debug: Add param to control target based error handle
  scsi: mpt3sas: Add param to control LUN based error handle
  scsi: mpt3sas: Add param to control target based error handle
  scsi: smartpqi: Add param to control LUN based error handle
  scsi: megaraid_sas: Add param to control target based error handle
  scsi: virtio_scsi: Add param to control LUN based error handle
  scsi: iscsi_tcp: Add param to control LUN based error handle

 drivers/scsi/iscsi_tcp.c                  |  20 +
 drivers/scsi/megaraid/megaraid_sas_base.c |  20 +
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      |  28 +
 drivers/scsi/scsi_debug.c                 |  24 +
 drivers/scsi/scsi_error.c                 | 756 ++++++++++++++++++++--
 drivers/scsi/scsi_lib.c                   |  23 +-
 drivers/scsi/scsi_priv.h                  |  18 +
 drivers/scsi/smartpqi/smartpqi_init.c     |  14 +
 drivers/scsi/virtio_scsi.c                |  16 +-
 include/scsi/scsi_device.h                |  97 +++
 include/scsi/scsi_eh.h                    |   8 +
 include/scsi/scsi_host.h                  |   2 -
 12 files changed, 963 insertions(+), 63 deletions(-)

Hannes Reinecke March 14, 2025, 9:01 a.m. UTC | #1

On 3/14/25 02:29, JiangJianJun wrote:
> On systems where a large number of SCSI devices share HBAs, blocking
> I/O to every device while error commands are handled is unacceptable.
> We need a new error handling mechanism to address this issue.
> 
> I asked about this issue a year ago; the discussion link can be found in
> the references. Hannes kindly explained why we have to block the SCSI
> host before performing error recovery. I think blocking the SCSI host is
> not necessary for all drivers, and a smaller-scope recovery (LUN based,
> for example) can be tried first to avoid blocking the SCSI host.
> 
Technically, yes.
There are, however, some issues which would need to be addressed if
someone were to design a new error handler.

1. The 'LUN Reset' TMF (as it's currently being used) is badly scoped; 
it will reset the LUN itself, affecting all ports to that LUN.
So in a multipathed/multiported environment all initiators will be 
affected, even if they haven't experienced an error.
Is that what we want?
Shouldn't we rather use the 'Reset IT Nexus' TMF here?
And, of course, the 'Target Reset' TMF has been dropped from SAM,
so I really don't see the point in spending time here ...

2. Irrespective of the EH granularity, any error handling requires
that all activity at that level be stopped. If you need to
issue a LUN reset, you need to stop I/O for that LUN.

3. The current EH framework is designed around 'struct scsi_cmnd'.
Which means that the command _initiating_ the error handling can
only be returned once the _entire_ error handling (with all
escalations) is finished. And more often than not, the application
is waiting on that command to be completed before the next I/O
is sent. And that really limits the effectiveness of any improved
error handler; the application ultimately has to wait for a
host reset before it can continue.

But anyway.
We already have a mechanism for asynchronous command aborts;
have you checked if you can adapt it for LUN reset, too?
That would be the easiest solution, I guess ...

Cheers,

Hannes

Bart Van Assche March 14, 2025, 3:55 p.m. UTC | #2

On 3/14/25 2:01 AM, Hannes Reinecke wrote:
> 3. The current EH framework is designed around 'struct scsi_cmnd'.
> Which means that the command _initiating_ the error handling can
> only be returned once the _entire_ error handling (with all
> escalations) is finished. And more often than not, the application
> is waiting on that command to be completed before the next I/O
> is sent. And that really limits the effectiveness of any improved
> error handler; the application ultimately has to wait for a
> host reset before it can continue.
> 
> But anyway.
> We already have a mechanism for asynchronous command aborts;
> have you checked if you can adapt it for LUN reset, too?
> That would be the easiest solution, I guess ...

Hmm ... does this mean submitting a LUN reset while concurrently new
SCSI commands can be submitted from another thread? I don't think that's
safe.

Additionally, how could a LUN reset help if a SCSI abort doesn't help?
If a SCSI abort doesn't help, it probably means that the host controller
locked up, e.g. due to a firmware bug. How to recover from this without
resetting the host controller?

Thanks,

Bart.
