[3/3] scsi: handle zone resources errors

Message ID 20200910073952.212130-4-damien.lemoal@wdc.com (mailing list archive)
State Superseded
Series Improve error handling

Commit Message

Damien Le Moal Sept. 10, 2020, 7:39 a.m. UTC
ZBC or ZAC disks that have a limit on the number of open zones may fail
a zone open command or a write to a zone that is not already implicitly
or explicitly open if the total number of open zones is already at the
maximum allowed.

For these operations, instead of returning the generic BLK_STS_IOERR,
return BLK_STS_DEV_RESOURCE which is returned as -EBUSY to the I/O
issuer, allowing the device user to act appropriately on these
relatively benign zone resource errors.

With this change the NVMe (ZNS) and sd drivers both return the same
error code for zone resource errors, facilitating the implementation of
IO error handling by the user with a common code base for both device
types.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
---
 drivers/scsi/scsi_lib.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

Damien Le Moal Sept. 10, 2020, 7:45 a.m. UTC | #1
On 2020/09/10 16:40, Damien Le Moal wrote:
> ZBC or ZAC disks that have a limit on the number of open zones may fail
> a zone open command or a write to a zone that is not already implicitly
> or explicitly open if the total number of open zones is already at the
> maximum allowed.
> 
> For these operations, instead of returning the generic BLK_STS_IOERR,
> return BLK_STS_DEV_RESOURCE which is returned as -EBUSY to the I/O
> issuer, allowing the device user to act appropriately on these
> relatively benign zone resource errors.
> 
> With this change the NVMe (ZNS) and sd drivers both return the same
> error code for zone resource errors, facilitating the implementation of
> IO error handling by the user with a common code base for both device
> types.
> 
> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
> ---
>  drivers/scsi/scsi_lib.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 7c6dd6f75190..7eb4a80c3bbb 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -758,6 +758,18 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
>  			/* See SSC3rXX or current. */
>  			action = ACTION_FAIL;
>  			break;
> +		case DATA_PROTECT:
> +			sdev_printk(KERN_INFO, cmd->device,
> +				    "asc/ascq = 0x%02x 0x%02x\n",
> +				    sshdr.asc, sshdr.ascq);

Oops... Forgot to remove my debug message. Re-sending without it.

> +			action = ACTION_FAIL;
> +			if ((sshdr.asc == 0x0C && sshdr.ascq == 0x12) ||
> +			    (sshdr.asc == 0x55 &&
> +			     (sshdr.ascq == 0x0E || sshdr.ascq == 0x0F))) {
> +				/* Insufficient zone resources */
> +				blk_stat = BLK_STS_DEV_RESOURCE;
> +			}
> +			break;
>  		default:
>  			action = ACTION_FAIL;
>  			break;
>

Christoph Hellwig Sept. 10, 2020, 5:53 p.m. UTC | #2
On Thu, Sep 10, 2020 at 04:39:52PM +0900, Damien Le Moal wrote:
> +		case DATA_PROTECT:
> +			sdev_printk(KERN_INFO, cmd->device,
> +				    "asc/ascq = 0x%02x 0x%02x\n",
> +				    sshdr.asc, sshdr.ascq);
> +			action = ACTION_FAIL;
> +			if ((sshdr.asc == 0x0C && sshdr.ascq == 0x12) ||
> +			    (sshdr.asc == 0x55 &&
> +			     (sshdr.ascq == 0x0E || sshdr.ascq == 0x0F))) {
> +				/* Insufficient zone resources */
> +				blk_stat = BLK_STS_DEV_RESOURCE;

BLK_STS_DEV_RESOURCE is a magic error code leading to a retry on the
particular request_queue once it isn't busy any more.  Please don't
abuse it for random other conditions.

Damien Le Moal Sept. 10, 2020, 10:16 p.m. UTC | #3
On 2020/09/11 2:54, Christoph Hellwig wrote:
> On Thu, Sep 10, 2020 at 04:39:52PM +0900, Damien Le Moal wrote:
>> +		case DATA_PROTECT:
>> +			sdev_printk(KERN_INFO, cmd->device,
>> +				    "asc/ascq = 0x%02x 0x%02x\n",
>> +				    sshdr.asc, sshdr.ascq);
>> +			action = ACTION_FAIL;
>> +			if ((sshdr.asc == 0x0C && sshdr.ascq == 0x12) ||
>> +			    (sshdr.asc == 0x55 &&
>> +			     (sshdr.ascq == 0x0E || sshdr.ascq == 0x0F))) {
>> +				/* Insufficient zone resources */
>> +				blk_stat = BLK_STS_DEV_RESOURCE;
> 
> BLK_STS_DEV_RESOURCE is a magic error code leading to a retry on the
> particular request_queue once it isn't busy any more.  Please don't
> abuse it for random other conditions.

Yes, but that is for the submission path, isn't it? This change is in the
completion path and action is set to ACTION_FAIL, so the request is terminated
right away without any retry (tested). More importantly, this leads to the block
layer returning -EBUSY which allows the user to differentiate this
temporary/trivial error from the potentially more serious -EIO.

Keith sent a patch for NVMe ZNS doing something similar, which will result in
the block layer returning -EBUSY for zone resource errors. I would like to unify
scsi and nvme behavior for these recoverable zone resource errors.

So should we define a new BLK_STS_BUSYERR status to differentiate from the
default BLK_STS_IOERR and not overload BLK_STS_DEV_RESOURCE (or
BLK_STS_ZONE_RESOURCE)?
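
To illustrate the user-side handling this reply argues for, here is a minimal
userspace sketch. It assumes zone resource errors reach the I/O issuer as
-EBUSY, as this patch proposes; the patch is marked Superseded, so that
behavior is an assumption rather than a guarantee. The enum and function names
are hypothetical and exist only for this example.

```c
#include <errno.h>

/* Hypothetical policy values, for illustration only. */
enum zone_write_policy {
	ZONE_WRITE_OK,			/* write completed */
	ZONE_WRITE_RETRY_AFTER_CLOSE,	/* free a zone resource, then resubmit */
	ZONE_WRITE_FATAL,		/* treat as a real I/O failure */
};

/*
 * Map the positive errno of a failed zone write (or zone open) to a
 * recovery policy. EBUSY is taken to mean "too many open zones"
 * (assumption: the behavior proposed in this thread); everything else,
 * including EIO, is treated as fatal.
 */
static enum zone_write_policy zone_write_policy(int err)
{
	switch (err) {
	case 0:
		return ZONE_WRITE_OK;
	case EBUSY:
		return ZONE_WRITE_RETRY_AFTER_CLOSE;
	case EIO:
	default:
		return ZONE_WRITE_FATAL;
	}
}
```

A caller would pass errno after a failed write and, on
ZONE_WRITE_RETRY_AFTER_CLOSE, close or finish another zone before resubmitting.
The same switch would serve both sd and NVMe ZNS devices if both report -EBUSY.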

Patch

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 7c6dd6f75190..7eb4a80c3bbb 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -758,6 +758,18 @@  static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
 			/* See SSC3rXX or current. */
 			action = ACTION_FAIL;
 			break;
+		case DATA_PROTECT:
+			sdev_printk(KERN_INFO, cmd->device,
+				    "asc/ascq = 0x%02x 0x%02x\n",
+				    sshdr.asc, sshdr.ascq);
+			action = ACTION_FAIL;
+			if ((sshdr.asc == 0x0C && sshdr.ascq == 0x12) ||
+			    (sshdr.asc == 0x55 &&
+			     (sshdr.ascq == 0x0E || sshdr.ascq == 0x0F))) {
+				/* Insufficient zone resources */
+				blk_stat = BLK_STS_DEV_RESOURCE;
+			}
+			break;
 		default:
 			action = ACTION_FAIL;
 			break;
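
The sense data test added in the DATA_PROTECT case can be checked in isolation.
The following is a standalone sketch, not kernel code: the helper name and the
SK_DATA_PROTECT macro are hypothetical, and the macro only mirrors the value of
DATA_PROTECT (0x07) from the kernel's SCSI headers. The ASC/ASCQ pairs are the
ones tested in the patch above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same numeric value as the kernel's DATA_PROTECT sense key. */
#define SK_DATA_PROTECT 0x07

/*
 * Return true when the sense key and ASC/ASCQ pair match the
 * "insufficient zone resources" conditions tested in the patch:
 * 0x0C/0x12, 0x55/0x0E and 0x55/0x0F, all under DATA PROTECT.
 */
static bool sense_is_zone_resource_error(uint8_t sense_key,
					 uint8_t asc, uint8_t ascq)
{
	if (sense_key != SK_DATA_PROTECT)
		return false;
	if (asc == 0x0C && ascq == 0x12)
		return true;
	return asc == 0x55 && (ascq == 0x0E || ascq == 0x0F);
}
```

Any other DATA PROTECT sense data, such as a write-protected medium, falls
through and keeps the generic error status, as in the patch.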