diff mbox series

[v2,4/5] scsi: Retry unaligned zoned writes

Message ID 20230710180210.1582299-5-bvanassche@acm.org (mailing list archive)
State New, archived
Headers show
Series Enable zoned write pipelining for UFS devices | expand

Commit Message

Bart Van Assche July 10, 2023, 6:01 p.m. UTC
From ZBC-2: "The device server terminates with CHECK CONDITION status, with
the sense key set to ILLEGAL REQUEST, and the additional sense code set to
UNALIGNED WRITE COMMAND a write command, other than an entire medium write
same command, that specifies: a) the starting LBA in a sequential write
required zone set to a value that is not equal to the write pointer for that
sequential write required zone; or b) an ending LBA that is not equal to the
last logical block within a physical block (see SBC-5)."

I am not aware of any other conditions that may trigger the UNALIGNED
WRITE COMMAND response.

Send commands that failed with an unaligned write error to the SCSI error
handler. Let the SCSI error handler sort SCSI commands per LBA before
resubmitting these.

Increase the number of retries for write commands sent to a sequential
zone to the maximum number of outstanding commands.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_error.c | 37 +++++++++++++++++++++++++++++++++++++
 drivers/scsi/scsi_lib.c   |  1 +
 drivers/scsi/sd.c         |  3 +++
 include/scsi/scsi.h       |  1 +
 4 files changed, 42 insertions(+)

Comments

Damien Le Moal July 18, 2023, 6:47 a.m. UTC | #1
On 7/11/23 03:01, Bart Van Assche wrote:
> From ZBC-2: "The device server terminates with CHECK CONDITION status, with
> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
> same command, that specifies: a) the starting LBA in a sequential write
> required zone set to a value that is not equal to the write pointer for that
> sequential write required zone; or b) an ending LBA that is not equal to the
> last logical block within a physical block (see SBC-5)."
> 
> I am not aware of any other conditions that may trigger the UNALIGNED
> WRITE COMMAND response.

Trying to write less than 4KB on a 4Kn drive. But that should not ever happen
for kernel issued commands.

> 
> Send commands that failed with an unaligned write error to the SCSI error
> handler. Let the SCSI error handler sort SCSI commands per LBA before
> resubmitting these.
> 
> Increase the number of retries for write commands sent to a sequential
> zone to the maximum number of outstanding commands.

I think I mentioned this before. When we started btrfs work, we did something
similar (but at the IO scheduler level) to try to avoid adding a big lock in
btrfs to serialize (and thus order) writes. What we discovered is that it was
extremely easy to fall into a situation were the maximum number of possible
outstanding request is already issued, but they all are behind a "hole" and
indefinitely delayed because the missing request cannot be issued due to the max
nr request limit being reached. No forward progress and deadlock.

I do not see how your change addresses this problem. The same will happen with
this and I do not have any suggestion how to solve this. For btrfs, we ended up
using cone append emulation for scsi to avoid the big lock and avoid the FS from
having to order writes. That solution guarantees forward progress. Delaying
already issued writes that are not sequential has no such guarantees.
Bart Van Assche July 18, 2023, 10:53 p.m. UTC | #2
On 7/17/23 23:47, Damien Le Moal wrote:
> On 7/11/23 03:01, Bart Van Assche wrote:
>> Send commands that failed with an unaligned write error to the SCSI error
>> handler. Let the SCSI error handler sort SCSI commands per LBA before
>> resubmitting these.
>>
>> Increase the number of retries for write commands sent to a sequential
>> zone to the maximum number of outstanding commands.
> 
> I think I mentioned this before. When we started btrfs work, we did something
> similar (but at the IO scheduler level) to try to avoid adding a big lock in
> btrfs to serialize (and thus order) writes. What we discovered is that it was
> extremely easy to fall into a situation were the maximum number of possible
> outstanding request is already issued, but they all are behind a "hole" and
> indefinitely delayed because the missing request cannot be issued due to the max
> nr request limit being reached. No forward progress and deadlock.
> 
> I do not see how your change addresses this problem. The same will happen with
> this and I do not have any suggestion how to solve this. For btrfs, we ended up
> using cone append emulation for scsi to avoid the big lock and avoid the FS from
> having to order writes. That solution guarantees forward progress. Delaying
> already issued writes that are not sequential has no such guarantees.

Hi Damien,

Thank you for having explained in detail the scenario that you ran into.

I think what has been explained above is a scenario in which the filesystem
allocates requests per zone in another order than the LBA order. How about
requiring that the filesystem allocates and submits zoned writes in LBA order
per zone? I think that this is how F2FS supports zoned storage.

Thanks,

Bart.
Damien Le Moal July 19, 2023, 9:59 a.m. UTC | #3
On 7/19/23 07:53, Bart Van Assche wrote:
> On 7/17/23 23:47, Damien Le Moal wrote:
>> On 7/11/23 03:01, Bart Van Assche wrote:
>>> Send commands that failed with an unaligned write error to the SCSI
>>> error
>>> handler. Let the SCSI error handler sort SCSI commands per LBA before
>>> resubmitting these.
>>>
>>> Increase the number of retries for write commands sent to a sequential
>>> zone to the maximum number of outstanding commands.
>>
>> I think I mentioned this before. When we started btrfs work, we did
>> something
>> similar (but at the IO scheduler level) to try to avoid adding a big
>> lock in
>> btrfs to serialize (and thus order) writes. What we discovered is that
>> it was
>> extremely easy to fall into a situation were the maximum number of
>> possible
>> outstanding request is already issued, but they all are behind a
>> "hole" and
>> indefinitely delayed because the missing request cannot be issued due
>> to the max
>> nr request limit being reached. No forward progress and deadlock.
>>
>> I do not see how your change addresses this problem. The same will
>> happen with
>> this and I do not have any suggestion how to solve this. For btrfs, we
>> ended up
>> using cone append emulation for scsi to avoid the big lock and avoid
>> the FS from
>> having to order writes. That solution guarantees forward progress.
>> Delaying
>> already issued writes that are not sequential has no such guarantees.
> 
> Hi Damien,
> 
> Thank you for having explained in detail the scenario that you ran into.
> 
> I think what has been explained above is a scenario in which the filesystem
> allocates requests per zone in another order than the LBA order. How about
> requiring that the filesystem allocates and submits zoned writes in LBA
> order
> per zone? I think that this is how F2FS supports zoned storage.

Sure. But what if an application uses the drive directly ? You loose
guarantees of forward progress then. Given that an application has to
use direct IO for writes to sequential zones, this is unlikely to happen
in a "good" scenario, but it also would not be hard to write an
application that can deadlock the drive forever by simply missing one
write in a sequence of writes for a zone... That is my concern. While
f2fs would likely be OK, the delay approach is not solid enough for all
cases.
Bart Van Assche July 19, 2023, 4:31 p.m. UTC | #4
On 7/19/23 02:59, Damien Le Moal wrote:
> On 7/19/23 07:53, Bart Van Assche wrote:
>> I think what has been explained above is a scenario in which the filesystem
>> allocates requests per zone in another order than the LBA order. How about
>> requiring that the filesystem allocates and submits zoned writes in LBA
>> order
>> per zone? I think that this is how F2FS supports zoned storage.
> 
> Sure. But what if an application uses the drive directly ? You loose
> guarantees of forward progress then. Given that an application has to
> use direct IO for writes to sequential zones, this is unlikely to happen
> in a "good" scenario, but it also would not be hard to write an
> application that can deadlock the drive forever by simply missing one
> write in a sequence of writes for a zone... That is my concern. While
> f2fs would likely be OK, the delay approach is not solid enough for all
> cases.

Hi Damien,

This patch series increases the number of retries for zoned writes
submitted by a filesystem. Direct I/O from user space is not affected
since the code that increases the number of retries occurs in
sd_setup_read_write_cmnd(). As you know scsi_prepare_cmd() only calls
sd_setup_read_write_cmnd() for requests that come from a filesystem
and not for pass-through requests.

Does this address your concern?

Thanks,

Bart.
Damien Le Moal July 19, 2023, 11:07 p.m. UTC | #5
On 7/20/23 01:31, Bart Van Assche wrote:
> On 7/19/23 02:59, Damien Le Moal wrote:
>> On 7/19/23 07:53, Bart Van Assche wrote:
>>> I think what has been explained above is a scenario in which the filesystem
>>> allocates requests per zone in another order than the LBA order. How about
>>> requiring that the filesystem allocates and submits zoned writes in LBA
>>> order
>>> per zone? I think that this is how F2FS supports zoned storage.
>>
>> Sure. But what if an application uses the drive directly ? You loose
>> guarantees of forward progress then. Given that an application has to
>> use direct IO for writes to sequential zones, this is unlikely to happen
>> in a "good" scenario, but it also would not be hard to write an
>> application that can deadlock the drive forever by simply missing one
>> write in a sequence of writes for a zone... That is my concern. While
>> f2fs would likely be OK, the delay approach is not solid enough for all
>> cases.
> 
> Hi Damien,
> 
> This patch series increases the number of retries for zoned writes
> submitted by a filesystem. Direct I/O from user space is not affected
> since the code that increases the number of retries occurs in
> sd_setup_read_write_cmnd(). As you know scsi_prepare_cmd() only calls
> sd_setup_read_write_cmnd() for requests that come from a filesystem
> and not for pass-through requests.
> 
> Does this address your concern?

Not really. I was not thinking about passthrough requests but rather write
syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for
these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible.

> 
> Thanks,
> 
> Bart.
>
Bart Van Assche July 20, 2023, 6:18 p.m. UTC | #6
On 7/19/23 16:07, Damien Le Moal wrote:
> Not really. I was not thinking about passthrough requests but rather write
> syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for
> these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible.

How about introducing a new request flag (REQ_*) such that f2fs can disable
zone locking while it remains enabled for all other users (BTRFS and direct
I/O)?

Thanks,

Bart.
Damien Le Moal July 21, 2023, 12:20 a.m. UTC | #7
On 7/21/23 03:18, Bart Van Assche wrote:
> On 7/19/23 16:07, Damien Le Moal wrote:
>> Not really. I was not thinking about passthrough requests but rather write
>> syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for
>> these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible.
> 
> How about introducing a new request flag (REQ_*) such that f2fs can disable
> zone locking while it remains enabled for all other users (BTRFS and direct
> I/O)?

That could be a safer approach. Still a little worried about any DM used between
the FS and the device though. The DMs that support zoned devices should be OK
but only under the assumption that the bio issuer (f2fs) guarantees that writes
to the same zone are not issued in parallel when that REQ_NO_ZONE_WRITE (or
whatever the name) is used. This flag definition will need to have a big fat
warning mentioning this constraint, that is, make it clear that it is not a
replacement for zone append (in which case we do not write lock zones and do not
care about ordering).
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 3ec8bfd4090f..a6d562f3a085 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -27,6 +27,7 @@ 
 #include <linux/blkdev.h>
 #include <linux/delay.h>
 #include <linux/jiffies.h>
+#include <linux/list_sort.h>
 
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
@@ -681,6 +682,17 @@  enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		fallthrough;
 
 	case ILLEGAL_REQUEST:
+		/*
+		 * Unaligned write command. This indicates that zoned writes
+		 * have been received by the device in the wrong order. If zoned
+		 * write pipelining is enabled, retry after all pending commands
+		 * have completed.
+		 */
+		if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 &&
+		    blk_queue_pipeline_zoned_writes(sdev->request_queue) &&
+		    !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd))
+			return NEEDS_DELAYED_RETRY;
+
 		if (sshdr.asc == 0x20 || /* Invalid command operation code */
 		    sshdr.asc == 0x21 || /* Logical block address out of range */
 		    sshdr.asc == 0x22 || /* Invalid function */
@@ -2177,6 +2189,25 @@  void scsi_eh_flush_done_q(struct list_head *done_q)
 }
 EXPORT_SYMBOL(scsi_eh_flush_done_q);
 
+/*
+ * Returns a negative value if @_a has a lower LBA than @_b, zero if
+ * both have the same LBA and a positive value otherwise.
+ */
+static int scsi_cmp_lba(void *priv, const struct list_head *_a,
+			const struct list_head *_b)
+{
+	struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry);
+	struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry);
+	const sector_t pos_a = blk_rq_pos(scsi_cmd_to_rq(a));
+	const sector_t pos_b = blk_rq_pos(scsi_cmd_to_rq(b));
+
+	if (pos_a < pos_b)
+		return -1;
+	if (pos_a > pos_b)
+		return 1;
+	return 0;
+}
+
 /**
  * scsi_unjam_host - Attempt to fix a host which has a cmd that failed.
  * @shost:	Host to unjam.
@@ -2212,6 +2243,12 @@  static void scsi_unjam_host(struct Scsi_Host *shost)
 
 	SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q));
 
+	/*
+	 * Sort pending SCSI commands in LBA order. This is important if write
+	 * pipelining is enabled for a zoned SCSI device.
+	 */
+	list_sort(NULL, &eh_work_q, scsi_cmp_lba);
+
 	if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q))
 		scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q);
 
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0226c9279cef..a6cfdc4bfbf1 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1445,6 +1445,7 @@  static void scsi_complete(struct request *rq)
 	case ADD_TO_MLQUEUE:
 		scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY);
 		break;
+	case NEEDS_DELAYED_RETRY:
 	default:
 		scsi_eh_scmd_add(cmd);
 		break;
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index ab216976dbdc..6cf7495c6b56 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1206,6 +1206,9 @@  static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	cmd->transfersize = sdp->sector_size;
 	cmd->underflow = nr_blocks << 9;
 	cmd->allowed = sdkp->max_retries;
+	if (blk_queue_pipeline_zoned_writes(rq->q) &&
+	    blk_rq_is_seq_zoned_write(rq))
+		cmd->allowed += rq->q->nr_requests;
 	cmd->sdb.length = nr_blocks * sdp->sector_size;
 
 	SCSI_LOG_HLQUEUE(1,
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index ec093594ba53..6600db046227 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -93,6 +93,7 @@  static inline int scsi_status_is_check_condition(int status)
  * Internal return values.
  */
 enum scsi_disposition {
+	NEEDS_DELAYED_RETRY	= 0x2000,
 	NEEDS_RETRY		= 0x2001,
 	SUCCESS			= 0x2002,
 	FAILED			= 0x2003,