Message ID | 20230710180210.1582299-5-bvanassche@acm.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Enable zoned write pipelining for UFS devices | expand |
On 7/11/23 03:01, Bart Van Assche wrote: > From ZBC-2: "The device server terminates with CHECK CONDITION status, with > the sense key set to ILLEGAL REQUEST, and the additional sense code set to > UNALIGNED WRITE COMMAND a write command, other than an entire medium write > same command, that specifies: a) the starting LBA in a sequential write > required zone set to a value that is not equal to the write pointer for that > sequential write required zone; or b) an ending LBA that is not equal to the > last logical block within a physical block (see SBC-5)." > > I am not aware of any other conditions that may trigger the UNALIGNED > WRITE COMMAND response. Trying to write less than 4KB on a 4Kn drive. But that should not ever happen for kernel issued commands. > > Send commands that failed with an unaligned write error to the SCSI error > handler. Let the SCSI error handler sort SCSI commands per LBA before > resubmitting these. > > Increase the number of retries for write commands sent to a sequential > zone to the maximum number of outstanding commands. I think I mentioned this before. When we started btrfs work, we did something similar (but at the IO scheduler level) to try to avoid adding a big lock in btrfs to serialize (and thus order) writes. What we discovered is that it was extremely easy to fall into a situation were the maximum number of possible outstanding request is already issued, but they all are behind a "hole" and indefinitely delayed because the missing request cannot be issued due to the max nr request limit being reached. No forward progress and deadlock. I do not see how your change addresses this problem. The same will happen with this and I do not have any suggestion how to solve this. For btrfs, we ended up using cone append emulation for scsi to avoid the big lock and avoid the FS from having to order writes. That solution guarantees forward progress. Delaying already issued writes that are not sequential has no such guarantees.
On 7/17/23 23:47, Damien Le Moal wrote: > On 7/11/23 03:01, Bart Van Assche wrote: >> Send commands that failed with an unaligned write error to the SCSI error >> handler. Let the SCSI error handler sort SCSI commands per LBA before >> resubmitting these. >> >> Increase the number of retries for write commands sent to a sequential >> zone to the maximum number of outstanding commands. > > I think I mentioned this before. When we started btrfs work, we did something > similar (but at the IO scheduler level) to try to avoid adding a big lock in > btrfs to serialize (and thus order) writes. What we discovered is that it was > extremely easy to fall into a situation were the maximum number of possible > outstanding request is already issued, but they all are behind a "hole" and > indefinitely delayed because the missing request cannot be issued due to the max > nr request limit being reached. No forward progress and deadlock. > > I do not see how your change addresses this problem. The same will happen with > this and I do not have any suggestion how to solve this. For btrfs, we ended up > using cone append emulation for scsi to avoid the big lock and avoid the FS from > having to order writes. That solution guarantees forward progress. Delaying > already issued writes that are not sequential has no such guarantees. Hi Damien, Thank you for having explained in detail the scenario that you ran into. I think what has been explained above is a scenario in which the filesystem allocates requests per zone in another order than the LBA order. How about requiring that the filesystem allocates and submits zoned writes in LBA order per zone? I think that this is how F2FS supports zoned storage. Thanks, Bart.
On 7/19/23 07:53, Bart Van Assche wrote: > On 7/17/23 23:47, Damien Le Moal wrote: >> On 7/11/23 03:01, Bart Van Assche wrote: >>> Send commands that failed with an unaligned write error to the SCSI >>> error >>> handler. Let the SCSI error handler sort SCSI commands per LBA before >>> resubmitting these. >>> >>> Increase the number of retries for write commands sent to a sequential >>> zone to the maximum number of outstanding commands. >> >> I think I mentioned this before. When we started btrfs work, we did >> something >> similar (but at the IO scheduler level) to try to avoid adding a big >> lock in >> btrfs to serialize (and thus order) writes. What we discovered is that >> it was >> extremely easy to fall into a situation were the maximum number of >> possible >> outstanding request is already issued, but they all are behind a >> "hole" and >> indefinitely delayed because the missing request cannot be issued due >> to the max >> nr request limit being reached. No forward progress and deadlock. >> >> I do not see how your change addresses this problem. The same will >> happen with >> this and I do not have any suggestion how to solve this. For btrfs, we >> ended up >> using cone append emulation for scsi to avoid the big lock and avoid >> the FS from >> having to order writes. That solution guarantees forward progress. >> Delaying >> already issued writes that are not sequential has no such guarantees. > > Hi Damien, > > Thank you for having explained in detail the scenario that you ran into. > > I think what has been explained above is a scenario in which the filesystem > allocates requests per zone in another order than the LBA order. How about > requiring that the filesystem allocates and submits zoned writes in LBA > order > per zone? I think that this is how F2FS supports zoned storage. Sure. But what if an application uses the drive directly ? You loose guarantees of forward progress then. Given that an application has to use direct IO for writes to sequential zones, this is unlikely to happen in a "good" scenario, but it also would not be hard to write an application that can deadlock the drive forever by simply missing one write in a sequence of writes for a zone... That is my concern. While f2fs would likely be OK, the delay approach is not solid enough for all cases.
On 7/19/23 02:59, Damien Le Moal wrote: > On 7/19/23 07:53, Bart Van Assche wrote: >> I think what has been explained above is a scenario in which the filesystem >> allocates requests per zone in another order than the LBA order. How about >> requiring that the filesystem allocates and submits zoned writes in LBA >> order >> per zone? I think that this is how F2FS supports zoned storage. > > Sure. But what if an application uses the drive directly ? You loose > guarantees of forward progress then. Given that an application has to > use direct IO for writes to sequential zones, this is unlikely to happen > in a "good" scenario, but it also would not be hard to write an > application that can deadlock the drive forever by simply missing one > write in a sequence of writes for a zone... That is my concern. While > f2fs would likely be OK, the delay approach is not solid enough for all > cases. Hi Damien, This patch series increases the number of retries for zoned writes submitted by a filesystem. Direct I/O from user space is not affected since the code that increases the number of retries occurs in sd_setup_read_write_cmnd(). As you know scsi_prepare_cmd() only calls sd_setup_read_write_cmnd() for requests that come from a filesystem and not for pass-through requests. Does this address your concern? Thanks, Bart.
On 7/20/23 01:31, Bart Van Assche wrote: > On 7/19/23 02:59, Damien Le Moal wrote: >> On 7/19/23 07:53, Bart Van Assche wrote: >>> I think what has been explained above is a scenario in which the filesystem >>> allocates requests per zone in another order than the LBA order. How about >>> requiring that the filesystem allocates and submits zoned writes in LBA >>> order >>> per zone? I think that this is how F2FS supports zoned storage. >> >> Sure. But what if an application uses the drive directly ? You loose >> guarantees of forward progress then. Given that an application has to >> use direct IO for writes to sequential zones, this is unlikely to happen >> in a "good" scenario, but it also would not be hard to write an >> application that can deadlock the drive forever by simply missing one >> write in a sequence of writes for a zone... That is my concern. While >> f2fs would likely be OK, the delay approach is not solid enough for all >> cases. > > Hi Damien, > > This patch series increases the number of retries for zoned writes > submitted by a filesystem. Direct I/O from user space is not affected > since the code that increases the number of retries occurs in > sd_setup_read_write_cmnd(). As you know scsi_prepare_cmd() only calls > sd_setup_read_write_cmnd() for requests that come from a filesystem > and not for pass-through requests. > > Does this address your concern? Not really. I was not thinking about passthrough requests but rather write syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible. > > Thanks, > > Bart. >
On 7/19/23 16:07, Damien Le Moal wrote: > Not really. I was not thinking about passthrough requests but rather write > syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for > these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible. How about introducing a new request flag (REQ_*) such that f2fs can disable zone locking while it remains enabled for all other users (BTRFS and direct I/O)? Thanks, Bart.
On 7/21/23 03:18, Bart Van Assche wrote: > On 7/19/23 16:07, Damien Le Moal wrote: >> Not really. I was not thinking about passthrough requests but rather write >> syscalls to /dev/sdX by applications. sd_setup_read_write_cmnd() is used for >> these too. After all, /dev/sdX *is* a file system too, albeit the simplest possible. > > How about introducing a new request flag (REQ_*) such that f2fs can disable > zone locking while it remains enabled for all other users (BTRFS and direct > I/O)? That could be a safer approach. Still a little worried about any DM used between the FS and the device though. The DMs that support zoned devices should be OK but only under the assumption that the bio issuer (f2fs) guarantees that writes to the same zone are not issued in parallel when that REQ_NO_ZONE_WRITE (or whatever the name) is used. This flag definition will need to have a big fat warning mentioning this constraint, that is, make it clear that it is not a replacement for zone append (in which case we do not write lock zones and do not care about ordering).
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index 3ec8bfd4090f..a6d562f3a085 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -27,6 +27,7 @@ #include <linux/blkdev.h> #include <linux/delay.h> #include <linux/jiffies.h> +#include <linux/list_sort.h> #include <scsi/scsi.h> #include <scsi/scsi_cmnd.h> @@ -681,6 +682,17 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd) fallthrough; case ILLEGAL_REQUEST: + /* + * Unaligned write command. This indicates that zoned writes + * have been received by the device in the wrong order. If zoned + * write pipelining is enabled, retry after all pending commands + * have completed. + */ + if (sshdr.asc == 0x21 && sshdr.ascq == 0x04 && + blk_queue_pipeline_zoned_writes(sdev->request_queue) && + !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd)) + return NEEDS_DELAYED_RETRY; + if (sshdr.asc == 0x20 || /* Invalid command operation code */ sshdr.asc == 0x21 || /* Logical block address out of range */ sshdr.asc == 0x22 || /* Invalid function */ @@ -2177,6 +2189,25 @@ void scsi_eh_flush_done_q(struct list_head *done_q) } EXPORT_SYMBOL(scsi_eh_flush_done_q); +/* + * Returns a negative value if @_a has a lower LBA than @_b, zero if + * both have the same LBA and a positive value otherwise. + */ +static int scsi_cmp_lba(void *priv, const struct list_head *_a, + const struct list_head *_b) +{ + struct scsi_cmnd *a = list_entry(_a, typeof(*a), eh_entry); + struct scsi_cmnd *b = list_entry(_b, typeof(*b), eh_entry); + const sector_t pos_a = blk_rq_pos(scsi_cmd_to_rq(a)); + const sector_t pos_b = blk_rq_pos(scsi_cmd_to_rq(b)); + + if (pos_a < pos_b) + return -1; + if (pos_a > pos_b) + return 1; + return 0; +} + /** * scsi_unjam_host - Attempt to fix a host which has a cmd that failed. * @shost: Host to unjam. @@ -2212,6 +2243,12 @@ static void scsi_unjam_host(struct Scsi_Host *shost) SCSI_LOG_ERROR_RECOVERY(1, scsi_eh_prt_fail_stats(shost, &eh_work_q)); + /* + * Sort pending SCSI commands in LBA order. This is important if write + * pipelining is enabled for a zoned SCSI device. + */ + list_sort(NULL, &eh_work_q, scsi_cmp_lba); + if (!scsi_eh_get_sense(&eh_work_q, &eh_done_q)) scsi_eh_ready_devs(shost, &eh_work_q, &eh_done_q); diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 0226c9279cef..a6cfdc4bfbf1 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -1445,6 +1445,7 @@ static void scsi_complete(struct request *rq) case ADD_TO_MLQUEUE: scsi_queue_insert(cmd, SCSI_MLQUEUE_DEVICE_BUSY); break; + case NEEDS_DELAYED_RETRY: default: scsi_eh_scmd_add(cmd); break; diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c index ab216976dbdc..6cf7495c6b56 100644 --- a/drivers/scsi/sd.c +++ b/drivers/scsi/sd.c @@ -1206,6 +1206,9 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd) cmd->transfersize = sdp->sector_size; cmd->underflow = nr_blocks << 9; cmd->allowed = sdkp->max_retries; + if (blk_queue_pipeline_zoned_writes(rq->q) && + blk_rq_is_seq_zoned_write(rq)) + cmd->allowed += rq->q->nr_requests; cmd->sdb.length = nr_blocks * sdp->sector_size; SCSI_LOG_HLQUEUE(1, diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h index ec093594ba53..6600db046227 100644 --- a/include/scsi/scsi.h +++ b/include/scsi/scsi.h @@ -93,6 +93,7 @@ static inline int scsi_status_is_check_condition(int status) * Internal return values. */ enum scsi_disposition { + NEEDS_DELAYED_RETRY = 0x2000, NEEDS_RETRY = 0x2001, SUCCESS = 0x2002, FAILED = 0x2003,
From ZBC-2: "The device server terminates with CHECK CONDITION status, with the sense key set to ILLEGAL REQUEST, and the additional sense code set to UNALIGNED WRITE COMMAND a write command, other than an entire medium write same command, that specifies: a) the starting LBA in a sequential write required zone set to a value that is not equal to the write pointer for that sequential write required zone; or b) an ending LBA that is not equal to the last logical block within a physical block (see SBC-5)." I am not aware of any other conditions that may trigger the UNALIGNED WRITE COMMAND response. Send commands that failed with an unaligned write error to the SCSI error handler. Let the SCSI error handler sort SCSI commands per LBA before resubmitting these. Increase the number of retries for write commands sent to a sequential zone to the maximum number of outstanding commands. Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- drivers/scsi/scsi_error.c | 37 +++++++++++++++++++++++++++++++++++++ drivers/scsi/scsi_lib.c | 1 + drivers/scsi/sd.c | 3 +++ include/scsi/scsi.h | 1 + 4 files changed, 42 insertions(+)