diff mbox series

[2/5] scsi: Retry unaligned zoned writes

Message ID 20220614174943.611369-3-bvanassche@acm.org (mailing list archive)
State New, archived
Headers show
Series Improve zoned storage write performance | expand

Commit Message

Bart Van Assche June 14, 2022, 5:49 p.m. UTC
From ZBC-2: "The device server terminates with CHECK CONDITION status, with
the sense key set to ILLEGAL REQUEST, and the additional sense code set to
UNALIGNED WRITE COMMAND a write command, other than an entire medium write
same command, that specifies: a) the starting LBA in a sequential write
required zone set to a value that is not equal to the write pointer for that
sequential write required zone; or b) an ending LBA that is not equal to the
last logical block within a physical block (see SBC-5)."

I am not aware of any other conditions that may trigger the UNALIGNED
WRITE COMMAND response.

Retry unaligned writes in preparation of removing zone locking.

Increase the number of retries for write commands sent to a sequential
zone to the maximum number of outstanding commands.

Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_error.c | 6 ++++++
 drivers/scsi/sd.c         | 2 ++
 2 files changed, 8 insertions(+)

Comments

Khazhy Kumykov June 14, 2022, 9:47 p.m. UTC | #1
On Tue, Jun 14, 2022 at 10:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> From ZBC-2: "The device server terminates with CHECK CONDITION status, with
> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
> same command, that specifies: a) the starting LBA in a sequential write
> required zone set to a value that is not equal to the write pointer for that
> sequential write required zone; or b) an ending LBA that is not equal to the
> last logical block within a physical block (see SBC-5)."
>
> I am not aware of any other conditions that may trigger the UNALIGNED
> WRITE COMMAND response.
>
> Retry unaligned writes in preparation of removing zone locking.
Is /just/ retrying effective here? A series of writes to the same zone
would all need to be sent in order - in the worst case (requests
somehow ordered in reverse order) this becomes quadratic as only 1
request "succeeds" out of the N outstanding requests, with the rest
all needing to retry. (Imagine a user writes an entire "zone" - which
could be split into hundreds of requests).

Block layer / schedulers are free to do this reordering, which I
understand does happen whenever we need to requeue - and would result
in a retry of all writes after the first re-ordered request. (side
note: fwiw "requests somehow in reverse order" can happen - bfq
inherited cfq's odd behavior of sometimes issuing sequential IO in
reverse order due to back_seek, e.g.)

>
> Increase the number of retries for write commands sent to a sequential
> zone to the maximum number of outstanding commands.
>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/scsi/scsi_error.c | 6 ++++++
>  drivers/scsi/sd.c         | 2 ++
>  2 files changed, 8 insertions(+)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 49ef864df581..8e22d4ba22a3 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -674,6 +674,12 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>                 fallthrough;
>
>         case ILLEGAL_REQUEST:
> +               /*
> +                * Unaligned write command. Retry immediately to handle
> +                * out-of-order zoned writes.
> +                */
> +               if (sshdr.asc == 0x21 && sshdr.ascq == 0x04)
> +                       return NEEDS_RETRY;
>                 if (sshdr.asc == 0x20 || /* Invalid command operation code */
>                     sshdr.asc == 0x21 || /* Logical block address out of range */
>                     sshdr.asc == 0x22 || /* Invalid function */
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index a1a2ac09066f..8d68bd20723e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1202,6 +1202,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>         cmd->transfersize = sdp->sector_size;
>         cmd->underflow = nr_blocks << 9;
>         cmd->allowed = sdkp->max_retries;
> +       if (blk_rq_is_seq_write(rq))
> +               cmd->allowed += rq->q->nr_hw_queues * rq->q->nr_requests;
>         cmd->sdb.length = nr_blocks * sdp->sector_size;
>
>         SCSI_LOG_HLQUEUE(1,
Bart Van Assche June 14, 2022, 10:39 p.m. UTC | #2
On 6/14/22 14:47, Khazhy Kumykov wrote:
> On Tue, Jun 14, 2022 at 10:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>> same command, that specifies: a) the starting LBA in a sequential write
>> required zone set to a value that is not equal to the write pointer for that
>> sequential write required zone; or b) an ending LBA that is not equal to the
>> last logical block within a physical block (see SBC-5)."
>>
>> I am not aware of any other conditions that may trigger the UNALIGNED
>> WRITE COMMAND response.
>>
>> Retry unaligned writes in preparation of removing zone locking.
> Is /just/ retrying effective here? A series of writes to the same zone
> would all need to be sent in order - in the worst case (requests
> somehow ordered in reverse order) this becomes quadratic as only 1
> request "succeeds" out of the N outstanding requests, with the rest
> all needing to retry. (Imagine a user writes an entire "zone" - which
> could be split into hundreds of requests).
> 
> Block layer / schedulers are free to do this reordering, which I
> understand does happen whenever we need to requeue - and would result
> in a retry of all writes after the first re-ordered request. (side
> note: fwiw "requests somehow in reverse order" can happen - bfq
> inherited cfq's odd behavior of sometimes issuing sequential IO in
> reverse order due to back_seek, e.g.)

Hi Khazhy,

For zoned block devices I propose to only support those I/O schedulers 
that either preserve the LBA order or fix the LBA order if two or more 
out-of-order requests are received by the I/O scheduler.

I agree that in the worst case the number of retries is proportional to 
the square of the number of pending requests. However, for the use case 
that matters most to me, F2FS on top of a UFS device, we haven't seen 
any retries in our tests without I/O scheduler. This is probably because 
of how F2FS submits writes combined with the UFS controller only 
supporting a single hardware queue. I expect to see a small number of 
retries once UFS controllers become available that support multiple 
hardware queues.

Thanks,

Bart.
Damien Le Moal June 14, 2022, 11:29 p.m. UTC | #3
On 6/15/22 02:49, Bart Van Assche wrote:
> From ZBC-2: "The device server terminates with CHECK CONDITION status, with
> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
> same command, that specifies: a) the starting LBA in a sequential write
> required zone set to a value that is not equal to the write pointer for that
> sequential write required zone; or b) an ending LBA that is not equal to the
> last logical block within a physical block (see SBC-5)."
> 
> I am not aware of any other conditions that may trigger the UNALIGNED
> WRITE COMMAND response.
> 
> Retry unaligned writes in preparation of removing zone locking.

Arg. No. No way. AHCI will totally break with that because most AHCI
adapters do not send commands to the drive in the order they are delivered
to the LLD. In more details, the order in which tag bit in the AHCI ready
register are set does not determine the order of command delivery to the
disk. So if zone locking is removed, you constantly get unaligned write
errors.

> 
> Increase the number of retries for write commands sent to a sequential
> zone to the maximum number of outstanding commands.
> 
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  drivers/scsi/scsi_error.c | 6 ++++++
>  drivers/scsi/sd.c         | 2 ++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 49ef864df581..8e22d4ba22a3 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -674,6 +674,12 @@ enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
>  		fallthrough;
>  
>  	case ILLEGAL_REQUEST:
> +		/*
> +		 * Unaligned write command. Retry immediately to handle
> +		 * out-of-order zoned writes.
> +		 */
> +		if (sshdr.asc == 0x21 && sshdr.ascq == 0x04)
> +			return NEEDS_RETRY;
>  		if (sshdr.asc == 0x20 || /* Invalid command operation code */
>  		    sshdr.asc == 0x21 || /* Logical block address out of range */
>  		    sshdr.asc == 0x22 || /* Invalid function */
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index a1a2ac09066f..8d68bd20723e 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1202,6 +1202,8 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>  	cmd->transfersize = sdp->sector_size;
>  	cmd->underflow = nr_blocks << 9;
>  	cmd->allowed = sdkp->max_retries;
> +	if (blk_rq_is_seq_write(rq))
> +		cmd->allowed += rq->q->nr_hw_queues * rq->q->nr_requests;
>  	cmd->sdb.length = nr_blocks * sdp->sector_size;
>  
>  	SCSI_LOG_HLQUEUE(1,
Damien Le Moal June 14, 2022, 11:50 p.m. UTC | #4
On 6/15/22 07:39, Bart Van Assche wrote:
> On 6/14/22 14:47, Khazhy Kumykov wrote:
>> On Tue, Jun 14, 2022 at 10:49 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>>> same command, that specifies: a) the starting LBA in a sequential write
>>> required zone set to a value that is not equal to the write pointer for that
>>> sequential write required zone; or b) an ending LBA that is not equal to the
>>> last logical block within a physical block (see SBC-5)."
>>>
>>> I am not aware of any other conditions that may trigger the UNALIGNED
>>> WRITE COMMAND response.
>>>
>>> Retry unaligned writes in preparation of removing zone locking.
>> Is /just/ retrying effective here? A series of writes to the same zone
>> would all need to be sent in order - in the worst case (requests
>> somehow ordered in reverse order) this becomes quadratic as only 1
>> request "succeeds" out of the N outstanding requests, with the rest
>> all needing to retry. (Imagine a user writes an entire "zone" - which
>> could be split into hundreds of requests).
>>
>> Block layer / schedulers are free to do this reordering, which I
>> understand does happen whenever we need to requeue - and would result
>> in a retry of all writes after the first re-ordered request. (side
>> note: fwiw "requests somehow in reverse order" can happen - bfq
>> inherited cfq's odd behavior of sometimes issuing sequential IO in
>> reverse order due to back_seek, e.g.)
> 
> Hi Khazhy,
> 
> For zoned block devices I propose to only support those I/O schedulers 
> that either preserve the LBA order or fix the LBA order if two or more 
> out-of-order requests are received by the I/O scheduler.

We try that "fix" with the work for zoned btrfs. It does not work. Even
adding a delay to wait for out of order requests (if there is a hole in a
write sequence) does not reliably work as FSes may sometimes take 10s of
seconds to issue all write requests that can be all ordered into a nice
write stream. Even with that delay increased to minutes, we were still
seeing unaligned write errors.

> 
> I agree that in the worst case the number of retries is proportional to 
> the square of the number of pending requests. However, for the use case 
> that matters most to me, F2FS on top of a UFS device, we haven't seen 
> any retries in our tests without I/O scheduler. This is probably because 
> of how F2FS submits writes combined with the UFS controller only 
> supporting a single hardware queue. I expect to see a small number of 
> retries once UFS controllers become available that support multiple 
> hardware queues.
> 
> Thanks,
> 
> Bart.
Bart Van Assche June 14, 2022, 11:54 p.m. UTC | #5
On 6/14/22 16:50, Damien Le Moal wrote:
> We try that "fix" with the work for zoned btrfs. It does not work. Even
> adding a delay to wait for out of order requests (if there is a hole in a
> write sequence) does not reliably work as FSes may sometimes take 10s of
> seconds to issue all write requests that can be all ordered into a nice
> write stream. Even with that delay increased to minutes, we were still
> seeing unaligned write errors.

Please clarify. Doesn't BTRFS use REQ_OP_ZONE_APPEND for zoned storage? 
REQ_OP_ZONE_APPEND requests do not trigger write errors if reordered.

Additionally, our tests with F2FS on top of this patch series pass.

Thanks,

Bart.
Bart Van Assche June 14, 2022, 11:56 p.m. UTC | #6
On 6/14/22 16:29, Damien Le Moal wrote:
> On 6/15/22 02:49, Bart Van Assche wrote:
>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>> same command, that specifies: a) the starting LBA in a sequential write
>> required zone set to a value that is not equal to the write pointer for that
>> sequential write required zone; or b) an ending LBA that is not equal to the
>> last logical block within a physical block (see SBC-5)."
>>
>> I am not aware of any other conditions that may trigger the UNALIGNED
>> WRITE COMMAND response.
>>
>> Retry unaligned writes in preparation of removing zone locking.
> 
> Arg. No. No way. AHCI will totally break with that because most AHCI
> adapters do not send commands to the drive in the order they are delivered
> to the LLD. In more details, the order in which tag bit in the AHCI ready
> register are set does not determine the order of command delivery to the
> disk. So if zone locking is removed, you constantly get unaligned write
> errors.

The performance penalty of zone locking is not acceptable for our use 
case. Does this mean that zone locking needs to be preserved for AHCI 
but not for UFS?

Thanks,

Bart.
Damien Le Moal June 15, 2022, 12:55 a.m. UTC | #7
On 6/15/22 08:54, Bart Van Assche wrote:
> On 6/14/22 16:50, Damien Le Moal wrote:
>> We try that "fix" with the work for zoned btrfs. It does not work. Even
>> adding a delay to wait for out of order requests (if there is a hole in a
>> write sequence) does not reliably work as FSes may sometimes take 10s of
>> seconds to issue all write requests that can be all ordered into a nice
>> write stream. Even with that delay increased to minutes, we were still
>> seeing unaligned write errors.
> 
> Please clarify. Doesn't BTRFS use REQ_OP_ZONE_APPEND for zoned storage? 
> REQ_OP_ZONE_APPEND requests do not trigger write errors if reordered.

The trial I am mentioning above was before we moved to using zone append
and implementing scsi emulation for it.

Note that btrfs uses zone append write for data only. Metadata writes
still require regular write commands.

> 
> Additionally, our tests with F2FS on top of this patch series pass.

Try on an AHCI SMR drive and I am 100% sure that you will see unaligned
write errors.

> 
> Thanks,
> 
> Bart.
Damien Le Moal June 15, 2022, 1:09 a.m. UTC | #8
On 6/15/22 08:56, Bart Van Assche wrote:
> On 6/14/22 16:29, Damien Le Moal wrote:
>> On 6/15/22 02:49, Bart Van Assche wrote:
>>>  From ZBC-2: "The device server terminates with CHECK CONDITION status, with
>>> the sense key set to ILLEGAL REQUEST, and the additional sense code set to
>>> UNALIGNED WRITE COMMAND a write command, other than an entire medium write
>>> same command, that specifies: a) the starting LBA in a sequential write
>>> required zone set to a value that is not equal to the write pointer for that
>>> sequential write required zone; or b) an ending LBA that is not equal to the
>>> last logical block within a physical block (see SBC-5)."
>>>
>>> I am not aware of any other conditions that may trigger the UNALIGNED
>>> WRITE COMMAND response.
>>>
>>> Retry unaligned writes in preparation of removing zone locking.
>>
>> Arg. No. No way. AHCI will totally break with that because most AHCI
>> adapters do not send commands to the drive in the order they are delivered
>> to the LLD. In more details, the order in which tag bit in the AHCI ready
>> register are set does not determine the order of command delivery to the
>> disk. So if zone locking is removed, you constantly get unaligned write
>> errors.
> 
> The performance penalty of zone locking is not acceptable for our use 
> case. Does this mean that zone locking needs to be preserved for AHCI 
> but not for UFS?

I did mention that: if for a UFS device it is OK to not have zone write
locking, then sure, have mq-deadline not use it and eventually even do not
set ELEVATOR_F_ZBD_SEQ_WRITE for the device queue. But AHCI and SAS HBAs
definitely still need it. NVMe too since all it would take to see an
unaligned write is to have the writer context being rescheduled to a
different CPU or multiple contexts simultaneously writing. Also note that
the command requeue path uses a workqueue and that also results in
reordering, potentially with large delays. I seriously doubt that any
reasonable amount of retry will prevent unaligned write errors if there is
a requeue.

Another solution would be to try to hold the zone write lock for a shorter
interval. All we need is to guarantee in order delivery to the device. We
do not care about completion order. So theoretically, all we need, is to
have the LLD unlock the zone after it issues a write to a device. That is
very tricky to do though as that could be very racy. And that is not
always possible too. E.g., for AHCI, the "command delivered to the device"
essentially boils down to "command tag marked as ready in ready register".
But then you need to wait for that bit to be cleared before setting any
other bit for the next write command in sequence (the bit being cleared
means that the drive got the command). And with the current dispatch push
model, that is not easily possible. We would need to go back to legacy
command pull model.

Also note that for ATA & SAS, with recent drives, the performance penalty
of zone write locking is almost nill as long as the drive is running with
write-cache enabled. And even with write-cache disabled, recent drives are
almost as fast as the WCE case.

> 
> Thanks,
> 
> Bart.
>
Christoph Hellwig June 15, 2022, 5:49 a.m. UTC | #9
On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
> The performance penalty of zone locking is not acceptable for our use case. 
> Does this mean that zone locking needs to be preserved for AHCI but not for 
> UFS?

It means you use case needs to use zone append, and we need to make sure
it is added to SCSI assuming your are on SCSI based on your other comments.
Damien Le Moal June 15, 2022, 7:21 a.m. UTC | #10
On 6/15/22 14:49, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>> The performance penalty of zone locking is not acceptable for our use case. 
>> Does this mean that zone locking needs to be preserved for AHCI but not for 
>> UFS?
> 
> It means you use case needs to use zone append, and we need to make sure
> it is added to SCSI assuming your are on SCSI based on your other comments.

For scsi, we already have the zone append emulation in the sd driver which
issues regular writes together with taking the zone lock. So UFS has zone
append, not as a native command though. But if for UFS device there are no
issue of command reordering underneath sd, then the zone append emulation
can be done without taking the zone write lock, which would result in the
same performance as what a native zone append command would give.

At least for the short term, that could be a good solution until native
zone append is added to zbc specs. That last part may face a lot of
pushback because of the difficulty of having that same command on ATA side.
Bart Van Assche June 15, 2022, 7:38 p.m. UTC | #11
On 6/15/22 00:21, Damien Le Moal wrote:
> On 6/15/22 14:49, Christoph Hellwig wrote:
>> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>>> The performance penalty of zone locking is not acceptable for our use case.
>>> Does this mean that zone locking needs to be preserved for AHCI but not for
>>> UFS?
>>
>> It means you use case needs to use zone append, and we need to make sure
>> it is added to SCSI assuming your are on SCSI based on your other comments.
> 
> For scsi, we already have the zone append emulation in the sd driver which
> issues regular writes together with taking the zone lock. So UFS has zone
> append, not as a native command though. But if for UFS device there are no
> issue of command reordering underneath sd, then the zone append emulation
> can be done without taking the zone write lock, which would result in the
> same performance as what a native zone append command would give.
> 
> At least for the short term, that could be a good solution until native
> zone append is added to zbc specs. That last part may face a lot of
> pushback because of the difficulty of having that same command on ATA side.

Is there more information available about the difficulties of defining a 
ZONE APPEND command for ATA? It seems to me that there is enough space 
available in the ATA Status Return Descriptor to report the LBA back to 
the host?

Thanks,

Bart.
Bart Van Assche June 15, 2022, 7:42 p.m. UTC | #12
On 6/14/22 22:49, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>> The performance penalty of zone locking is not acceptable for our use case.
>> Does this mean that zone locking needs to be preserved for AHCI but not for
>> UFS?
> 
> It means you use case needs to use zone append, and we need to make sure
> it is added to SCSI assuming your are on SCSI based on your other comments.

Unfortunately I do not know enough about F2FS to comment on whether or 
not making it use REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE is feasible.

Bart.
Damien Le Moal June 16, 2022, 12:14 a.m. UTC | #13
On 2022/06/16 4:38, Bart Van Assche wrote:
> On 6/15/22 00:21, Damien Le Moal wrote:
>> On 6/15/22 14:49, Christoph Hellwig wrote:
>>> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>>>> The performance penalty of zone locking is not acceptable for our use case.
>>>> Does this mean that zone locking needs to be preserved for AHCI but not for
>>>> UFS?
>>>
>>> It means you use case needs to use zone append, and we need to make sure
>>> it is added to SCSI assuming your are on SCSI based on your other comments.
>>
>> For scsi, we already have the zone append emulation in the sd driver which
>> issues regular writes together with taking the zone lock. So UFS has zone
>> append, not as a native command though. But if for UFS device there are no
>> issue of command reordering underneath sd, then the zone append emulation
>> can be done without taking the zone write lock, which would result in the
>> same performance as what a native zone append command would give.
>>
>> At least for the short term, that could be a good solution until native
>> zone append is added to zbc specs. That last part may face a lot of
>> pushback because of the difficulty of having that same command on ATA side.
> 
> Is there more information available about the difficulties of defining a 
> ZONE APPEND command for ATA? It seems to me that there is enough space 
> available in the ATA Status Return Descriptor to report the LBA back to 
> the host?

Unfortunately no, there is not enough space. See table 354 of ACS-5
specifications for the normal output of NCQ commands: 1B error, 1B status and 4B
sactive field. That is all you get. No room for an LBA. And even worse: sata IO
does not define any FIS to do that (return an LBA). So adding zone append to ATA
without relying on an extra command to get the written LBA (similarly to how you
get sense data from ATA drives using an extra command) would need both ATA/ACS
changes *and* SATA-IO changes, which mean that all HBAs/AHCI/SATA adapters
existing today would not work.

But in the context of UFS only, it should be possible to add zone append to ZBC
without going into ZAC. Since we have zone append emulation already, that would
not break anything in Linux stack.

> 
> Thanks,
> 
> Bart.
Bart Van Assche June 16, 2022, 6:36 p.m. UTC | #14
On 6/14/22 22:49, Christoph Hellwig wrote:
> On Tue, Jun 14, 2022 at 04:56:52PM -0700, Bart Van Assche wrote:
>> The performance penalty of zone locking is not acceptable for our use case.
>> Does this mean that zone locking needs to be preserved for AHCI but not for
>> UFS?
> 
> It means you use case needs to use zone append, and we need to make sure
> it is added to SCSI assuming your are on SCSI based on your other comments.

Using ZONE APPEND instead of SCSI WRITE commands would have the 
following consequences:
* Data of a single file might be stored out-of-order on the storage medium.
* A translation table would have to be stored on the storage medium with 
one entry per ZONE APPEND command with the LBAs of where the ZONE APPEND 
commands wrote their data. I think that translation table can be 
considered as another L2P table. One of our motivations to switch to 
zoned storage is to get rid of nested L2P tables.

Both aspects above would have a negative impact on sequential read 
performance. Hence our preference to use WRITE commands for zoned 
storage instead of ZONE APPEND commands.

Thanks,

Bart.
diff mbox series

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 49ef864df581..8e22d4ba22a3 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -674,6 +674,12 @@  enum scsi_disposition scsi_check_sense(struct scsi_cmnd *scmd)
 		fallthrough;
 
 	case ILLEGAL_REQUEST:
+		/*
+		 * Unaligned write command. Retry immediately to handle
+		 * out-of-order zoned writes.
+		 */
+		if (sshdr.asc == 0x21 && sshdr.ascq == 0x04)
+			return NEEDS_RETRY;
 		if (sshdr.asc == 0x20 || /* Invalid command operation code */
 		    sshdr.asc == 0x21 || /* Logical block address out of range */
 		    sshdr.asc == 0x22 || /* Invalid function */
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index a1a2ac09066f..8d68bd20723e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1202,6 +1202,8 @@  static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
 	cmd->transfersize = sdp->sector_size;
 	cmd->underflow = nr_blocks << 9;
 	cmd->allowed = sdkp->max_retries;
+	if (blk_rq_is_seq_write(rq))
+		cmd->allowed += rq->q->nr_hw_queues * rq->q->nr_requests;
 	cmd->sdb.length = nr_blocks * sdp->sector_size;
 
 	SCSI_LOG_HLQUEUE(1,