[0/2] Submit split bios in LBA order

Message ID 20230317195938.1745318-1-bvanassche@acm.org (mailing list archive)

Message

Bart Van Assche March 17, 2023, 7:59 p.m. UTC
Hi Jens,

For zoned storage it is essential that split bios are submitted in LBA order.
This patch series realizes this by modifying __bio_split_to_limits() such that
it submits the first bio fragment and returns the remainder instead of
submitting the remainder and returning the first bio fragment. Please consider
this patch series for the next merge window.

Thanks,

Bart.

Bart Van Assche (2):
  block: Split blk_recalc_rq_segments()
  block: Split and submit bios in LBA order

 block/blk-merge.c      | 33 +++++++++++++++++++--------------
 block/blk-mq.c         |  7 +++++--
 block/blk.h            |  1 +
 include/linux/blk-mq.h |  6 ++++++
 4 files changed, 31 insertions(+), 16 deletions(-)
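
The ordering described in the cover letter (submit the first fragment, keep the remainder, repeat) can be sketched with a small toy model. This is not kernel code and does not reproduce __bio_split_to_limits(); the names split_front, submit_in_lba_order and max_sectors are hypothetical, and a bio is reduced to a (start LBA, sector count) pair.

```python
def split_front(start, nr_sectors, max_sectors):
    """Split a (start, nr_sectors) range into a first fragment of at most
    max_sectors and the remainder (None when nothing is left)."""
    if max_sectors <= 0:
        raise ValueError("max_sectors must be positive")
    first_len = min(nr_sectors, max_sectors)
    first = (start, first_len)
    rest = nr_sectors - first_len
    remainder = (start + first_len, rest) if rest > 0 else None
    return first, remainder


def submit_in_lba_order(start, nr_sectors, max_sectors):
    """Submit the first fragment, hold on to the remainder, and repeat
    until the remainder fits in a single fragment."""
    submitted = []
    remainder = (start, nr_sectors)
    while remainder is not None:
        first, remainder = split_front(remainder[0], remainder[1], max_sectors)
        # Appending stands in for handing the fragment to the driver.
        submitted.append(first)
    return submitted


fragments = submit_in_lba_order(start=0, nr_sectors=2560, max_sectors=1024)
print(fragments)  # [(0, 1024), (1024, 1024), (2048, 512)]
```

In this model the fragments reach the "driver" in ascending LBA order by construction, which is the property the series aims for on zoned storage.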

Comments

Christoph Hellwig March 18, 2023, 6:29 a.m. UTC | #1
On Fri, Mar 17, 2023 at 12:59:36PM -0700, Bart Van Assche wrote:
> Hi Jens,
> 
> For zoned storage it is essential that split bios are submitted in LBA order.
> This patch series realizes this by modifying __bio_split_to_limits() such that
> it submits the first bio fragment and returns the remainder instead of
> submitting the remainder and returning the first bio fragment. Please consider
> this patch series for the next merge window.

Why are you sending large writes using REQ_OP_WRITE and not
using REQ_OP_ZONE_APPEND which sidesteps all these issues?

Bart Van Assche March 20, 2023, 5:22 p.m. UTC | #2
On 3/17/23 23:29, Christoph Hellwig wrote:
> On Fri, Mar 17, 2023 at 12:59:36PM -0700, Bart Van Assche wrote:
>> For zoned storage it is essential that split bios are submitted in LBA order.
>> This patch series realizes this by modifying __bio_split_to_limits() such that
>> it submits the first bio fragment and returns the remainder instead of
>> submitting the remainder and returning the first bio fragment. Please consider
>> this patch series for the next merge window.
> 
> Why are you sending large writes using REQ_OP_WRITE and not
> using REQ_OP_ZONE_APPEND which sidesteps all these issues?

Hi Christoph,

How to achieve optimal performance with REQ_OP_ZONE_APPEND for SCSI 
devices? My understanding of how REQ_OP_ZONE_APPEND works for SCSI 
devices is as follows:
* ATA devices cannot support this operation directly because there are
   not enough bits in the ATA sense data to report where appended data
   has been written.
* T10 has not yet started with standardizing a zone append operation.
* The code that emulates REQ_OP_ZONE_APPEND for SCSI devices (in
   sd_zbc.c) serializes REQ_OP_ZONE_APPEND operations (QD=1).
* To achieve optimal performance, QD > 1 is required.
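
The difference between a regular write and a zone append that underlies this exchange can be illustrated with a toy model of a sequential-write-required zone. This is only a sketch under simplifying assumptions: the Zone class and its methods are hypothetical, sizes are in sectors, and nothing here reflects the sd_zbc.c emulation. A regular write must start exactly at the write pointer, so reordered writes fail; an append lets the device choose the location and report it back, so submission order does not matter.

```python
class Zone:
    """Toy model of a sequential-write-required zone (sizes in sectors)."""

    def __init__(self, start, size):
        self.start = start
        self.size = size
        self.wp = start  # write pointer: the next LBA the zone accepts

    def _advance(self, nr_sectors):
        if self.wp + nr_sectors > self.start + self.size:
            raise ValueError("write exceeds zone capacity")
        self.wp += nr_sectors

    def write(self, lba, nr_sectors):
        """Regular write: rejected unless it starts at the write pointer."""
        if lba != self.wp:
            raise ValueError(f"write at LBA {lba} but write pointer is {self.wp}")
        self._advance(nr_sectors)

    def append(self, nr_sectors):
        """Zone append: the device picks the LBA and returns it."""
        lba = self.wp
        self._advance(nr_sectors)
        return lba


zone = Zone(start=0, size=4096)
zone.write(0, 1024)  # in order: accepted
try:
    zone.write(2048, 1024)  # arrives ahead of the write for LBA 1024
    reordered_write_ok = True
except ValueError:
    reordered_write_ok = False

append_lbas = [zone.append(1024), zone.append(1024)]
print(reordered_write_ok, append_lbas)  # False [1024, 2048]
```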

Thanks,

Bart.

Khazhy Kumykov March 20, 2023, 9:06 p.m. UTC | #3
On Mon, Mar 20, 2023 at 10:28 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 3/17/23 23:29, Christoph Hellwig wrote:
> > On Fri, Mar 17, 2023 at 12:59:36PM -0700, Bart Van Assche wrote:
> >> For zoned storage it is essential that split bios are submitted in LBA order.
> >> This patch series realizes this by modifying __bio_split_to_limits() such that
> >> it submits the first bio fragment and returns the remainder instead of
> >> submitting the remainder and returning the first bio fragment. Please consider
> >> this patch series for the next merge window.
> >
> > Why are you sending large writes using REQ_OP_WRITE and not
> > using REQ_OP_ZONE_APPEND which sidesteps all these issues?
>
> Hi Christoph,
>
> How to achieve optimal performance with REQ_OP_ZONE_APPEND for SCSI
> devices? My understanding of how REQ_OP_ZONE_APPEND works for SCSI
> devices is as follows:
> * ATA devices cannot support this operation directly because there are
>    not enough bits in the ATA sense data to report where appended data
>    has been written.
> * T10 has not yet started with standardizing a zone append operation.
> * The code that emulates REQ_OP_ZONE_APPEND for SCSI devices (in
>    sd_zbc.c) serializes REQ_OP_ZONE_APPEND operations (QD=1).
> * To achieve optimal performance, QD > 1 is required.

I recall there were dragons lurking, particularly with how we handle
requeues wherein just submitting in order was not sufficient to
guarantee IO is actually dispatched in order. (of note: when
requeueing a request, we splice it to the _end_ of the hctx dispatch
list, so if you get a requeue in the middle of a multi-segment IO, it
will get re-ordered. I recall this change went in specifically to
re-order requests in case there was a passthrough lurking to un-jam a
device.) Have you looked at this? Perhaps requeues are slowpath
anyways, so we could sort there? There may also be other requeue
weirdness with layered devices...
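
The reordering concern above can be sketched with a toy dispatch list. This is not blk-mq code: the dispatch function, the fail_once set and the sort_on_requeue flag are hypothetical, requests are reduced to their start LBA, and a "requeue" is modeled as a single failed dispatch attempt. Sorting on requeue is only the idea floated above, not a confirmed fix.

```python
from collections import deque


def dispatch(requests, fail_once, sort_on_requeue=False):
    """Issue requests from the head of a dispatch list.

    A request whose LBA is in fail_once is requeued one time. By default it
    is put back at the end of the list; with sort_on_requeue the list is
    re-sorted by LBA after the requeue.
    """
    pending = deque(requests)
    to_fail = set(fail_once)
    issued = []
    while pending:
        lba = pending.popleft()
        if lba in to_fail:
            to_fail.discard(lba)
            if sort_on_requeue:
                pending = deque(sorted([*pending, lba]))
            else:
                pending.append(lba)  # requeued behind later requests
            continue
        issued.append(lba)
    return issued


in_order = [0, 1024, 2048, 3072]
print(dispatch(in_order, fail_once=[1024]))
# [0, 2048, 3072, 1024]
print(dispatch(in_order, fail_once=[1024], sort_on_requeue=True))
# [0, 1024, 2048, 3072]
```

Even though the requests were submitted in LBA order, a single requeue in the middle moves one of them behind its successors in the first run.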

Khazhy

Christoph Hellwig March 23, 2023, 8:27 a.m. UTC | #4
On Mon, Mar 20, 2023 at 10:22:41AM -0700, Bart Van Assche wrote:
> How to achieve optimal performance with REQ_OP_ZONE_APPEND for SCSI 
> devices? My understanding of how REQ_OP_ZONE_APPEND works for SCSI devices 
> is as follows:
> * ATA devices cannot support this operation directly because there are
>   not enough bits in the ATA sense data to report where appended data
>   has been written.

ATA doesn't really have autosense in the SCSI way.  It could be handled
the same way that CDL completions are handled.  That is a complete
mess, and between CDL and Zone Append we'll probably eventually need
an extended FIS for SATA if we want to keep ATA alive.

> * T10 has not yet started with standardizing a zone append operation.

Time to get it started then!

> * The code that emulates REQ_OP_ZONE_APPEND for SCSI devices (in
>   sd_zbc.c) serializes REQ_OP_ZONE_APPEND operations (QD=1).

Because that's the only thing that actually works.

> * To achieve optimal performance, QD > 1 is required.

If you have something magic that works, this code is the place to take
advantage of it.

Bart Van Assche March 24, 2023, 5:05 p.m. UTC | #5
On 3/23/23 01:27, Christoph Hellwig wrote:
> On Mon, Mar 20, 2023 at 10:22:41AM -0700, Bart Van Assche wrote:
>> * T10 has not yet started with standardizing a zone append operation.
> 
> Time to get it started then!

Hi Christoph,

If someone else wants to work on this that would be great. I do not plan 
to work on this because I do not expect that a SCSI zone append command 
would be standardized by the time we need it. Although there are 
references to T10 drafts in the UFS standard, since a few months JEDEC 
strongly prefers to refer to finalized external standards in its own 
standards. Hence, standardizing zoned storage for UFS would have to wait 
until T10 has published a standard that supports a zone append command. 
INCITS published ZBC-1 in 2016, two years after the first ZBC-1 draft 
was uploaded to the T10 servers. INCITS approved ZBC-2 this month, six 
years after the first ZBC-2 draft was uploaded to the T10 servers. 
Because of the long time it takes to complete new versions of T10 
standards we plan not to wait until T10 has standardized a zone append 
operation.

Thanks,

Bart.

Damien Le Moal March 25, 2023, 2:15 a.m. UTC | #6
On 3/25/23 02:05, Bart Van Assche wrote:
> On 3/23/23 01:27, Christoph Hellwig wrote:
>> On Mon, Mar 20, 2023 at 10:22:41AM -0700, Bart Van Assche wrote:
>>> * T10 has not yet started with standardizing a zone append operation.
>>
>> Time to get it started then!
> 
> Hi Christoph,
> 
> If someone else wants to work on this that would be great. I do not plan 
> to work on this because I do not expect that a SCSI zone append command 
> would be standardized by the time we need it. Although there are 
> references to T10 drafts in the UFS standard, since a few months JEDEC 
> strongly prefers to refer to finalized external standards in its own 
> standards. Hence, standardizing zoned storage for UFS would have to wait 
> until T10 has published a standard that supports a zone append command. 
> INCITS published ZBC-1 in 2016, two years after the first ZBC-1 draft 
> was uploaded to the T10 servers. INCITS approved ZBC-2 this month, six 
> years after the first ZBC-2 draft was uploaded to the T10 servers. 
> Because of the long time it takes to complete new versions of T10 
> standards we plan not to wait until T10 has standardized a zone append 
> operation.

Such standardization effort is likely to face a lot of headwind because defining
a zone append command for ATA (T13 ACS) is not possible with a single
self-contained command (as one cannot return the written sector using sense data
like with SCSI). And when it comes to ZBC, keeping it in sync with ZAC is desired...

Christoph Hellwig March 26, 2023, 11:42 p.m. UTC | #7
On Sat, Mar 25, 2023 at 11:15:40AM +0900, Damien Le Moal wrote:
> Such standardization effort is likely to face a lot of headwind because
> defining a zone append command for ATA (T13 ACS) is not possible with a
> single self-contained command (as one cannot return the written sector
> using sense data like with SCSI).

The same was true for CDL and it got in anyway.  And yes, CDL on ATA
is a complete f**king mess, and needs to be fixed.  So ATA needs to bite
the bullet and extend the FIS anyway, so we might as well get started on
it ASAP.  Fortunately the only implementations that really matter now
are AHCI and SAS expanders, so it sounds very doable to get there.

> And when it comes to ZBC, keeping it in sync with ZAC is desired...

There are so many features in SCSI and not ATA, most notably protection
information, that this sounds like a BS argument to me.  That being
said, supporting Zone Append and properly doing CDL in ATA would be
very useful.

Christoph Hellwig March 26, 2023, 11:44 p.m. UTC | #8
On Fri, Mar 24, 2023 at 10:05:48AM -0700, Bart Van Assche wrote:
> If someone else wants to work on this that would be great. I do not plan to 
> work on this because I do not expect that a SCSI zone append command would 
> be standardized by the time we need it. Although there are references to 
> T10 drafts in the UFS standard, since a few months JEDEC strongly prefers 
> to refer to finalized external standards in its own standards. Hence, 
> standardizing zoned storage for UFS would have to wait until T10 has 
> published a standard that supports a zone append command. INCITS published 
> ZBC-1 in 2016, two years after the first ZBC-1 draft was uploaded to the 
> T10 servers. INCITS approved ZBC-2 this month, six years after the first 
> ZBC-2 draft was uploaded to the T10 servers. Because of the long time it 
> takes to complete new versions of T10 standards we plan not to wait until 
> T10 has standardized a zone append operation.

Which is why we need to start the work now.  Note that I don't think
your time frames matter too much - the first draft of zbc2 is where
people opened up the process again.  The more relevant time frame is
between getting the main new feature in and publishing, which is way
shorter.

Bart Van Assche April 6, 2023, 8:32 p.m. UTC | #9
On 3/26/23 16:44, Christoph Hellwig wrote:
> On Fri, Mar 24, 2023 at 10:05:48AM -0700, Bart Van Assche wrote:
>> If someone else wants to work on this that would be great. I do not plan to
>> work on this because I do not expect that a SCSI zone append command would
>> be standardized by the time we need it. Although there are references to
>> T10 drafts in the UFS standard, since a few months JEDEC strongly prefers
>> to refer to finalized external standards in its own standards. Hence,
>> standardizing zoned storage for UFS would have to wait until T10 has
>> published a standard that supports a zone append command. INCITS published
>> ZBC-1 in 2016, two years after the first ZBC-1 draft was uploaded to the
>> T10 servers. INCITS approved ZBC-2 this month, six years after the first
>> ZBC-2 draft was uploaded to the T10 servers. Because of the long time it
>> takes to complete new versions of T10 standards we plan not to wait until
>> T10 has standardized a zone append operation.
> 
> Which is why we need to start the work now.  Note that I don't think
> your time frames matter too much - the first draft of zbc2 is where
> people opened up the process again.  The more relevant time frame is
> between getting the main new feature in and publishing, which is way
> shorter.

Hi Christoph,

If you help with the npo2 zone size patch series making progress towards 
being integrated in the upstream kernel I will help with the 
standardization of a write append command in the T10 ZBC standard.

Thanks,

Bart.