[v3,00/14] Pass data temperature information to SCSI disk devices

Message ID: 20231017204739.3409052-1-bvanassche@acm.org

Message

Bart Van Assche Oct. 17, 2023, 8:47 p.m. UTC
Hi Jens,

UFS vendors need the data lifetime information to achieve good performance.
Without this information there is significantly higher write amplification due
to garbage collection. Hence this patch series, which adds support for data
lifetime information in F2FS and in the block layer. The SCSI disk (sd)
driver is modified such that it passes write hint information to SCSI devices
via the GROUP NUMBER field.
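
For reference, applications select a data lifetime through the existing
per-inode write hint fcntl (F_SET_RW_HINT with one of the RWH_WRITE_LIFE_*
values). A minimal userspace sketch - not part of this series, and with an
example file path - looks roughly like this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>

/* Fallback UAPI values in case the libc headers do not provide these. */
#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT		1036	/* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2
#endif

int main(void)
{
	uint64_t hint = RWH_WRITE_LIFE_SHORT;	/* short-lived data */
	int fd = open("/mnt/f2fs/testfile", O_WRONLY | O_CREAT, 0644);

	if (fd < 0 || fcntl(fd, F_SET_RW_HINT, &hint) < 0) {
		perror("F_SET_RW_HINT");
		return 1;
	}
	/* Writes to this inode now carry the SHORT lifetime hint. */
	return 0;
}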

Please consider this patch series for the next merge window.

Thanks,

Bart.

Changes compared to v2:
- Instead of storing data lifetime information in bi_ioprio, introduce the
  new struct bio member bi_lifetime and also the struct request member
  'lifetime' (sketched below this list).
- Removed the bio_set_data_lifetime() and bio_get_data_lifetime() functions
  and replaced these with direct assignments.
- Dropped all changes related to I/O priority.
- Improved patch descriptions.
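
A rough sketch of what this looks like, assuming the enum rw_hint type from
the new <linux/rw_hint.h> header (the exact types and field placement in the
actual patches may differ):

/* The lifetime values mirror the existing RWH_WRITE_LIFE_* write hints. */
enum rw_hint {
	WRITE_LIFE_NOT_SET = 0,
	WRITE_LIFE_NONE,
	WRITE_LIFE_SHORT,
	WRITE_LIFE_MEDIUM,
	WRITE_LIFE_LONG,
	WRITE_LIFE_EXTREME,
};

/* Assumed shape of the new fields; only the new members are shown. */
struct bio {
	/* ... existing members ... */
	enum rw_hint	bi_lifetime;	/* data lifetime of the bio's data */
};

struct request {
	/* ... existing members ... */
	enum rw_hint	lifetime;	/* copied from bio->bi_lifetime */
};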

Changes compared to v1:
- Use six bits from the ioprio field for data lifetime information. The
  bio->bi_write_hint / req->write_hint / iocb->ki_hint members that were
  introduced in v1 have been removed again.
- The F_GET_FILE_RW_HINT and F_SET_FILE_RW_HINT fcntls have been removed.
- In the SCSI disk (sd) driver, query the stream status and check the PERM bit.
- The GET STREAM STATUS command has been implemented in the scsi_debug driver.

Bart Van Assche (14):
  fs: Move enum rw_hint into a new header file
  block: Restore data lifetime support in struct bio and struct request
  fs: Restore write hint support
  fs/f2fs: Restore data lifetime support
  scsi: core: Query the Block Limits Extension VPD page
  scsi_proto: Add structures and constants related to I/O groups and
    streams
  sd: Translate data lifetime information
  scsi_debug: Reduce code duplication
  scsi_debug: Support the block limits extension VPD page
  scsi_debug: Rework page code error handling
  scsi_debug: Rework subpage code error handling
  scsi_debug: Implement the IO Advice Hints Grouping mode page
  scsi_debug: Implement GET STREAM STATUS
  scsi_debug: Maintain write statistics per group number

 Documentation/filesystems/f2fs.rst |  70 ++++++++
 block/bio.c                        |   2 +
 block/blk-crypto-fallback.c        |   1 +
 block/blk-merge.c                  |   6 +
 block/blk-mq.c                     |   1 +
 block/bounce.c                     |   1 +
 block/fops.c                       |   3 +
 drivers/scsi/scsi.c                |   2 +
 drivers/scsi/scsi_debug.c          | 247 +++++++++++++++++++++--------
 drivers/scsi/scsi_sysfs.c          |  10 ++
 drivers/scsi/sd.c                  | 111 ++++++++++++-
 drivers/scsi/sd.h                  |   3 +
 fs/f2fs/data.c                     |   2 +
 fs/f2fs/f2fs.h                     |  10 ++
 fs/f2fs/segment.c                  |  95 +++++++++++
 fs/f2fs/super.c                    |  32 +++-
 fs/fcntl.c                         |   1 +
 fs/inode.c                         |   1 +
 fs/iomap/buffered-io.c             |   2 +
 fs/iomap/direct-io.c               |   1 +
 fs/mpage.c                         |   1 +
 include/linux/blk-mq.h             |   2 +
 include/linux/blk_types.h          |   2 +
 include/linux/fs.h                 |  16 +-
 include/linux/rw_hint.h            |  20 +++
 include/scsi/scsi_device.h         |   1 +
 include/scsi/scsi_proto.h          |  75 +++++++++
 27 files changed, 635 insertions(+), 83 deletions(-)
 create mode 100644 include/linux/rw_hint.h

Comments

Jens Axboe Oct. 18, 2023, 7:09 p.m. UTC | #1
On 10/17/23 2:47 PM, Bart Van Assche wrote:
> Hi Jens,
> 
> UFS vendors need the data lifetime information to achieve good performance.
> Without this information there is significantly higher write amplification due
> to garbage collection. Hence this patch series that add support in F2FS and
> also in the block layer for data lifetime information. The SCSI disk (sd)
> driver is modified such that it passes write hint information to SCSI devices
> via the GROUP NUMBER field.
> 
> Please consider this patch series for the next merge window.

My main hesitation with this is that there's a big gap between what
makes theoretical sense and practical sense. When we previously tried
this, turns out devices retained the data temperature on media, as
expected, but tossed it out when data was GC'ed. That made it more of a
benchmarking case than anything else. How do we know that things are
better now? In previous postings I've seen you point at some papers, but
I'm mostly concerned with practical use cases and devices. Are there any
results, at all, from that? Or is this a case of vendors asking for
something to check some marketing boxes or have value add?

I can take a closer look once this is fully understood. Not adding
something like this without proper justification.

I'm also really against growing struct bio just for this. Why is patch 2
not just using the ioprio field at least?
Bart Van Assche Oct. 18, 2023, 7:34 p.m. UTC | #2
On 10/18/23 12:09, Jens Axboe wrote:
> My main hesitation with this is that there's a big gap between what
> makes theoretical sense and practical sense. When we previously tried
> this, turns out devices retained the data temperature on media, as
> expected, but tossed it out when data was GC'ed. That made it more of a
> benchmarking case than anything else. How do we know that things are
> better now? In previous postings I've seen you point at some papers, but
> I'm mostly concerned with practical use cases and devices. Are there any
> results, at all, from that? Or is this a case of vendors asking for
> something to check some marketing boxes or have value add?

Hi Jens,

Multiple UFS vendors made it clear to me that this feature is essential 
for their UFS devices to perform well. I will reach out to some of these
vendors off-list and will ask them to share performance numbers.

A note: persistent stream support is a feature that was only recently added
to the latest SCSI SBC-5 draft. This SCSI specification change allows SCSI
device vendors to interpret the GROUP NUMBER field as a data lifetime. UFS
device vendors have interpreted the GROUP NUMBER field as a data lifetime for
a long time - long before this was allowed by the SCSI standards. See also
the "ContextID" feature in the UFS specification.
That feature is mentioned in every version of the UFS specification I
have access to. The oldest version of the UFS specification I have
access to is version 2.2, published in 2016.
(https://www.jedec.org/system/files/docs/JESD220C-2_2.pdf). This
document is available free of charge after an account has been created 
on the JEDEC website.
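
To make this concrete: in SBC the GROUP NUMBER field occupies the low six
bits of byte 6 of a WRITE(10) CDB. A purely illustrative helper - this is
not the code from the sd patch - could look like this:

#include <linux/types.h>

/* Put a data lifetime value into the GROUP NUMBER field (byte 6, bits 5:0)
 * of a WRITE(10) CDB. Values above 0x3f cannot be represented. */
static void set_write10_group_number(u8 *cdb, u8 group_number)
{
	cdb[6] = (cdb[6] & ~0x3f) | (group_number & 0x3f);
}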

> I'm also really against growing struct bio just for this. Why is patch 2
> not just using the ioprio field at least?

Hmm ... shouldn't the bits in the ioprio field in struct bio have the
same meaning as in the ioprio fields used in interfaces between user
space and the kernel? Damien Le Moal asked me not to use any of the
ioprio bits passing data lifetime information from user space to the kernel.

Is it clear that the size of struct bio has not been changed because the
new bi_lifetime member fills a hole in struct bio?

Thanks,

Bart.
Damien Le Moal Oct. 19, 2023, 12:33 a.m. UTC | #3
On 10/19/23 04:34, Bart Van Assche wrote:
> 
> On 10/18/23 12:09, Jens Axboe wrote:
>> My main hesitation with this is that there's a big gap between what
>> makes theoretical sense and practical sense. When we previously tried
>> this, turns out devices retained the data temperature on media, as
>> expected, but tossed it out when data was GC'ed. That made it more of a
>> benchmarking case than anything else. How do we know that things are
>> better now? In previous postings I've seen you point at some papers, but
>> I'm mostly concerned with practical use cases and devices. Are there any
>> results, at all, from that? Or is this a case of vendors asking for
>> something to check some marketing boxes or have value add?
> 
> Hi Jens,
> 
> Multiple UFS vendors made it clear to me that this feature is essential 
> for their UFS devices to perform well. I will reach out to some of these
> vendors off-list and will ask them to share performance numbers.
> 
> A note: persistent stream support is a feature that was only added
> recently in the latest SCSI SBC-5 draft. This SCSI specification change
> allows SCSI device vendors to interpret the GROUP NUMBER field as a data
> lifetime. UFS device vendors interpret the GROUP NUMBER field as a data
> lifetime since a long time - long before this was allowed by the SCSI
> standards. See also the "ContextID" feature in the UFS specification.
> That feature is mentioned in every version of the UFS specification I
> have access to. The oldest version of the UFS specification I have
> access to is version 2.2, published in 2016.
> (https://www.jedec.org/system/files/docs/JESD220C-2_2.pdf). This
> document is available free of charge after an account has been created 
> on the JEDEC website.
> 
>> I'm also really against growing struct bio just for this. Why is patch 2
>> not just using the ioprio field at least?
> 
> Hmm ... shouldn't the bits in the ioprio field in struct bio have the
> same meaning as in the ioprio fields used in interfaces between user
> space and the kernel? Damien Le Moal asked me not to use any of the
> ioprio bits passing data lifetime information from user space to the kernel.

I said so in the context that if lifetime is a per-inode property, then ioprio
is the wrong interface since the ioprio API is per process or per IO. There is a
mismatch.

One version of your patch series used fcntl() to set the lifetime per inode,
which is fine, and then used the BIO ioprio to pass the lifetime down to the
device driver. That is in theory a nice trick, but it creates conflicts with
the userspace ioprio API if the user uses that at the same time.

So maybe we should change bio ioprio from int to u16 and use the freed-up u16
for lifetime. With that, things are cleanly separated without growing struct bio.

> 
> Is it clear that the size of struct bio has not been changed because the
> new bi_lifetime member fills a hole in struct bio?

When the struct is randomized, holes move or disappear. Don't count on that...

> 
> Thanks,
> 
> Bart.
> 
>
Bart Van Assche Oct. 19, 2023, 4:48 p.m. UTC | #4
On 10/18/23 17:33, Damien Le Moal wrote:
> On 10/19/23 04:34, Bart Van Assche wrote:
 >> On 10/18/23 12:09, Jens Axboe wrote:
>>> I'm also really against growing struct bio just for this. Why is patch 2
>>> not just using the ioprio field at least?
>>
>> Hmm ... shouldn't the bits in the ioprio field in struct bio have the
>> same meaning as in the ioprio fields used in interfaces between user
>> space and the kernel? Damien Le Moal asked me not to use any of the
>> ioprio bits passing data lifetime information from user space to the kernel.
> 
> I said so in the context that if lifetime is a per-inode property, then ioprio
> is the wrong interface since the ioprio API is per process or per IO. There is a
> mismatch.
> 
> One version of your patch series used fnctl() to set the lifetime per inode,
> which is fine, and then used the BIO ioprio to pass the lifetime down to the
> device driver. That is in theory a nice trick, but that creates conflicts with
> the userspace ioprio API if the user uses that at the same time.
> 
> So may be we should change bio ioprio from int to u16 and use the freedup u16
> for lifetime. With that, things are cleanly separated without growing struct bio.

Hmm ... I think that bi_ioprio has been 16 bits wide since the 
introduction of that data structure member in 2016?

>> Is it clear that the size of struct bio has not been changed because the
>> new bi_lifetime member fills a hole in struct bio?
> 
> When the struct is randomized, holes move or disappear. Don't count on that...

We should aim to maximize performance for users who do not use data 
structure layout randomization.

Additionally, I doubt that anyone is using full structure layout 
randomization for SCSI devices. No SCSI driver has any 
__no_randomize_layout / __randomize_layout annotations although I'm sure 
there are plenty of data structures in SCSI drivers for which the layout 
matters.

Thanks,

Bart.
Damien Le Moal Oct. 19, 2023, 10:40 p.m. UTC | #5
On 10/20/23 01:48, Bart Van Assche wrote:
> On 10/18/23 17:33, Damien Le Moal wrote:
>> On 10/19/23 04:34, Bart Van Assche wrote:
>  >> On 10/18/23 12:09, Jens Axboe wrote:
>>>> I'm also really against growing struct bio just for this. Why is patch 2
>>>> not just using the ioprio field at least?
>>>
>>> Hmm ... shouldn't the bits in the ioprio field in struct bio have the
>>> same meaning as in the ioprio fields used in interfaces between user
>>> space and the kernel? Damien Le Moal asked me not to use any of the
>>> ioprio bits passing data lifetime information from user space to the kernel.
>>
>> I said so in the context that if lifetime is a per-inode property, then ioprio
>> is the wrong interface since the ioprio API is per process or per IO. There is a
>> mismatch.
>>
>> One version of your patch series used fnctl() to set the lifetime per inode,
>> which is fine, and then used the BIO ioprio to pass the lifetime down to the
>> device driver. That is in theory a nice trick, but that creates conflicts with
>> the userspace ioprio API if the user uses that at the same time.
>>
>> So may be we should change bio ioprio from int to u16 and use the freedup u16
>> for lifetime. With that, things are cleanly separated without growing struct bio.
> 
> Hmm ... I think that bi_ioprio has been 16 bits wide since the 
> introduction of that data structure member in 2016?

My bad. struct bio->bi_ioprio is an unsigned short. I got confused with the user
API and kernel functions using an int in many places. We really should change
the kernel functions to use unsigned short for ioprio everywhere.

>>> Is it clear that the size of struct bio has not been changed because the
>>> new bi_lifetime member fills a hole in struct bio?
>>
>> When the struct is randomized, holes move or disappear. Don't count on that...
> 
> We should aim to maximize performance for users who do not use data 
> structure layout randomization.
> 
> Additionally, I doubt that anyone is using full structure layout 
> randomization for SCSI devices. No SCSI driver has any 
> __no_randomize_layout / __randomize_layout annotations although I'm sure 
> there are plenty of data structures in SCSI drivers for which the layout 
> matters.

Well, if Jens is OK with adding another "unsigned short bi_lifetime" in a hole
in struct bio, that's fine with me. Otherwise, we are back to discussing how to
pack bi_ioprio in a sensible manner so that we do not create a mess between the
use cases and APIs:
1) inode based lifetime with FS setting up the bi_ioprio field
2) Direct IOs to files of an FS with lifetime set by user per IO (e.g.
aio/io_uring/ioprio_set()) and/or fcntl()
3) Direct IOs to raw block devices with lifetime set by user per IO (e.g.
aio/io_uring/ioprio_set())

Any of the above cases should also allow using the ioprio class/level and CDL hint.

I think the most problematic part is (2), when lifetimes are set with both
fcntl() and per IO: which lifetime is the valid one? The one set with fcntl()
or the one specified for the IO? I think the former is the one we want here.

If we can clarify that, then I guess using 3 or 4 bits from the 10-bit ioprio
hint field should be OK. That would give you 7 or 15 lifetime values. Enough, no?
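
To illustrate what such a packing could look like (hypothetical macros, not
an existing kernel interface; the 16-bit ioprio value currently uses bits
15:13 for the class, 12:3 for the hint and 2:0 for the level):

/* Reuse the low four bits of the ioprio hint field (bits 6:3 of the 16-bit
 * ioprio value) for a data lifetime; class, level and the remaining hint
 * bits stay untouched. Four bits give 15 non-zero lifetime values. */
#define IOPRIO_LIFETIME_SHIFT	3
#define IOPRIO_LIFETIME_MASK	(0xfU << IOPRIO_LIFETIME_SHIFT)

static inline unsigned short ioprio_with_lifetime(unsigned short ioprio,
						  unsigned short lifetime)
{
	return (ioprio & ~IOPRIO_LIFETIME_MASK) |
	       ((lifetime << IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK);
}

static inline unsigned short ioprio_lifetime(unsigned short ioprio)
{
	return (ioprio & IOPRIO_LIFETIME_MASK) >> IOPRIO_LIFETIME_SHIFT;
}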
Damien Le Moal Oct. 19, 2023, 11 p.m. UTC | #6
On 10/20/23 07:40, Damien Le Moal wrote:
> On 10/20/23 01:48, Bart Van Assche wrote:
>> On 10/18/23 17:33, Damien Le Moal wrote:
>>> On 10/19/23 04:34, Bart Van Assche wrote:
>>  >> On 10/18/23 12:09, Jens Axboe wrote:
>>>>> I'm also really against growing struct bio just for this. Why is patch 2
>>>>> not just using the ioprio field at least?
>>>>
>>>> Hmm ... shouldn't the bits in the ioprio field in struct bio have the
>>>> same meaning as in the ioprio fields used in interfaces between user
>>>> space and the kernel? Damien Le Moal asked me not to use any of the
>>>> ioprio bits passing data lifetime information from user space to the kernel.
>>>
>>> I said so in the context that if lifetime is a per-inode property, then ioprio
>>> is the wrong interface since the ioprio API is per process or per IO. There is a
>>> mismatch.
>>>
>>> One version of your patch series used fnctl() to set the lifetime per inode,
>>> which is fine, and then used the BIO ioprio to pass the lifetime down to the
>>> device driver. That is in theory a nice trick, but that creates conflicts with
>>> the userspace ioprio API if the user uses that at the same time.
>>>
>>> So may be we should change bio ioprio from int to u16 and use the freedup u16
>>> for lifetime. With that, things are cleanly separated without growing struct bio.
>>
>> Hmm ... I think that bi_ioprio has been 16 bits wide since the 
>> introduction of that data structure member in 2016?
> 
> My bad. struct bio->bi_ioprio is an unsigned short. I got confused with the user
> API and kernel functions using an int in many places. We really should change
> the kernel functions to use unsigned short for ioprio everywhere.
> 
>>>> Is it clear that the size of struct bio has not been changed because the
>>>> new bi_lifetime member fills a hole in struct bio?
>>>
>>> When the struct is randomized, holes move or disappear. Don't count on that...
>>
>> We should aim to maximize performance for users who do not use data 
>> structure layout randomization.
>>
>> Additionally, I doubt that anyone is using full structure layout 
>> randomization for SCSI devices. No SCSI driver has any 
>> __no_randomize_layout / __randomize_layout annotations although I'm sure 
>> there are plenty of data structures in SCSI drivers for which the layout 
>> matters.
> 
> Well, if Jens is OK with adding another "unsigned short bi_lifetime" in a hole
> in struct bio, that's fine with me. Otherwise, we are back to discussing how to
> pack bi_ioprio in a sensible manner so that we do not create a mess between the
> use cases and APIs:
> 1) inode based lifetime with FS setting up the bi_ioprio field
> 2) Direct IOs to files of an FS with lifetime set by user per IO (e.g.
> aio/io_uring/ioprio_set()) and/or fcntl()
> 3) Direct IOs to raw block devices with lifetime set by user per IO (e.g.
> aio/io_uring/ioprio_set())
> 
> Any of the above case should also allow using ioprio class/level and CDL hint.
> 
> I think the most problematic part is (2) when lifetime are set with both fcntl()
> and per IO: which lifetime is the valid one ? The one set with fcntl() or the
> one specified for the IO ? I think the former is the one we want here.
> 
> If we can clarify that, then I guess using 3 or 4 bits from the 10 bits ioprio
> hint should be OK. That would  give you 7 or 15 lifetime values. Enough no ?

To be clear, we have to deal with these cases:
1) File IOs
  - User uses fcntl() only for lifetime
  - User uses per direct IO ioprio with lifetime (and maybe class/level/cdl)
  - User uses all of the above
2) Raw block device direct IOs
  - Per IO ioprio with lifetime (and maybe class/level/cdl)

(2) is easy. No real change is needed besides the UFS driver bits.
But the cases for (1) need clarification about how things should work.
Bart Van Assche Oct. 20, 2023, 8:45 p.m. UTC | #7
On 10/18/23 12:09, Jens Axboe wrote:
> I'm also really against growing struct bio just for this. Why is patch 2
> not just using the ioprio field at least?

Hi Jens,

Can you please clarify whether your concern is about the size of struct 
bio only or also about the runtime impact of the comparisons that have 
been added in attempt_merge() and blk_rq_merge_ok()? It may be possible 
to eliminate the overhead of the new comparisons as follows (see the sketch
after this list):
* Introduce a union of struct { I/O priority; data lifetime; } and u32.
* Use that union in struct bio instead of bi_ioprio and bi_lifetime.
* Use that union in struct request instead of the ioprio and lifetime
   members.
* In attempt_merge() and blk_rq_merge_ok(), compare the u32 union member
   instead of comparing the I/O priority and data lifetime separately.
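
A rough sketch of that union (member and type names are illustrative, not
from an actual patch):

#include <linux/types.h>

/*
 * Packing the 16-bit I/O priority and the 16-bit data lifetime together
 * lets the merge code compare both with a single 32-bit comparison, e.g.
 * "a->prio_lifetime.combined != b->prio_lifetime.combined".
 */
union bio_prio_lifetime {
	struct {
		unsigned short	ioprio;		/* would replace bi_ioprio */
		unsigned short	lifetime;	/* would replace bi_lifetime */
	};
	u32			combined;
};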

Thanks,

Bart.