Message ID | 20231017204739.3409052-1-bvanassche@acm.org (mailing list archive) |
---|---|
Headers | show |
Series | Pass data temperature information to SCSI disk devices | expand |
On 10/17/23 2:47 PM, Bart Van Assche wrote: > Hi Jens, > > UFS vendors need the data lifetime information to achieve good performance. > Without this information there is significantly higher write amplification due > to garbage collection. Hence this patch series that add support in F2FS and > also in the block layer for data lifetime information. The SCSI disk (sd) > driver is modified such that it passes write hint information to SCSI devices > via the GROUP NUMBER field. > > Please consider this patch series for the next merge window. My main hesitation with this is that there's a big gap between what makes theoretical sense and practical sense. When we previously tried this, turns out devices retained the data temperature on media, as expected, but tossed it out when data was GC'ed. That made it more of a benchmarking case than anything else. How do we know that things are better now? In previous postings I've seen you point at some papers, but I'm mostly concerned with practical use cases and devices. Are there any results, at all, from that? Or is this a case of vendors asking for something to check some marketing boxes or have value add? I can take a closer look once this is fully understood. Not adding something like this without proper justification. I'm also really against growing struct bio just for this. Why is patch 2 not just using the ioprio field at least?
On 10/18/23 12:09, Jens Axboe wrote: > My main hesitation with this is that there's a big gap between what > makes theoretical sense and practical sense. When we previously tried > this, turns out devices retained the data temperature on media, as > expected, but tossed it out when data was GC'ed. That made it more of a > benchmarking case than anything else. How do we know that things are > better now? In previous postings I've seen you point at some papers, but > I'm mostly concerned with practical use cases and devices. Are there any > results, at all, from that? Or is this a case of vendors asking for > something to check some marketing boxes or have value add? Hi Jens, Multiple UFS vendors made it clear to me that this feature is essential for their UFS devices to perform well. I will reach out to some of these vendors off-list and will ask them to share performance numbers. A note: persistent stream support is a feature that was only added recently in the latest SCSI SBC-5 draft. This SCSI specification change allows SCSI device vendors to interpret the GROUP NUMBER field as a data lifetime. UFS device vendors interpret the GROUP NUMBER field as a data lifetime since a long time - long before this was allowed by the SCSI standards. See also the "ContextID" feature in the UFS specification. That feature is mentioned in every version of the UFS specification I have access to. The oldest version of the UFS specification I have access to is version 2.2, published in 2016. (https://www.jedec.org/system/files/docs/JESD220C-2_2.pdf). This document is available free of charge after an account has been created on the JEDEC website. > I'm also really against growing struct bio just for this. Why is patch 2 > not just using the ioprio field at least? Hmm ... shouldn't the bits in the ioprio field in struct bio have the same meaning as in the ioprio fields used in interfaces between user space and the kernel? Damien Le Moal asked me not to use any of the ioprio bits passing data lifetime information from user space to the kernel. Is it clear that the size of struct bio has not been changed because the new bi_lifetime member fills a hole in struct bio? Thanks, Bart.
On 10/19/23 04:34, Bart Van Assche wrote: > > On 10/18/23 12:09, Jens Axboe wrote: >> My main hesitation with this is that there's a big gap between what >> makes theoretical sense and practical sense. When we previously tried >> this, turns out devices retained the data temperature on media, as >> expected, but tossed it out when data was GC'ed. That made it more of a >> benchmarking case than anything else. How do we know that things are >> better now? In previous postings I've seen you point at some papers, but >> I'm mostly concerned with practical use cases and devices. Are there any >> results, at all, from that? Or is this a case of vendors asking for >> something to check some marketing boxes or have value add? > > Hi Jens, > > Multiple UFS vendors made it clear to me that this feature is essential > for their UFS devices to perform well. I will reach out to some of these > vendors off-list and will ask them to share performance numbers. > > A note: persistent stream support is a feature that was only added > recently in the latest SCSI SBC-5 draft. This SCSI specification change > allows SCSI device vendors to interpret the GROUP NUMBER field as a data > lifetime. UFS device vendors interpret the GROUP NUMBER field as a data > lifetime since a long time - long before this was allowed by the SCSI > standards. See also the "ContextID" feature in the UFS specification. > That feature is mentioned in every version of the UFS specification I > have access to. The oldest version of the UFS specification I have > access to is version 2.2, published in 2016. > (https://www.jedec.org/system/files/docs/JESD220C-2_2.pdf). This > document is available free of charge after an account has been created > on the JEDEC website. > >> I'm also really against growing struct bio just for this. Why is patch 2 >> not just using the ioprio field at least? > > Hmm ... shouldn't the bits in the ioprio field in struct bio have the > same meaning as in the ioprio fields used in interfaces between user > space and the kernel? Damien Le Moal asked me not to use any of the > ioprio bits passing data lifetime information from user space to the kernel. I said so in the context that if lifetime is a per-inode property, then ioprio is the wrong interface since the ioprio API is per process or per IO. There is a mismatch. One version of your patch series used fnctl() to set the lifetime per inoe, which is fine, and then used the BIO ioprio to pass the lifetime down to the device driver. That is in theory a nice trick, but that creates conflicts with the userspace ioprio API if the user uses that at the same time. So may be we should change bio ioprio from int to u16 and use the freedup u16 for lifetime. With that, things are cleanly separated without growing struct bio. > > Is it clear that the size of struct bio has not been changed because the > new bi_lifetime member fills a hole in struct bio? When the struct is randomized, holes move or disappear. Don't count on that... > > Thanks, > > Bart. > >
On 10/18/23 17:33, Damien Le Moal wrote: > On 10/19/23 04:34, Bart Van Assche wrote: >> On 10/18/23 12:09, Jens Axboe wrote: >>> I'm also really against growing struct bio just for this. Why is patch 2 >>> not just using the ioprio field at least? >> >> Hmm ... shouldn't the bits in the ioprio field in struct bio have the >> same meaning as in the ioprio fields used in interfaces between user >> space and the kernel? Damien Le Moal asked me not to use any of the >> ioprio bits passing data lifetime information from user space to the kernel. > > I said so in the context that if lifetime is a per-inode property, then ioprio > is the wrong interface since the ioprio API is per process or per IO. There is a > mismatch. > > One version of your patch series used fnctl() to set the lifetime per inode, > which is fine, and then used the BIO ioprio to pass the lifetime down to the > device driver. That is in theory a nice trick, but that creates conflicts with > the userspace ioprio API if the user uses that at the same time. > > So may be we should change bio ioprio from int to u16 and use the freedup u16 > for lifetime. With that, things are cleanly separated without growing struct bio. Hmm ... I think that bi_ioprio has been 16 bits wide since the introduction of that data structure member in 2016? >> Is it clear that the size of struct bio has not been changed because the >> new bi_lifetime member fills a hole in struct bio? > > When the struct is randomized, holes move or disappear. Don't count on that... We should aim to maximize performance for users who do not use data structure layout randomization. Additionally, I doubt that anyone is using full structure layout randomization for SCSI devices. No SCSI driver has any __no_randomize_layout / __randomize_layout annotations although I'm sure there are plenty of data structures in SCSI drivers for which the layout matters. Thanks, Bart.
On 10/20/23 01:48, Bart Van Assche wrote: > On 10/18/23 17:33, Damien Le Moal wrote: >> On 10/19/23 04:34, Bart Van Assche wrote: > >> On 10/18/23 12:09, Jens Axboe wrote: >>>> I'm also really against growing struct bio just for this. Why is patch 2 >>>> not just using the ioprio field at least? >>> >>> Hmm ... shouldn't the bits in the ioprio field in struct bio have the >>> same meaning as in the ioprio fields used in interfaces between user >>> space and the kernel? Damien Le Moal asked me not to use any of the >>> ioprio bits passing data lifetime information from user space to the kernel. >> >> I said so in the context that if lifetime is a per-inode property, then ioprio >> is the wrong interface since the ioprio API is per process or per IO. There is a >> mismatch. >> >> One version of your patch series used fnctl() to set the lifetime per inode, >> which is fine, and then used the BIO ioprio to pass the lifetime down to the >> device driver. That is in theory a nice trick, but that creates conflicts with >> the userspace ioprio API if the user uses that at the same time. >> >> So may be we should change bio ioprio from int to u16 and use the freedup u16 >> for lifetime. With that, things are cleanly separated without growing struct bio. > > Hmm ... I think that bi_ioprio has been 16 bits wide since the > introduction of that data structure member in 2016? My bad. struct bio->bi_ioprio is an unsigned short. I got confused with the user API and kernel functions using an int in many places. We really should change the kernel functions to use unsigned short for ioprio everywhere. >>> Is it clear that the size of struct bio has not been changed because the >>> new bi_lifetime member fills a hole in struct bio? >> >> When the struct is randomized, holes move or disappear. Don't count on that... > > We should aim to maximize performance for users who do not use data > structure layout randomization. > > Additionally, I doubt that anyone is using full structure layout > randomization for SCSI devices. No SCSI driver has any > __no_randomize_layout / __randomize_layout annotations although I'm sure > there are plenty of data structures in SCSI drivers for which the layout > matters. Well, if Jens is OK with adding another "unsigned short bi_lifetime" in a hole in struct bio, that's fine with me. Otherwise, we are back to discussing how to pack bi_ioprio in a sensible manner so that we do not create a mess between the use cases and APIs: 1) inode based lifetime with FS setting up the bi_ioprio field 2) Direct IOs to files of an FS with lifetime set by user per IO (e.g. aio/io_uring/ioprio_set()) and/or fcntl() 3) Direct IOs to raw block devices with lifetime set by user per IO (e.g. aio/io_uring/ioprio_set()) Any of the above case should also allow using ioprio class/level and CDL hint. I think the most problematic part is (2) when lifetime are set with both fcntl() and per IO: which lifetime is the valid one ? The one set with fcntl() or the one specified for the IO ? I think the former is the one we want here. If we can clarify that, then I guess using 3 or 4 bits from the 10 bits ioprio hint should be OK. That would give you 7 or 15 lifetime values. Enough no ?
On 10/20/23 07:40, Damien Le Moal wrote: > On 10/20/23 01:48, Bart Van Assche wrote: >> On 10/18/23 17:33, Damien Le Moal wrote: >>> On 10/19/23 04:34, Bart Van Assche wrote: >> >> On 10/18/23 12:09, Jens Axboe wrote: >>>>> I'm also really against growing struct bio just for this. Why is patch 2 >>>>> not just using the ioprio field at least? >>>> >>>> Hmm ... shouldn't the bits in the ioprio field in struct bio have the >>>> same meaning as in the ioprio fields used in interfaces between user >>>> space and the kernel? Damien Le Moal asked me not to use any of the >>>> ioprio bits passing data lifetime information from user space to the kernel. >>> >>> I said so in the context that if lifetime is a per-inode property, then ioprio >>> is the wrong interface since the ioprio API is per process or per IO. There is a >>> mismatch. >>> >>> One version of your patch series used fnctl() to set the lifetime per inode, >>> which is fine, and then used the BIO ioprio to pass the lifetime down to the >>> device driver. That is in theory a nice trick, but that creates conflicts with >>> the userspace ioprio API if the user uses that at the same time. >>> >>> So may be we should change bio ioprio from int to u16 and use the freedup u16 >>> for lifetime. With that, things are cleanly separated without growing struct bio. >> >> Hmm ... I think that bi_ioprio has been 16 bits wide since the >> introduction of that data structure member in 2016? > > My bad. struct bio->bi_ioprio is an unsigned short. I got confused with the user > API and kernel functions using an int in many places. We really should change > the kernel functions to use unsigned short for ioprio everywhere. > >>>> Is it clear that the size of struct bio has not been changed because the >>>> new bi_lifetime member fills a hole in struct bio? >>> >>> When the struct is randomized, holes move or disappear. Don't count on that... >> >> We should aim to maximize performance for users who do not use data >> structure layout randomization. >> >> Additionally, I doubt that anyone is using full structure layout >> randomization for SCSI devices. No SCSI driver has any >> __no_randomize_layout / __randomize_layout annotations although I'm sure >> there are plenty of data structures in SCSI drivers for which the layout >> matters. > > Well, if Jens is OK with adding another "unsigned short bi_lifetime" in a hole > in struct bio, that's fine with me. Otherwise, we are back to discussing how to > pack bi_ioprio in a sensible manner so that we do not create a mess between the > use cases and APIs: > 1) inode based lifetime with FS setting up the bi_ioprio field > 2) Direct IOs to files of an FS with lifetime set by user per IO (e.g. > aio/io_uring/ioprio_set()) and/or fcntl() > 3) Direct IOs to raw block devices with lifetime set by user per IO (e.g. > aio/io_uring/ioprio_set()) > > Any of the above case should also allow using ioprio class/level and CDL hint. > > I think the most problematic part is (2) when lifetime are set with both fcntl() > and per IO: which lifetime is the valid one ? The one set with fcntl() or the > one specified for the IO ? I think the former is the one we want here. > > If we can clarify that, then I guess using 3 or 4 bits from the 10 bits ioprio > hint should be OK. That would give you 7 or 15 lifetime values. Enough no ? To be clear, we have to deal with these cases: 1) File IOs - User uses fcntl() only for lifetime - User uses per direct IO ioprio with lifetime (and maybe class/level/cdl) - User uses all of the above 2) Raw block device direct IOs - Per IO ioprio with lifetime (and maybe class/level/cdl) (2) is easy. No real change needed beside the UFS driver bits. But the cases for (1) need clarification about how things should work.
On 10/18/23 12:09, Jens Axboe wrote: > I'm also really against growing struct bio just for this. Why is patch 2 > not just using the ioprio field at least? Hi Jens, Can you please clarify whether your concern is about the size of struct bio only or also about the runtime impact of the comparisons that have been added in attempt_merge() and blk_rq_merge_ok()? It may be possible to eliminate the overhead of the new comparisons as follows: * Introduce a union of struct { I/O priority; data lifetime; } and u32. * Use that union in struct bio instead of bi_ioprio and bi_lifetime. * Use that union in struct request instead of the ioprio and lifetime members. * In attempt_merge() and blk_rq_merge_ok(), compare the u32 union member instead of comparing the I/O priority and data lifetime separately. Thanks, Bart.