Message ID | 20231005194129.1882245-4-bvanassche@acm.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Pass data temperature information to UFS devices | expand |
On Thu, Oct 05, 2023 at 12:40:49PM -0700, Bart Van Assche wrote: >The NVMe and SCSI standards define 64 different data lifetimes. Support >storing this information in the I/O priority bitfield. > >The current allocation of the 16 bits in the I/O priority bitfield is as >follows: >* 15..13: I/O priority class >* 12..6: unused >* 5..3: I/O hint (CDL) >* 2..0: I/O priority level > >This patch changes this into the following: >* 15..13: I/O priority class >* 12: unused >* 11..6: data lifetime >* 5..3: I/O hint (CDL) >* 2..0: I/O priority level > >Cc: Damien Le Moal <dlemoal@kernel.org> >Cc: Niklas Cassel <niklas.cassel@wdc.com> >Signed-off-by: Bart Van Assche <bvanassche@acm.org> >--- > include/uapi/linux/ioprio.h | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > >diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h >index bee2bdb0eedb..efe9bc450872 100644 >--- a/include/uapi/linux/ioprio.h >+++ b/include/uapi/linux/ioprio.h >@@ -71,7 +71,7 @@ enum { > * class and level. > */ > #define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS >-#define IOPRIO_HINT_NR_BITS 10 >+#define IOPRIO_HINT_NR_BITS 3 Should the comment[*] also be modified to reflect this change? [*] /* * The 10 bits between the priority class and the priority level are used to * optionally define I/O hints for any combination of I/O priority class and
On 10/6/23 04:40, Bart Van Assche wrote: > The NVMe and SCSI standards define 64 different data lifetimes. Support > storing this information in the I/O priority bitfield. > > The current allocation of the 16 bits in the I/O priority bitfield is as > follows: > * 15..13: I/O priority class > * 12..6: unused > * 5..3: I/O hint (CDL) > * 2..0: I/O priority level > > This patch changes this into the following: > * 15..13: I/O priority class > * 12: unused > * 11..6: data lifetime > * 5..3: I/O hint (CDL) > * 2..0: I/O priority level > > Cc: Damien Le Moal <dlemoal@kernel.org> > Cc: Niklas Cassel <niklas.cassel@wdc.com> > Signed-off-by: Bart Van Assche <bvanassche@acm.org> > --- > include/uapi/linux/ioprio.h | 8 +++++++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h > index bee2bdb0eedb..efe9bc450872 100644 > --- a/include/uapi/linux/ioprio.h > +++ b/include/uapi/linux/ioprio.h > @@ -71,7 +71,7 @@ enum { > * class and level. > */ > #define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS > -#define IOPRIO_HINT_NR_BITS 10 > +#define IOPRIO_HINT_NR_BITS 3 > #define IOPRIO_NR_HINTS (1 << IOPRIO_HINT_NR_BITS) > #define IOPRIO_HINT_MASK (IOPRIO_NR_HINTS - 1) > #define IOPRIO_PRIO_HINT(ioprio) \ > @@ -102,6 +102,12 @@ enum { > IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7, > }; > > +#define IOPRIO_LIFETIME_SHIFT (IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS) > +#define IOPRIO_LIFETIME_NR_BITS 6 > +#define IOPRIO_LIFETIME_MASK ((1u << IOPRIO_LIFETIME_NR_BITS) - 1) > +#define IOPRIO_PRIO_LIFETIME(ioprio) \ > + ((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK) > + > #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max)) I am really not a fan of this. This essentially limits prio hints to CDL, while the initial intent was to define the hints as something generic that depend on the device features. With your change, we will not be able to support new features in the future. Your change seem to assume that it makes sense to be able to combine CDL with lifetime hints. But does it really ? CDL is of dubious value for solid state media and as far as I know, UFS world has not expressed interest. Conversely, data lifetime hints do not make much sense for spin rust media where CDL is important. So I would say that the combination of CDL and lifetime hints is of dubious value. Given this, why not simply define the 64 possible lifetime values as plain hint values (8 to 71, following 1 to 7 for CDL) ? The other question here if you really want to keep the bit separation approach is: do we really need up to 64 different lifetime hints ? While the scsi standard allows that much, does this many different lifetime make sense in practice ? Can we ever think of a usecase that needs more than say 8 different liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8, then we can keep 4 bits unused in the hint field for future features.
On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote: > On 10/6/23 04:40, Bart Van Assche wrote: > > The NVMe and SCSI standards define 64 different data lifetimes. Support > > storing this information in the I/O priority bitfield. > > > > The current allocation of the 16 bits in the I/O priority bitfield is as > > follows: > > * 15..13: I/O priority class > > * 12..6: unused > > * 5..3: I/O hint (CDL) > > * 2..0: I/O priority level > > > > This patch changes this into the following: > > * 15..13: I/O priority class > > * 12: unused > > * 11..6: data lifetime > > * 5..3: I/O hint (CDL) > > * 2..0: I/O priority level > > > > Cc: Damien Le Moal <dlemoal@kernel.org> > > Cc: Niklas Cassel <niklas.cassel@wdc.com> > > Signed-off-by: Bart Van Assche <bvanassche@acm.org> > > --- > > include/uapi/linux/ioprio.h | 8 +++++++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h > > index bee2bdb0eedb..efe9bc450872 100644 > > --- a/include/uapi/linux/ioprio.h > > +++ b/include/uapi/linux/ioprio.h > > @@ -71,7 +71,7 @@ enum { > > * class and level. > > */ > > #define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS > > -#define IOPRIO_HINT_NR_BITS 10 > > +#define IOPRIO_HINT_NR_BITS 3 Can we really redefine this? This is defined to 10 in released kernels, e.g. kernel v6.5. Perhaps a better option would be to keep this defined to 10, and then add new macros that define "specific" IO prio hint "classes". something like IOPRIO_HINT_NR_BITS 10 IOPRIO_HINT_CDL IOPRIO_HINT_CDL_NR_BITS IOPRIO_HINT_LIFETIME IOPRIO_HINT_LIFETIME_NR_BITS lifetime is really a hint, so I think it makes sense for it to be part of the "IOPRIO_HINT bits". > > #define IOPRIO_NR_HINTS (1 << IOPRIO_HINT_NR_BITS) > > #define IOPRIO_HINT_MASK (IOPRIO_NR_HINTS - 1) > > #define IOPRIO_PRIO_HINT(ioprio) \ > > @@ -102,6 +102,12 @@ enum { > > IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7, > > }; > > > > +#define IOPRIO_LIFETIME_SHIFT (IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS) > > +#define IOPRIO_LIFETIME_NR_BITS 6 > > +#define IOPRIO_LIFETIME_MASK ((1u << IOPRIO_LIFETIME_NR_BITS) - 1) > > +#define IOPRIO_PRIO_LIFETIME(ioprio) \ > > + ((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK) > > + > > #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max)) > > I am really not a fan of this. This essentially limits prio hints to CDL, while > the initial intent was to define the hints as something generic that depend on > the device features. With your change, we will not be able to support new > features in the future. > > Your change seem to assume that it makes sense to be able to combine CDL with > lifetime hints. But does it really ? CDL is of dubious value for solid state > media and as far as I know, UFS world has not expressed interest. Conversely, > data lifetime hints do not make much sense for spin rust media where CDL is > important. So I would say that the combination of CDL and lifetime hints is of > dubious value. > > Given this, why not simply define the 64 possible lifetime values as plain hint > values (8 to 71, following 1 to 7 for CDL) ? > > The other question here if you really want to keep the bit separation approach > is: do we really need up to 64 different lifetime hints ? While the scsi > standard allows that much, does this many different lifetime make sense in > practice ? Can we ever think of a usecase that needs more than say 8 different > liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8, > then we can keep 4 bits unused in the hint field for future features. I think the question is: do we ever want to allow ioprio hints to be combined or not? If the answer is no, then go ahead and add the life time hints as values 8 to 71. If the answer is yes, then life time hints need to use unique bits, even if it might not make sense for CDL and life times to be combined. Kind regards, Niklas
On 10/6/23 01:19, Damien Le Moal wrote: > Your change seem to assume that it makes sense to be able to combine CDL with > lifetime hints. But does it really ? CDL is of dubious value for solid state > media and as far as I know, UFS world has not expressed interest. Conversely, > data lifetime hints do not make much sense for spin rust media where CDL is > important. So I would say that the combination of CDL and lifetime hints is of > dubious value. > > Given this, why not simply define the 64 possible lifetime values as plain hint > values (8 to 71, following 1 to 7 for CDL) ? > > The other question here if you really want to keep the bit separation approach > is: do we really need up to 64 different lifetime hints ? While the scsi > standard allows that much, does this many different lifetime make sense in > practice ? Can we ever think of a usecase that needs more than say 8 different > liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8, > then we can keep 4 bits unused in the hint field for future features. Hi Damien, Not supporting CDL for solid state media and supporting eight different lifetime values sounds good to me. Is this perhaps what you had in mind? Thanks, Bart. --- a/include/uapi/linux/ioprio.h +++ b/include/uapi/linux/ioprio.h @@ -100,6 +100,14 @@ enum { IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5, IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6, IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7, + IOPRIO_HINT_DATA_LIFE_TIME_0 = 8, + IOPRIO_HINT_DATA_LIFE_TIME_1 = 9, + IOPRIO_HINT_DATA_LIFE_TIME_2 = 10, + IOPRIO_HINT_DATA_LIFE_TIME_3 = 11, + IOPRIO_HINT_DATA_LIFE_TIME_4 = 12, + IOPRIO_HINT_DATA_LIFE_TIME_5 = 13, + IOPRIO_HINT_DATA_LIFE_TIME_6 = 14, + IOPRIO_HINT_DATA_LIFE_TIME_7 = 15, };
On 10/6/23 11:07, Bart Van Assche wrote: > On 10/6/23 01:19, Damien Le Moal wrote: >> Your change seem to assume that it makes sense to be able to combine >> CDL with >> lifetime hints. But does it really ? CDL is of dubious value for solid >> state >> media and as far as I know, UFS world has not expressed interest. >> Conversely, >> data lifetime hints do not make much sense for spin rust media where >> CDL is >> important. So I would say that the combination of CDL and lifetime >> hints is of >> dubious value. >> >> Given this, why not simply define the 64 possible lifetime values as >> plain hint >> values (8 to 71, following 1 to 7 for CDL) ? >> >> The other question here if you really want to keep the bit separation >> approach >> is: do we really need up to 64 different lifetime hints ? While the scsi >> standard allows that much, does this many different lifetime make >> sense in >> practice ? Can we ever think of a usecase that needs more than say 8 >> different >> liftimes (3 bits) ? If you limit the number of possible lifetime hints >> to 8, >> then we can keep 4 bits unused in the hint field for future features. > > Hi Damien, > > Not supporting CDL for solid state media and supporting eight different > lifetime values sounds good to me. Is this perhaps what you had in mind? > > Thanks, > > Bart. > > --- a/include/uapi/linux/ioprio.h > +++ b/include/uapi/linux/ioprio.h > @@ -100,6 +100,14 @@ enum { > IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5, > IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6, > IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7, > + IOPRIO_HINT_DATA_LIFE_TIME_0 = 8, > + IOPRIO_HINT_DATA_LIFE_TIME_1 = 9, > + IOPRIO_HINT_DATA_LIFE_TIME_2 = 10, > + IOPRIO_HINT_DATA_LIFE_TIME_3 = 11, > + IOPRIO_HINT_DATA_LIFE_TIME_4 = 12, > + IOPRIO_HINT_DATA_LIFE_TIME_5 = 13, > + IOPRIO_HINT_DATA_LIFE_TIME_6 = 14, > + IOPRIO_HINT_DATA_LIFE_TIME_7 = 15, > }; (replying to my own e-mail) Hi Damien, Does the above look good to you? Thanks, Bart.
On 10/12/23 05:51, Bart Van Assche wrote: > On 10/6/23 11:07, Bart Van Assche wrote: >> On 10/6/23 01:19, Damien Le Moal wrote: >>> Your change seem to assume that it makes sense to be able to combine CDL with >>> lifetime hints. But does it really ? CDL is of dubious value for solid state >>> media and as far as I know, UFS world has not expressed interest. Conversely, >>> data lifetime hints do not make much sense for spin rust media where CDL is >>> important. So I would say that the combination of CDL and lifetime hints is of >>> dubious value. >>> >>> Given this, why not simply define the 64 possible lifetime values as plain hint >>> values (8 to 71, following 1 to 7 for CDL) ? >>> >>> The other question here if you really want to keep the bit separation approach >>> is: do we really need up to 64 different lifetime hints ? While the scsi >>> standard allows that much, does this many different lifetime make sense in >>> practice ? Can we ever think of a usecase that needs more than say 8 different >>> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8, >>> then we can keep 4 bits unused in the hint field for future features. >> >> Hi Damien, >> >> Not supporting CDL for solid state media and supporting eight different >> lifetime values sounds good to me. Is this perhaps what you had in mind? >> >> Thanks, >> >> Bart. >> >> --- a/include/uapi/linux/ioprio.h >> +++ b/include/uapi/linux/ioprio.h >> @@ -100,6 +100,14 @@ enum { >> IOPRIO_HINT_DEV_DURATION_LIMIT_5 = 5, >> IOPRIO_HINT_DEV_DURATION_LIMIT_6 = 6, >> IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7, >> + IOPRIO_HINT_DATA_LIFE_TIME_0 = 8, >> + IOPRIO_HINT_DATA_LIFE_TIME_1 = 9, >> + IOPRIO_HINT_DATA_LIFE_TIME_2 = 10, >> + IOPRIO_HINT_DATA_LIFE_TIME_3 = 11, >> + IOPRIO_HINT_DATA_LIFE_TIME_4 = 12, >> + IOPRIO_HINT_DATA_LIFE_TIME_5 = 13, >> + IOPRIO_HINT_DATA_LIFE_TIME_6 = 14, >> + IOPRIO_HINT_DATA_LIFE_TIME_7 = 15, >> }; > > (replying to my own e-mail) > > Hi Damien, > > Does the above look good to you? Yes, it is what I was thinking of and I think it looks much better (and simpler) than coding with different bits. However, I think this is acceptable only if everyone agrees that combining CDL and lifetime (and potentially other hints) is not a sensible thing to do. I stated that my thinking is that CDL is more geared toward rotating media while lifetime is more suitable for solid state media. But does everyone agree ? Some have stated interest in CDL in NVMe-oF context, which could imply that combining CDL and lifetime may be something useful to do in that space... Getting more opinions about this would be nice. If we do not get any, I would say that we should go with this.
On 10/11/23 18:02, Damien Le Moal wrote: > Some have stated interest in CDL in NVMe-oF context, which could > imply that combining CDL and lifetime may be something useful to do > in that space... We are having this discussion because bi_ioprio is sixteen bits wide and because we don't want to make struct bio larger. How about expanding the bi_ioprio field from 16 to 32 bits and to use separate bits for CDL information and data lifetimes? This patch does not make struct bio bigger because it changes a three byte hole with a one byte hole: --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -268,8 +268,8 @@ struct bio { * req_flags. */ unsigned short bi_flags; /* BIO_* below */ - unsigned short bi_ioprio; blk_status_t bi_status; + u32 bi_ioprio; atomic_t __bi_remaining; struct bvec_iter bi_iter; From the pahole output: struct bio { struct bio * bi_next; /* 0 8 */ struct block_device * bi_bdev; /* 8 8 */ blk_opf_t bi_opf; /* 16 4 */ short unsigned int bi_flags; /* 20 2 */ blk_status_t bi_status; /* 22 1 */ /* XXX 1 byte hole, try to pack */ u32 bi_ioprio; /* 24 4 */ atomic_t __bi_remaining; /* 28 4 */ struct bvec_iter bi_iter; /* 32 20 */ blk_qc_t bi_cookie; /* 52 4 */ bio_end_io_t * bi_end_io; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ void * bi_private; /* 64 8 */ struct bio_crypt_ctx * bi_crypt_context; /* 72 8 */ union { struct bio_integrity_payload * bi_integrity; /* 80 8 */ }; /* 80 8 */ short unsigned int bi_vcnt; /* 88 2 */ short unsigned int bi_max_vecs; /* 90 2 */ atomic_t __bi_cnt; /* 92 4 */ struct bio_vec * bi_io_vec; /* 96 8 */ struct bio_set * bi_pool; /* 104 8 */ struct bio_vec bi_inline_vecs[]; /* 112 0 */ /* size: 112, cachelines: 2, members: 19 */ /* sum members: 111, holes: 1, sum holes: 1 */ /* last cacheline: 48 bytes */ }; Thanks, Bart.
On 10/13/23 03:00, Bart Van Assche wrote: > On 10/11/23 18:02, Damien Le Moal wrote: >> Some have stated interest in CDL in NVMe-oF context, which could >> imply that combining CDL and lifetime may be something useful to do >> in that space... > > We are having this discussion because bi_ioprio is sixteen bits wide and > because we don't want to make struct bio larger. How about expanding the > bi_ioprio field from 16 to 32 bits and to use separate bits for CDL > information and data lifetimes? I guess we could do that as well. User side aio_reqprio field of struct aiocb, which is used by io_uring and libaio, is an int, so 32-bits also. Changing bi_ioprio to match that should not cause regressions or break user space I think. Kernel uapi ioprio.h will need some massaging though. > This patch does not make struct bio bigger because it changes a three > byte hole with a one byte hole: Yeah, but if the kernel is compiled with struct randomization, that does not really apply, doesn't it ? Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for the lifetime may have one drawback: that information will be propagated to the device only for direct IOs, no ? For buffered IOs, the information will be lost. The other potential disadvantage of the ioprio interface is that we cannot define ioprio+hint per file (or per inode really), unlike the old write_hint that you initially reintroduced. Are these points blockers for the user API you were thinking of ? How do you envision the user specifying lifetime ? Per file ? Or are you thinking of not relying on the user to specify that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I think ioprio+hint is fine (it is simple). But if it is the former, the ioprio API may not be the best suited for the job at hand.
On Fri, Oct 13, 2023 at 10:08:29AM +0900, Damien Le Moal wrote: > On 10/13/23 03:00, Bart Van Assche wrote: > > On 10/11/23 18:02, Damien Le Moal wrote: > >> Some have stated interest in CDL in NVMe-oF context, which could > >> imply that combining CDL and lifetime may be something useful to do > >> in that space... > > > > We are having this discussion because bi_ioprio is sixteen bits wide and > > because we don't want to make struct bio larger. How about expanding the > > bi_ioprio field from 16 to 32 bits and to use separate bits for CDL > > information and data lifetimes? > > I guess we could do that as well. User side aio_reqprio field of struct aiocb, > which is used by io_uring and libaio, is an int, so 32-bits also. Changing > bi_ioprio to match that should not cause regressions or break user space I > think. Kernel uapi ioprio.h will need some massaging though. > > > This patch does not make struct bio bigger because it changes a three > > byte hole with a one byte hole: > > Yeah, but if the kernel is compiled with struct randomization, that does not > really apply, doesn't it ? > > Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for > the lifetime may have one drawback: that information will be propagated to the > device only for direct IOs, no ? For buffered IOs, the information will be > lost. The other potential disadvantage of the ioprio interface is that we > cannot define ioprio+hint per file (or per inode really), unlike the old > write_hint that you initially reintroduced. Are these points blockers for the > user API you were thinking of ? How do you envision the user specifying > lifetime ? Per file ? Or are you thinking of not relying on the user to specify > that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I > think ioprio+hint is fine (it is simple). But if it is the former, the ioprio > API may not be the best suited for the job at hand. Hello Damien, If you look closer at this series, you will see that even V2 of this series still uses fcntl F_SET_RW_HINT as the user facing API. This series simply take the value from fcntl F_SET_RW_HINT (inode->i_write_hint) and stores it in bio->ioprio. So it is not really using the ioprio user API. See the patch to e.g. buffered-io.c: --- a/fs/iomap/buffered-io.c +++ b/fs/iomap/buffered-io.c @@ -6,6 +6,7 @@ #include <linux/module.h> #include <linux/compiler.h> #include <linux/fs.h> +#include <linux/fs-lifetime.h> #include <linux/iomap.h> #include <linux/pagemap.h> #include <linux/uio.h> @@ -1660,6 +1661,7 @@ iomap_alloc_ioend(struct inode *inode, struct iomap_writepage_ctx *wpc, REQ_OP_WRITE | wbc_to_write_flags(wbc), GFP_NOFS, &iomap_ioend_bioset); bio->bi_iter.bi_sector = sector; + bio_set_data_lifetime(bio, inode->i_write_hint); wbc_init_bio(wbc, bio); ioend = container_of(bio, struct iomap_ioend, io_inline_bio); In commit c75e707fe1aa ("block: remove the per-bio/request write hint") this line from fs/direct-io.c was removed: - bio->bi_write_hint = dio->iocb->ki_hint; I'm not sure why this series does not readd a similar line to set the lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c. I still don't understand what happens if one uses io_uring to write to a file on a f2fs filesystem using buffered-io, with both inode->i_write_hint set using fcntl F_SET_RW_HINT, and bits belonging to life time hints set in the io_uring SQE (sqe->ioprio). I'm guessing that the: bio_set_data_lifetime(bio, inode->i_write_hint); call above in buffered-io.c, will simply overwrite whatever value that was already stored in bio->ioprio. (Because if I understand correctly, bio->ioprio will initially be set to the value in the io_uring SQE (sqe->ioprio).) Kind regards, Niklas
On 10/12/23 18:08, Damien Le Moal wrote: > On 10/13/23 03:00, Bart Van Assche wrote: >> We are having this discussion because bi_ioprio is sixteen bits wide and >> because we don't want to make struct bio larger. How about expanding the >> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL >> information and data lifetimes? > > I guess we could do that as well. User side aio_reqprio field of struct aiocb, > which is used by io_uring and libaio, is an int, so 32-bits also. Changing > bi_ioprio to match that should not cause regressions or break user space I > think. Kernel uapi ioprio.h will need some massaging though. Hmm ... are we perhaps looking at different kernel versions? This is what I found: $ git grep -nHE 'ioprio;|reqprio;' include/uapi/linux/{io_uring,aio_abi}.h include/uapi/linux/aio_abi.h:89: __s16 aio_reqprio; include/uapi/linux/io_uring.h:33: __u16 ioprio; /* ioprio for the request */ The struct iocb used for asynchronous I/O has a size of 64 bytes and does not have any holes. struct io_uring_sqe also has a size of 64 bytes and does not have any holes either. The ioprio_set() and ioprio_get() system calls use the data type int so these wouldn't need any changes to increase the number of ioprio bits. > Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for > the lifetime may have one drawback: that information will be propagated to the > device only for direct IOs, no ? For buffered IOs, the information will be > lost. The other potential disadvantage of the ioprio interface is that we > cannot define ioprio+hint per file (or per inode really), unlike the old > write_hint that you initially reintroduced. Are these points blockers for the > user API you were thinking of ? How do you envision the user specifying > lifetime ? Per file ? Or are you thinking of not relying on the user to specify > that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I > think ioprio+hint is fine (it is simple). But if it is the former, the ioprio > API may not be the best suited for the job at hand. The way I see it is that the primary purpose of the bits in the bi_ioprio member that are used for the data lifetime is to allow filesystems to provide data lifetime information to block drivers. Specifying data lifetime information for direct I/O is convenient when writing test scripts that verify whether data lifetime supports works correctly. There may be other use cases but this is not my primary focus. I think that applications that want to specify data lifetime information should use fcntl(fd, F_SET_RW_HINT, ...). It is up to the filesystem to make sure that this information ends up in the bi_ioprio field. The block layer is responsible for passing the information in the bi_ioprio member to block drivers. Filesystems can support multiple policies for combining the i_write_hint and other information into a data lifetime. See also the whint_mode restored by patch 05/15 in this series. Thanks, Bart.
On 10/13/23 02:33, Niklas Cassel wrote: > In commit c75e707fe1aa ("block: remove the per-bio/request write hint") > this line from fs/direct-io.c was removed: > - bio->bi_write_hint = dio->iocb->ki_hint; > > I'm not sure why this series does not readd a similar line to set the > lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c. It depends on how we want the user to specify the data lifetime for direct I/O. This assignment is not modified by this patch series and copies the data lifetime information from the ioprio bitfield from user space into the bio: bio->bi_ioprio = dio->iocb->ki_ioprio; > I still don't understand what happens if one uses io_uring to write > to a file on a f2fs filesystem using buffered-io, with both > inode->i_write_hint set using fcntl F_SET_RW_HINT, and bits belonging > to life time hints set in the io_uring SQE (sqe->ioprio). Is the documentation of the whint_mode mount option in patch 5/15 of this series sufficient to answer the above question? Thanks, Bart.
On 10/14/23 05:18, Bart Van Assche wrote: > On 10/12/23 18:08, Damien Le Moal wrote: >> On 10/13/23 03:00, Bart Van Assche wrote: >>> We are having this discussion because bi_ioprio is sixteen bits wide and >>> because we don't want to make struct bio larger. How about expanding the >>> bi_ioprio field from 16 to 32 bits and to use separate bits for CDL >>> information and data lifetimes? >> >> I guess we could do that as well. User side aio_reqprio field of struct aiocb, >> which is used by io_uring and libaio, is an int, so 32-bits also. Changing >> bi_ioprio to match that should not cause regressions or break user space I >> think. Kernel uapi ioprio.h will need some massaging though. > > Hmm ... are we perhaps looking at different kernel versions? This is > what I found: > > $ git grep -nHE 'ioprio;|reqprio;' include/uapi/linux/{io_uring,aio_abi}.h > include/uapi/linux/aio_abi.h:89: __s16 aio_reqprio; > include/uapi/linux/io_uring.h:33: __u16 ioprio; /* ioprio for the > request */ My bad. I looked at "man aio" but that is the posix AIO API, not Linux native. > The struct iocb used for asynchronous I/O has a size of 64 bytes and > does not have any holes. struct io_uring_sqe also has a size of 64 bytes > and does not have any holes either. The ioprio_set() and ioprio_get() > system calls use the data type int so these wouldn't need any changes to > increase the number of ioprio bits. Yes, but I think it would be better to keep the bio bi_ioprio field size synced with the per AIO aio_reqprio/ioprio for libaio and io_uring, that is, 16-bits. >> Reading Niklas's reply to Kanchan, I was reminded that using ioprio hint for >> the lifetime may have one drawback: that information will be propagated to the >> device only for direct IOs, no ? For buffered IOs, the information will be >> lost. The other potential disadvantage of the ioprio interface is that we >> cannot define ioprio+hint per file (or per inode really), unlike the old >> write_hint that you initially reintroduced. Are these points blockers for the >> user API you were thinking of ? How do you envision the user specifying >> lifetime ? Per file ? Or are you thinking of not relying on the user to specify >> that but rather the FS (e.g. f2fs) deciding on its own ? If it is the latter, I >> think ioprio+hint is fine (it is simple). But if it is the former, the ioprio >> API may not be the best suited for the job at hand. > > The way I see it is that the primary purpose of the bits in the > bi_ioprio member that are used for the data lifetime is to allow > filesystems to provide data lifetime information to block drivers. > > Specifying data lifetime information for direct I/O is convenient when > writing test scripts that verify whether data lifetime supports works > correctly. There may be other use cases but this is not my primary > focus. > > I think that applications that want to specify data lifetime information > should use fcntl(fd, F_SET_RW_HINT, ...). It is up to the filesystem to > make sure that this information ends up in the bi_ioprio field. The > block layer is responsible for passing the information in the bi_ioprio > member to block drivers. Filesystems can support multiple policies for > combining the i_write_hint and other information into a data lifetime. > See also the whint_mode restored by patch 05/15 in this series. Explaining this in the cover letter of the series would be helpful for one to understand your view of how the information is propagated from user to device. I am not a fan of having a fcntl() call ending up modifying the ioprio of IOs using hints, given that hints in themselves are already a user facing information/API. This is confusing... What if we have a user issue direct IOs with a lifetime value hint on a file that has a different lifetime set with fcntl() ? And I am sure there are other corner cases like this. Given that lifetime is per file (inode) and IO prio is per process or per I/O, having different user APIs makes sense. The issue of not growing (if possible) the bio and request structures remains. For bio, you identified a hole already, so what about using another 16-bits field for lifetime ? Not sure for requests. I thought also of a union with bi_ioprio, but that would prevent using lifetime and IO priority together, which is not ideal. > > Thanks, > > Bart.
On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote: > Your change seem to assume that it makes sense to be able to combine CDL with > lifetime hints. But does it really ? Yes, it does. > CDL is of dubious value for solid state > media and as far as I know, No, it's pretty useful and I'd bet my 2 cents that it will eventually show up in relevant standards and devices. Even if it wasn't making our user interfaces exclusive would be a massive pain. > The other question here if you really want to keep the bit separation approach > is: do we really need up to 64 different lifetime hints ? While the scsi > standard allows that much, does this many different lifetime make sense in > practice ? Can we ever think of a usecase that needs more than say 8 different > liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8, > then we can keep 4 bits unused in the hint field for future features. Yes, I think this is the smoking gun. We should be fine with a much more limited number of lifetime hints, i.e. the user interface only exposes 5 hints, and supporting more in the in-kernel interfaces seems of rather doubtfuŀ use.
On Fri, Oct 13, 2023 at 02:20:23PM -0700, Bart Van Assche wrote: > On 10/13/23 02:33, Niklas Cassel wrote: > > In commit c75e707fe1aa ("block: remove the per-bio/request write hint") > > this line from fs/direct-io.c was removed: > > - bio->bi_write_hint = dio->iocb->ki_hint; > > > > I'm not sure why this series does not readd a similar line to set the > > lifetime (using bio_set_data_lifetime()) also for fs/direct-io.c. > > It depends on how we want the user to specify the data lifetime for > direct I/O. This assignment is not modified by this patch series and > copies the data lifetime information from the ioprio bitfield from user > space into the bio: > > bio->bi_ioprio = dio->iocb->ki_ioprio; Before per-bio/request write hints were removed, things looked like this: io_uring.c: req->rw.kiocb.ki_hint = ki_hint_validate(file_write_hint(req->file)); fs/fcntl.c: static inline enum rw_hint file_write_hint(struct file *file) { if (file->f_write_hint != WRITE_LIFE_NOT_SET) return file->f_write_hint; return file_inode(file)->i_write_hint; } direct-io.c: bio->bi_write_hint = dio->iocb->ki_hint; buffered-io.c: bio->bi_write_hint = inode->i_write_hint; After this series, things instead look like this: direct-io.c: bio->bi_ioprio = dio->iocb->ki_ioprio; buffered-io.c: bio_set_data_lifetime(bio, inode->i_write_hint); So when you say: "It depends on how we want the user to specify the data lifetime for direct I/O.", do you mean that buffered I/O should use fcntl() to specify data lifetime, but direct I/O should used Linux IO priority API to specify the same? Because, to me that seems to be how the series is currently working. (I am sorry if I am missing something.) Kind regards, Niklas
On 10/15/23 15:22, Damien Le Moal wrote: > Given that lifetime is per file (inode) and IO prio is per process or per I/O, > having different user APIs makes sense. The issue of not growing (if possible) > the bio and request structures remains. For bio, you identified a hole already, > so what about using another 16-bits field for lifetime ? Not sure for requests. > I thought also of a union with bi_ioprio, but that would prevent using lifetime > and IO priority together, which is not ideal. There is a challenge: F_SET_RW_HINT / F_GET_RW_HINT are per inode and hence submitting direct I/O with different lifetimes to a single file by opening that file multiple times and by using F_SET_FILE_RW_HINT won't be possible. fio supports this use case. See also fio commit bd553af6c849 ("Update API for file write hints"). If nobody objects I won't restore F_SET_FILE_RW_HINT support. Furthermore, I plan to drop the bi_ioprio changes and introduce a new u8 struct bio member instead: bi_lifetime. I think 8 bits is enough since the NVMe and SCSI data lifetime fields are six bits wide. There is a two byte hole in struct request past the ioprio field. I'd like to use that space for a new data lifetime member. Thanks, Bart.
On 10/15/23 23:17, Christoph Hellwig wrote: > On Fri, Oct 06, 2023 at 05:19:52PM +0900, Damien Le Moal wrote: >> Your change seem to assume that it makes sense to be able to combine CDL with >> lifetime hints. But does it really ? > > Yes, it does. > > >> CDL is of dubious value for solid state >> media and as far as I know, > > No, it's pretty useful and I'd bet my 2 cents that it will eventually > show up in relevant standards and devices. > > Even if it wasn't making our user interfaces exclusive would be a > massive pain. > >> The other question here if you really want to keep the bit separation approach >> is: do we really need up to 64 different lifetime hints ? While the scsi >> standard allows that much, does this many different lifetime make sense in >> practice ? Can we ever think of a usecase that needs more than say 8 different >> liftimes (3 bits) ? If you limit the number of possible lifetime hints to 8, >> then we can keep 4 bits unused in the hint field for future features. > > Yes, I think this is the smoking gun. We should be fine with a much > more limited number of lifetime hints, i.e. the user interface only > exposes 5 hints, and supporting more in the in-kernel interfaces seems > of rather doubtfuŀ use. Thanks for the feedback Christoph. I will address this feedback in the next version of this patch series. Bart.
On 10/16/23 02:20, Niklas Cassel wrote: > So when you say: > "It depends on how we want the user to specify the data lifetime for > direct I/O.", do you mean that buffered I/O should use fcntl() to specify > data lifetime, but direct I/O should used Linux IO priority API to specify > the same? > > Because, to me that seems to be how the series is currently working. > (I am sorry if I am missing something.) Let's drop support for passing the data lifetime in the I/O priority field from user space as Damien proposed. The remaining question is whether only to restore support for F_SET_RW_HINT (per inode) or also for F_SET_FILE_RW_HINT (per file)? Thanks, Bart.
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h index bee2bdb0eedb..efe9bc450872 100644 --- a/include/uapi/linux/ioprio.h +++ b/include/uapi/linux/ioprio.h @@ -71,7 +71,7 @@ enum { * class and level. */ #define IOPRIO_HINT_SHIFT IOPRIO_LEVEL_NR_BITS -#define IOPRIO_HINT_NR_BITS 10 +#define IOPRIO_HINT_NR_BITS 3 #define IOPRIO_NR_HINTS (1 << IOPRIO_HINT_NR_BITS) #define IOPRIO_HINT_MASK (IOPRIO_NR_HINTS - 1) #define IOPRIO_PRIO_HINT(ioprio) \ @@ -102,6 +102,12 @@ enum { IOPRIO_HINT_DEV_DURATION_LIMIT_7 = 7, }; +#define IOPRIO_LIFETIME_SHIFT (IOPRIO_HINT_SHIFT + IOPRIO_HINT_NR_BITS) +#define IOPRIO_LIFETIME_NR_BITS 6 +#define IOPRIO_LIFETIME_MASK ((1u << IOPRIO_LIFETIME_NR_BITS) - 1) +#define IOPRIO_PRIO_LIFETIME(ioprio) \ + ((ioprio >> IOPRIO_LIFETIME_SHIFT) & IOPRIO_LIFETIME_MASK) + #define IOPRIO_BAD_VALUE(val, max) ((val) < 0 || (val) >= (max)) /*
The NVMe and SCSI standards define 64 different data lifetimes. Support storing this information in the I/O priority bitfield. The current allocation of the 16 bits in the I/O priority bitfield is as follows: * 15..13: I/O priority class * 12..6: unused * 5..3: I/O hint (CDL) * 2..0: I/O priority level This patch changes this into the following: * 15..13: I/O priority class * 12: unused * 11..6: data lifetime * 5..3: I/O hint (CDL) * 2..0: I/O priority level Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- include/uapi/linux/ioprio.h | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-)