Message ID | 20240924092457.7846-4-joshi.k@samsung.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v6,1/3] nvme: enable FDP support | expand |
Per-I/O hints really can't work for file system interfaes. They will have to be a clear opt in to support for regular files. I'll do a more detailed review when I'm back next week.
On 9/24/24 11:24, Kanchan Joshi wrote: > With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and > all the subsequent writes on the file pass that hint value down. > This can be limiting for large files (and for block device) as all the > writes can be tagged with only one lifetime hint value. > Concurrent writes (with different hint values) are hard to manage. > Per-IO hinting solves that problem. > > Allow userspace to pass the write hint type and its value in the SQE. > Two new fields are carved in the leftover space of SQE: > __u8 hint_type; > __u64 hint_val; > > Adding the hint_type helps in keeping the interface extensible for future > use. > At this point only one type TYPE_WRITE_LIFETIME_HINT is supported. With > this type, user can pass the lifetime hint values that are currently > supported by F_SET_RW_HINT fcntl. > > The write handlers (io_prep_rw, io_write) process the hint type/value > and hint value is passed to lower-layer using kiocb. This is good for > supporting direct IO, but not when kiocb is not available (e.g., > buffered IO). > > In general, per-io hints take the precedence on per-inode hints. > Three cases to consider: > > Case 1: When hint_type is 0 (explicitly, or implicitly as SQE fields are > initialized to 0), this means user did not send any hint. The per-inode > hint values are set in the kiocb (as before). > > Case 2: When hint_type is TYPE_WRITE_LIFETIME_HINT, the hint_value is > set into the kiocb after sanity checking. > > Case 3: When hint_type is anything else, this is flagged as an error > and write is failed. > > Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> > Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com> > --- > fs/fcntl.c | 22 ---------------------- > include/linux/rw_hint.h | 24 ++++++++++++++++++++++++ > include/uapi/linux/io_uring.h | 10 ++++++++++ > io_uring/rw.c | 21 ++++++++++++++++++++- > 4 files changed, 54 insertions(+), 23 deletions(-) > > diff --git a/fs/fcntl.c b/fs/fcntl.c > index 081e5e3d89ea..2eb78035a350 100644 > --- a/fs/fcntl.c > +++ b/fs/fcntl.c > @@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg) > } > #endif > > -static bool rw_hint_valid(u64 hint) > -{ > - BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET); > - BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE); > - BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT); > - BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM); > - BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG); > - BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME); > - > - switch (hint) { > - case RWH_WRITE_LIFE_NOT_SET: > - case RWH_WRITE_LIFE_NONE: > - case RWH_WRITE_LIFE_SHORT: > - case RWH_WRITE_LIFE_MEDIUM: > - case RWH_WRITE_LIFE_LONG: > - case RWH_WRITE_LIFE_EXTREME: > - return true; > - default: > - return false; > - } > -} > - > static long fcntl_get_rw_hint(struct file *file, unsigned int cmd, > unsigned long arg) > { > diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h > index 309ca72f2dfb..f4373a71ffed 100644 > --- a/include/linux/rw_hint.h > +++ b/include/linux/rw_hint.h > @@ -21,4 +21,28 @@ enum rw_hint { > static_assert(sizeof(enum rw_hint) == 1); > #endif > > +#define WRITE_LIFE_INVALID (RWH_WRITE_LIFE_EXTREME + 1) > + > +static inline bool rw_hint_valid(u64 hint) > +{ > + BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET); > + BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE); > + BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT); > + BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM); > + BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG); > + BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME); > + > + switch (hint) { > + case RWH_WRITE_LIFE_NOT_SET: > + case RWH_WRITE_LIFE_NONE: > + case RWH_WRITE_LIFE_SHORT: > + case RWH_WRITE_LIFE_MEDIUM: > + case RWH_WRITE_LIFE_LONG: > + case RWH_WRITE_LIFE_EXTREME: > + return true; > + default: > + return false; > + } > +} > + > #endif /* _LINUX_RW_HINT_H */ > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 1fe79e750470..e21a74dd0c49 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -98,6 +98,11 @@ struct io_uring_sqe { > __u64 addr3; > __u64 __pad2[1]; > }; > + struct { > + /* To send per-io hint type/value with write command */ > + __u64 hint_val; > + __u8 hint_type; > + }; Why is 'hint_val' 64 bits? Everything else is 8 bytes, so wouldn't it be better to shorten that? As it stands the new struct will introduce a hole of 24 bytes after 'hint_type'. Cheers, Hannes
On 9/25/2024 11:27 AM, Hannes Reinecke wrote: >> @@ -98,6 +98,11 @@ struct io_uring_sqe { >> __u64 addr3; >> __u64 __pad2[1]; >> }; >> + struct { >> + /* To send per-io hint type/value with write command */ >> + __u64 hint_val; >> + __u8 hint_type; >> + }; > Why is 'hint_val' 64 bits? Everything else is 8 bytes, so wouldn't it > be better to shorten that? Right, within kernel hint is stored as 8bits value. But I chose not because how kernel stores hint internally (which may change at any time) but how the existing F_SET_RW_HINT interface exposed this to user space. It expects u64. If we do 8bits interface here, application needs to learn that for the same lifetime hint it needs u64 for fcntl interface, but u8 for io_uring interface. That seems a bit confusing. Also, in future if we do support another hint type, we may be able to pass hint_val beyond what can be supported by u8. As it stands the new struct will introduce > a hole of 24 bytes after 'hint_type'. This gets implicitly padded at this point [1][2], and overall size is still capped by largest struct (which is of 16 bytes, placed just above this). [1] On 64bit » union { » » struct { » » » __u64 addr3; /* 48 8 */ » » » __u64 __pad2[1]; /* 56 8 */ » » }; /* 48 16 */ » » struct { » » » __u64 hint_val; /* 48 8 */ » » » __u8 hint_type; /* 56 1 */ » » }; /* 48 16 */ » » __u64 optval; /* 48 8 */ » » __u8 cmd[0]; /* 48 0 */ » }; /* 48 16 */ » /* size: 64, cachelines: 1, members: 13 */ [2] On 32bit » union { » » struct { » » » __u64 addr3; /* 48 8 */ » » » __u64 __pad2[1]; /* 56 8 */ » » }; /* 48 16 */ » » struct { » » » __u64 hint_val; /* 48 8 */ » » » __u8 hint_type; /* 56 1 */ » » }; /* 48 12 */ » » __u64 optval; /* 48 8 */ » » __u8 cmd[0]; /* 48 0 */ » }; /* 48 16 */ » /* size: 64, cachelines: 1, members: 13 */ };
On 9/25/24 12:09, Kanchan Joshi wrote: > On 9/25/2024 11:27 AM, Hannes Reinecke wrote: ... > As it stands the new struct will introduce >> a hole of 24 bytes after 'hint_type'. > > This gets implicitly padded at this point [1][2], and overall size is > still capped by largest struct (which is of 16 bytes, placed just above > this). For me it's about having hardly usable in the future by anyone else 7 bytes of space or how much that will be. Try to add another field using those bytes and endianess will start messing with you. And 7 bytes is not that convenient. I have same problem with how commands were merged while I was not looking. There was no explicit padding, and it split u64 into u32 and implicit padding, so no apps can use the space to put a pointer anymore while there was a much better option of using one of existing 4B fields. > [1] On 64bit > » union { > » » struct { > » » » __u64 addr3; /* 48 8 */ > » » » __u64 __pad2[1]; /* 56 8 */ > » » }; /* 48 16 */ > » » struct { > » » » __u64 hint_val; /* 48 8 */ > » » » __u8 hint_type; /* 56 1 */ > » » }; /* 48 16 */ > » » __u64 optval; /* 48 8 */ > » » __u8 cmd[0]; /* 48 0 */ > » }; /* 48 16 */ > > » /* size: 64, cachelines: 1, members: 13 */ > > [2] On 32bit > > » union { > » » struct { > » » » __u64 addr3; /* 48 8 */ > » » » __u64 __pad2[1]; /* 56 8 */ > » » }; /* 48 16 */ > » » struct { > » » » __u64 hint_val; /* 48 8 */ > » » » __u8 hint_type; /* 56 1 */ > » » }; /* 48 12 */ > » » __u64 optval; /* 48 8 */ > » » __u8 cmd[0]; /* 48 0 */ > » }; /* 48 16 */ > > » /* size: 64, cachelines: 1, members: 13 */ > };
On 9/25/2024 5:53 PM, Pavel Begunkov wrote: > On 9/25/24 12:09, Kanchan Joshi wrote: >> On 9/25/2024 11:27 AM, Hannes Reinecke wrote: > ... >> As it stands the new struct will introduce >>> a hole of 24 bytes after 'hint_type'. >> >> This gets implicitly padded at this point [1][2], and overall size is >> still capped by largest struct (which is of 16 bytes, placed just above >> this). > > For me it's about having hardly usable in the future by anyone else > 7 bytes of space or how much that will be. Try to add another field > using those bytes and endianess will start messing with you. And 7 > bytes is not that convenient. > > I have same problem with how commands were merged while I was not > looking. There was no explicit padding, and it split u64 into u32 > and implicit padding, so no apps can use the space to put a pointer > anymore while there was a much better option of using one of existing > 4B fields. How would you prefer it. Explicit padding (7 bytes), hint_type as u16 or anything else?
On 9/25/24 14:21, Kanchan Joshi wrote: > On 9/25/2024 5:53 PM, Pavel Begunkov wrote: >> On 9/25/24 12:09, Kanchan Joshi wrote: >>> On 9/25/2024 11:27 AM, Hannes Reinecke wrote: >> ... >>> As it stands the new struct will introduce >>>> a hole of 24 bytes after 'hint_type'. >>> >>> This gets implicitly padded at this point [1][2], and overall size is >>> still capped by largest struct (which is of 16 bytes, placed just above >>> this). >> >> For me it's about having hardly usable in the future by anyone else >> 7 bytes of space or how much that will be. Try to add another field >> using those bytes and endianess will start messing with you. And 7 >> bytes is not that convenient. >> >> I have same problem with how commands were merged while I was not >> looking. There was no explicit padding, and it split u64 into u32 >> and implicit padding, so no apps can use the space to put a pointer >> anymore while there was a much better option of using one of existing >> 4B fields. > > How would you prefer it. Explicit padding (7 bytes), hint_type as u16 or > anything else? Explicit padding is better than the current version. Ideally, I'd like the new fields gone (e.g. if it goes in the direction of per file hints) or prefer to minimise the size and make the leftover padding reusable, but that depends on what the feature needs to be extendable. And what hint types do we expect in the future? Another question, don't we want an apui that allows to pass multiple hints? Quite similar to what I asked about "meta" rw, and it might actually make a lot of sense to combine them into common infra, like what cmsg is for networking. meta[] = [ {INTEGRITY, integrity_params}, {write_hint, ...}, ...]; Even though an actual impl would need to be a bit more elaborated.
diff --git a/fs/fcntl.c b/fs/fcntl.c index 081e5e3d89ea..2eb78035a350 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -334,28 +334,6 @@ static int f_getowner_uids(struct file *filp, unsigned long arg) } #endif -static bool rw_hint_valid(u64 hint) -{ - BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET); - BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE); - BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT); - BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM); - BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG); - BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME); - - switch (hint) { - case RWH_WRITE_LIFE_NOT_SET: - case RWH_WRITE_LIFE_NONE: - case RWH_WRITE_LIFE_SHORT: - case RWH_WRITE_LIFE_MEDIUM: - case RWH_WRITE_LIFE_LONG: - case RWH_WRITE_LIFE_EXTREME: - return true; - default: - return false; - } -} - static long fcntl_get_rw_hint(struct file *file, unsigned int cmd, unsigned long arg) { diff --git a/include/linux/rw_hint.h b/include/linux/rw_hint.h index 309ca72f2dfb..f4373a71ffed 100644 --- a/include/linux/rw_hint.h +++ b/include/linux/rw_hint.h @@ -21,4 +21,28 @@ enum rw_hint { static_assert(sizeof(enum rw_hint) == 1); #endif +#define WRITE_LIFE_INVALID (RWH_WRITE_LIFE_EXTREME + 1) + +static inline bool rw_hint_valid(u64 hint) +{ + BUILD_BUG_ON(WRITE_LIFE_NOT_SET != RWH_WRITE_LIFE_NOT_SET); + BUILD_BUG_ON(WRITE_LIFE_NONE != RWH_WRITE_LIFE_NONE); + BUILD_BUG_ON(WRITE_LIFE_SHORT != RWH_WRITE_LIFE_SHORT); + BUILD_BUG_ON(WRITE_LIFE_MEDIUM != RWH_WRITE_LIFE_MEDIUM); + BUILD_BUG_ON(WRITE_LIFE_LONG != RWH_WRITE_LIFE_LONG); + BUILD_BUG_ON(WRITE_LIFE_EXTREME != RWH_WRITE_LIFE_EXTREME); + + switch (hint) { + case RWH_WRITE_LIFE_NOT_SET: + case RWH_WRITE_LIFE_NONE: + case RWH_WRITE_LIFE_SHORT: + case RWH_WRITE_LIFE_MEDIUM: + case RWH_WRITE_LIFE_LONG: + case RWH_WRITE_LIFE_EXTREME: + return true; + default: + return false; + } +} + #endif /* _LINUX_RW_HINT_H */ diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 1fe79e750470..e21a74dd0c49 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -98,6 +98,11 @@ struct io_uring_sqe { __u64 addr3; __u64 __pad2[1]; }; + struct { + /* To send per-io hint type/value with write command */ + __u64 hint_val; + __u8 hint_type; + }; __u64 optval; /* * If the ring is initialized with IORING_SETUP_SQE128, then @@ -107,6 +112,11 @@ struct io_uring_sqe { }; }; +enum hint_type { + /* this type covers the values supported by F_SET_RW_HINT fcntl */ + TYPE_WRITE_LIFETIME_HINT = 1, +}; + /* * If sqe->file_index is set to this for opcodes that instantiate a new * direct descriptor (like openat/openat2/accept), then io_uring will allocate diff --git a/io_uring/rw.c b/io_uring/rw.c index 510123d3d837..f78ad0ddeef5 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -269,6 +269,20 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, rw->kiocb.ki_ioprio = get_current_ioprio(); } rw->kiocb.dio_complete = NULL; + if (ddir == ITER_SOURCE) { + u8 htype = READ_ONCE(sqe->hint_type); + + rw->kiocb.ki_write_hint = WRITE_LIFE_INVALID; + if (htype) { + u64 hval = READ_ONCE(sqe->hint_val); + + if (htype != TYPE_WRITE_LIFETIME_HINT || + !rw_hint_valid(hval)) + return -EINVAL; + + rw->kiocb.ki_write_hint = hval; + } + } rw->addr = READ_ONCE(sqe->addr); rw->len = READ_ONCE(sqe->len); @@ -1023,7 +1037,12 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(ret)) return ret; req->cqe.res = iov_iter_count(&io->iter); - rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp); + /* + * Use per-file hint only if per-io hint is not set. + * We need per-io hint to get precedence. + */ + if (rw->kiocb.ki_write_hint == WRITE_LIFE_INVALID) + rw->kiocb.ki_write_hint = file_write_hint(rw->kiocb.ki_filp); if (force_nonblock) { /* If the file doesn't support async, just async punt */