Message ID | 20241104140601.12239-7-anuj20.g@samsung.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v7,01/10] block: define set of integrity flags to be inherited by cloned bip | expand |
On Mon, Nov 04, 2024 at 07:35:57PM +0530, Anuj Gupta wrote: > read/write. A new meta_type field is introduced in SQE which indicates > the type of metadata being passed. I still object to this completely pointless and ill-defined field.
On Tue, Nov 5, 2024 at 3:26 PM Christoph Hellwig <hch@lst.de> wrote: > > On Mon, Nov 04, 2024 at 07:35:57PM +0530, Anuj Gupta wrote: > > read/write. A new meta_type field is introduced in SQE which indicates > > the type of metadata being passed. > > I still object to this completely pointless and ill-defined field. The field is used only at io_uring level, and it helps there in using the SQE space flexibly. Overall, while all other pieces are sorted, we are only missing the consensus on io_uring bits. This is also an attempt to gain that. We will have to see in what form Jens/Pavel would like to see this part.
On Tue, Nov 05, 2024 at 06:34:29PM +0530, Anuj gupta wrote: > The field is used only at io_uring level, and it helps there in using the > SQE space flexibly. How so? There is absolutely no documentation for it in either the code or commit log. And if it is about sqe space management, meta_type is about the most confusing possible name as well. So someone please needs to write down how it is supposed to work and come up with a name that remotely makes sense for that.
On 11/5/2024 7:26 PM, Christoph Hellwig wrote: > On Tue, Nov 05, 2024 at 06:34:29PM +0530, Anuj gupta wrote: >> The field is used only at io_uring level, and it helps there in using the >> SQE space flexibly. > > How so? There is absolutely no documentation for it in either the > code or commit log. And if it is about sqe space management, meta_type > is about the most confusing possible name as well. So someone please > needs to write down how it is supposed to work and come up with a name > that remotely makes sense for that. Can add the documentation (if this version is palatable for Jens/Pavel), but this was discussed in previous iteration: 1. Each meta type may have different space requirement in SQE. Only for PI, we need so much space that we can't fit that in first SQE. The SQE128 requirement is only for PI type. Another different meta type may just fit into the first SQE. For that we don't have to mandate SQE128. 2. If two meta types are known not to co-exist, they can be kept in the same place within SQE. Since each meta-type is a flag, we can check what combinations are valid within io_uring and throw the error in case of incompatibility. 3. Previous version was relying on SQE128 flag. If user set the ring that way, it is assumed that PI information was sent. This is more explicitly conveyed now - if user passed META_TYPE_PI flag, it has sent the PI. This comment in the code: + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */ + union { If this flag is not passed, parsing of second SQE is skipped, which is the current behavior as now also one can send regular (non pi) read/write on SQE128 ring.
On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote: > Can add the documentation (if this version is palatable for Jens/Pavel), > but this was discussed in previous iteration: > > 1. Each meta type may have different space requirement in SQE. > > Only for PI, we need so much space that we can't fit that in first SQE. > The SQE128 requirement is only for PI type. > Another different meta type may just fit into the first SQE. For that we > don't have to mandate SQE128. Ok, I'm really confused now. The way I understood Anuj was that this is NOT about block level metadata, but about other uses of the big SQE. Which version is right? Or did I just completely misunderstand Anuj? > 2. If two meta types are known not to co-exist, they can be kept in the > same place within SQE. Since each meta-type is a flag, we can check what > combinations are valid within io_uring and throw the error in case of > incompatibility. And this sounds like what you refer to is not actually block metadata as in this patchset or nvme, (or weirdly enough integrity in the block layer code). > 3. Previous version was relying on SQE128 flag. If user set the ring > that way, it is assumed that PI information was sent. > This is more explicitly conveyed now - if user passed META_TYPE_PI flag, > it has sent the PI. This comment in the code: > > + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */ > + union { > > If this flag is not passed, parsing of second SQE is skipped, which is > the current behavior as now also one can send regular (non pi) > read/write on SQE128 ring. And while I don't understand how this threads in with the previous statements, this makes sense. If you only want to send a pointer (+len) to metadata you can use the normal 64-byte SQE. If you want to send a PI tuple you need SEQ128. Is that what the various above statements try to express? If so the right API to me would be to have two flags: - a flag that a pointer to metadata is passed. This can work with a 64-bit SQE. - another flag that a PI tuple is passed. This requires a 128-byte and also the previous flag. > > > > > ---end quoted text---
On Tue, Nov 05, 2024 at 05:00:51PM +0100, Christoph Hellwig wrote: > On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote: > > Can add the documentation (if this version is palatable for Jens/Pavel), > > but this was discussed in previous iteration: > > > > 1. Each meta type may have different space requirement in SQE. > > > > Only for PI, we need so much space that we can't fit that in first SQE. > > The SQE128 requirement is only for PI type. > > Another different meta type may just fit into the first SQE. For that we > > don't have to mandate SQE128. > > Ok, I'm really confused now. The way I understood Anuj was that this > is NOT about block level metadata, but about other uses of the big SQE. > > Which version is right? Or did I just completely misunderstand Anuj? Let's not call this "meta_type". Can we use something that has a less overloaded meaning, like "sqe_extended_capabilities", or "ecap", or something like that. > > 2. If two meta types are known not to co-exist, they can be kept in the > > same place within SQE. Since each meta-type is a flag, we can check what > > combinations are valid within io_uring and throw the error in case of > > incompatibility. > > And this sounds like what you refer to is not actually block metadata > as in this patchset or nvme, (or weirdly enough integrity in the block > layer code). > > > 3. Previous version was relying on SQE128 flag. If user set the ring > > that way, it is assumed that PI information was sent. > > This is more explicitly conveyed now - if user passed META_TYPE_PI flag, > > it has sent the PI. This comment in the code: > > > > + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */ > > + union { > > > > If this flag is not passed, parsing of second SQE is skipped, which is > > the current behavior as now also one can send regular (non pi) > > read/write on SQE128 ring. > > And while I don't understand how this threads in with the previous > statements, this makes sense. If you only want to send a pointer (+len) > to metadata you can use the normal 64-byte SQE. If you want to send > a PI tuple you need SEQ128. Is that what the various above statements > try to express? If so the right API to me would be to have two flags: > > - a flag that a pointer to metadata is passed. This can work with > a 64-bit SQE. > - another flag that a PI tuple is passed. This requires a 128-byte > and also the previous flag. I don't think anything done so far aligns with what Pavel had in mind. Let me try to lay out what I think he's going for. Just bare with me, this is just a hypothetical example. This patch adds a PI extension. Later, let's say write streams needs another extenion. Then key per-IO wants another extention. Then someone else adds wizbang-awesome-feature extention. Let's say you have device that can do all 4, or any combination of them. Pavel wants a solution that is future proof to such a scenario. So not just a single new "meta_type" with its structure, but a list of types in no particular order, and their structures. That list can exist either in the extended SQE, or in some other user address that the kernel will need copy.
On 11/5/2024 9:30 PM, Christoph Hellwig wrote: > On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote: >> Can add the documentation (if this version is palatable for Jens/Pavel), >> but this was discussed in previous iteration: >> >> 1. Each meta type may have different space requirement in SQE. >> >> Only for PI, we need so much space that we can't fit that in first SQE. >> The SQE128 requirement is only for PI type. >> Another different meta type may just fit into the first SQE. For that we >> don't have to mandate SQE128. > > Ok, I'm really confused now. The way I understood Anuj was that this > is NOT about block level metadata, but about other uses of the big SQE. > > Which version is right? Or did I just completely misunderstand Anuj? We both mean the same. Currently read/write don't [need to] use big SQE as all the information is there in the first SQE. Down the line there may be users fighting for space in SQE. The flag (meta_type) may help a bit when that happens. >> 2. If two meta types are known not to co-exist, they can be kept in the >> same place within SQE. Since each meta-type is a flag, we can check what >> combinations are valid within io_uring and throw the error in case of >> incompatibility. > > And this sounds like what you refer to is not actually block metadata > as in this patchset or nvme, (or weirdly enough integrity in the block > layer code). Right, not about block metadata/pi. But some extra information (different in size/semantics etc.) that user wants to pass into SQE along with read/write. >> 3. Previous version was relying on SQE128 flag. If user set the ring >> that way, it is assumed that PI information was sent. >> This is more explicitly conveyed now - if user passed META_TYPE_PI flag, >> it has sent the PI. This comment in the code: >> >> + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */ >> + union { >> >> If this flag is not passed, parsing of second SQE is skipped, which is >> the current behavior as now also one can send regular (non pi) >> read/write on SQE128 ring. > > And while I don't understand how this threads in with the previous > statements, this makes sense. If you only want to send a pointer (+len) > to metadata you can use the normal 64-byte SQE. If you want to send > a PI tuple you need SEQ128. Is that what the various above statements > try to express? Not exactly. You are talking about pi-type 0 (which only requires meta buffer/len) versus !0 pi-type. We thought about it, but decided to keep asking for SQE128 regardless of that (pi 0 or non-zero). In both cases user will set meta-buffer/len, and other type-specific flags are taken care by the low-level code. This keeps thing simple and at io_uring level we don't have to distinguish that case. What I rather meant in this statement was - one can setup a ring with SQE128 today and send IORING_OP_READ/IORING_OP_WRITE. That goes fine without any processing/error as SQE128 is skipped completely. So relying only on SQE128 flag to detect the presence of PI is a bit fragile.
On 11/5/2024 9:53 PM, Keith Busch wrote: > On Tue, Nov 05, 2024 at 05:00:51PM +0100, Christoph Hellwig wrote: >> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote: >>> Can add the documentation (if this version is palatable for Jens/Pavel), >>> but this was discussed in previous iteration: >>> >>> 1. Each meta type may have different space requirement in SQE. >>> >>> Only for PI, we need so much space that we can't fit that in first SQE. >>> The SQE128 requirement is only for PI type. >>> Another different meta type may just fit into the first SQE. For that we >>> don't have to mandate SQE128. >> >> Ok, I'm really confused now. The way I understood Anuj was that this >> is NOT about block level metadata, but about other uses of the big SQE. >> >> Which version is right? Or did I just completely misunderstand Anuj? > > Let's not call this "meta_type". Can we use something that has a less > overloaded meaning, like "sqe_extended_capabilities", or "ecap", or > something like that. > Right, something like that. We need to change it. Seems a useful thing is not being seen that way because of its name. >>> 2. If two meta types are known not to co-exist, they can be kept in the >>> same place within SQE. Since each meta-type is a flag, we can check what >>> combinations are valid within io_uring and throw the error in case of >>> incompatibility. >> >> And this sounds like what you refer to is not actually block metadata >> as in this patchset or nvme, (or weirdly enough integrity in the block >> layer code). >> >>> 3. Previous version was relying on SQE128 flag. If user set the ring >>> that way, it is assumed that PI information was sent. >>> This is more explicitly conveyed now - if user passed META_TYPE_PI flag, >>> it has sent the PI. This comment in the code: >>> >>> + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */ >>> + union { >>> >>> If this flag is not passed, parsing of second SQE is skipped, which is >>> the current behavior as now also one can send regular (non pi) >>> read/write on SQE128 ring. >> >> And while I don't understand how this threads in with the previous >> statements, this makes sense. If you only want to send a pointer (+len) >> to metadata you can use the normal 64-byte SQE. If you want to send >> a PI tuple you need SEQ128. Is that what the various above statements >> try to express? If so the right API to me would be to have two flags: >> >> - a flag that a pointer to metadata is passed. This can work with >> a 64-bit SQE. >> - another flag that a PI tuple is passed. This requires a 128-byte >> and also the previous flag. > > I don't think anything done so far aligns with what Pavel had in mind. > Let me try to lay out what I think he's going for. Just bare with me, > this is just a hypothetical example. I have the same example in mind. > This patch adds a PI extension. > Later, let's say write streams needs another extenion. > Then key per-IO wants another extention. > Then someone else adds wizbang-awesome-feature extention. > > Let's say you have device that can do all 4, or any combination of them. > Pavel wants a solution that is future proof to such a scenario. So not > just a single new "meta_type" with its structure, but a list of types in > no particular order, and their structures. > > That list can exist either in the extended SQE, or in some other user > address that the kernel will need copy. That list is the meta_type bit-flags this series creates. For some future meta_type there can be "META_TYPE_XYZ_INDIRECT" flag and that will mean extra-information needs to fetched via copy_from_user.
On Tue, Nov 05, 2024 at 09:23:19AM -0700, Keith Busch wrote: > > > The SQE128 requirement is only for PI type. > > > Another different meta type may just fit into the first SQE. For that we > > > don't have to mandate SQE128. > > > > Ok, I'm really confused now. The way I understood Anuj was that this > > is NOT about block level metadata, but about other uses of the big SQE. > > > > Which version is right? Or did I just completely misunderstand Anuj? > > Let's not call this "meta_type". Can we use something that has a less > overloaded meaning, like "sqe_extended_capabilities", or "ecap", or > something like that. So it's just a flag that a 128-byte SQE is used? Don't we know that implicitly from the sq? > > - a flag that a pointer to metadata is passed. This can work with > > a 64-bit SQE. > > - another flag that a PI tuple is passed. This requires a 128-byte > > and also the previous flag. > > I don't think anything done so far aligns with what Pavel had in mind. > Let me try to lay out what I think he's going for. Just bare with me, > this is just a hypothetical example. > > This patch adds a PI extension. > Later, let's say write streams needs another extenion. > Then key per-IO wants another extention. > Then someone else adds wizbang-awesome-feature extention. > > Let's say you have device that can do all 4, or any combination of them. > Pavel wants a solution that is future proof to such a scenario. So not > just a single new "meta_type" with its structure, but a list of types in > no particular order, and their structures. But why do we need the type at all? Each of them obvious needs two things: 1) some space to actually store the extra fields 2) a flag that the additional values are passed any single value is not going to help with supporting arbitrary combinations, because well, you can can mix and match, and you need space for all them even if you are not using all of them.
On Tue, Nov 05, 2024 at 10:08:46PM +0530, Kanchan Joshi wrote: > We both mean the same. Currently read/write don't [need to] use big SQE > as all the information is there in the first SQE. > Down the line there may be users fighting for space in SQE. The flag > (meta_type) may help a bit when that happens. IFF we ever have a fight we need to split command or add an even bigger SQE.` > What I rather meant in this statement was - one can setup a ring with > SQE128 today and send IORING_OP_READ/IORING_OP_WRITE. That goes fine > without any processing/error as SQE128 is skipped completely. So relying > only on SQE128 flag to detect the presence of PI is a bit fragile. Maybe the right answer is to add READ_LARGE/WRITE_LARGE (better names welcome) commands that are defined to the entire 128-byte SQE, and then we have a bitmap of what extra features are supported in it, with descriptive names for each feature. Not trying to have one command for 64 vs 128 byte SQE might also be useful to have a more straight forward layout in general (although I haven't checked for that, just speaking from experience in other protocols).
On 11/6/2024 10:59 AM, Christoph Hellwig wrote: > On Tue, Nov 05, 2024 at 09:23:19AM -0700, Keith Busch wrote: >>>> The SQE128 requirement is only for PI type. >>>> Another different meta type may just fit into the first SQE. For that we >>>> don't have to mandate SQE128. >>> >>> Ok, I'm really confused now. The way I understood Anuj was that this >>> is NOT about block level metadata, but about other uses of the big SQE. >>> >>> Which version is right? Or did I just completely misunderstand Anuj? >> >> Let's not call this "meta_type". Can we use something that has a less >> overloaded meaning, like "sqe_extended_capabilities", or "ecap", or >> something like that. > > So it's just a flag that a 128-byte SQE is used? No, this flag tells that user decided to send PI in SQE. And this flag is kept into first half of SQE (which always exists). This is just additional detail/requirement that PI fields are kept into SQE128 (which is opt in). > Don't we know that > implicitly from the sq? Yes, we have a separate ring-level flag for that. #define IORING_SETUP_SQE128 (1U << 10) /* SQEs are 128 byte */ >>> - a flag that a pointer to metadata is passed. This can work with >>> a 64-bit SQE. >>> - another flag that a PI tuple is passed. This requires a 128-byte >>> and also the previous flag. >> >> I don't think anything done so far aligns with what Pavel had in mind. >> Let me try to lay out what I think he's going for. Just bare with me, >> this is just a hypothetical example. >> >> This patch adds a PI extension. >> Later, let's say write streams needs another extenion. >> Then key per-IO wants another extention. >> Then someone else adds wizbang-awesome-feature extention. >> >> Let's say you have device that can do all 4, or any combination of them. >> Pavel wants a solution that is future proof to such a scenario. So not >> just a single new "meta_type" with its structure, but a list of types in >> no particular order, and their structures. > > But why do we need the type at all? Each of them obvious needs two > things: > > 1) some space to actually store the extra fields > 2) a flag that the additional values are passed Yes, this is exactly how the patch is implemented. 'meta-type' is the flag that tells additional values (representing PI info) are passed. > any single value is not going to help with supporting arbitrary > combinations, Not a single value. It is a u16 field, so it can represent 16 possible flags. This part in the patch: +enum io_uring_sqe_meta_type_bits { + META_TYPE_PI_BIT, + /* not a real meta type; just to make sure that we don't overflow */ + META_TYPE_LAST_BIT, +}; + +/* meta type flags */ +#define META_TYPE_PI (1U << META_TYPE_PI_BIT) For future users, one can add things like META_TYPE_KPIO_BIT or META_TYPE_WRITE_HINT_BIT if they needed to send extra information in SQE. Note that these users may not require SQE128. It all depends on how much of extra information is required. We still have some free space in first SQE. because well, you can can mix and match, and you need > space for all them even if you are not using all of them. mix-and-match can be detected with the above flags. And in case two types don't go well together, that also. And for such types we can reuse the space.
On Wed, Nov 06, 2024 at 11:30:45AM +0530, Kanchan Joshi wrote: > > 1) some space to actually store the extra fields > > 2) a flag that the additional values are passed > > Yes, this is exactly how the patch is implemented. 'meta-type' is the > flag that tells additional values (representing PI info) are passed. > > > any single value is not going to help with supporting arbitrary > > combinations, > > Not a single value. It is a u16 field, so it can represent 16 possible > flags. > This part in the patch: > > +enum io_uring_sqe_meta_type_bits { > + META_TYPE_PI_BIT, > + /* not a real meta type; just to make sure that we don't overflow */ > + META_TYPE_LAST_BIT, > +}; Well, then it's grossly misnamed and underdocumented. For one the meta name simply is wrong because it's about all extra features. Second a type implies an enumeration of types, not a set of flags. So if you actually name this extended_features or similar and clearly document it might actually make sense.
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 024745283783..7f01124bedd5 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -92,6 +92,10 @@ struct io_uring_sqe { __u16 addr_len; __u16 __pad3[1]; }; + struct { + __u16 meta_type; + __u16 __pad4[1]; + }; }; union { struct { @@ -107,6 +111,32 @@ struct io_uring_sqe { }; }; +enum io_uring_sqe_meta_type_bits { + META_TYPE_PI_BIT, + /* not a real meta type; just to make sure that we don't overflow */ + META_TYPE_LAST_BIT, +}; + +/* meta type flags */ +#define META_TYPE_PI (1U << META_TYPE_PI_BIT) + +/* Second half of SQE128 for IORING_OP_READ/WRITE */ +struct io_uring_sqe_ext { + __u64 rsvd0[4]; + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */ + union { + __u64 rsvd1[4]; + struct { + __u16 flags; + __u16 app_tag; + __u32 len; + __u64 addr; + __u64 seed; + __u64 rsvd; + } rw_pi; + }; +}; + /* * If sqe->file_index is set to this for opcodes that instantiate a new * direct descriptor (like openat/openat2/accept), then io_uring will allocate diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 44a772013c09..116c93022985 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3875,7 +3875,9 @@ static int __init io_uring_init(void) BUILD_BUG_SQE_ELEM(44, __s32, splice_fd_in); BUILD_BUG_SQE_ELEM(44, __u32, file_index); BUILD_BUG_SQE_ELEM(44, __u16, addr_len); + BUILD_BUG_SQE_ELEM(44, __u16, meta_type); BUILD_BUG_SQE_ELEM(46, __u16, __pad3[0]); + BUILD_BUG_SQE_ELEM(46, __u16, __pad4[0]); BUILD_BUG_SQE_ELEM(48, __u64, addr3); BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd); BUILD_BUG_SQE_ELEM(56, __u64, __pad2); @@ -3902,6 +3904,12 @@ static int __init io_uring_init(void) /* top 8bits are for internal use */ BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0); + BUILD_BUG_ON(sizeof(struct io_uring_sqe_ext) != + sizeof(struct io_uring_sqe)); + + BUILD_BUG_ON(META_TYPE_LAST_BIT > + 8 * sizeof_field(struct io_uring_sqe, meta_type)); + io_uring_optable_init(); /* diff --git a/io_uring/rw.c b/io_uring/rw.c index 30448f343c7f..eb19b033df24 100644 --- a/io_uring/rw.c +++ b/io_uring/rw.c @@ -257,11 +257,64 @@ static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import) return 0; } +static inline void io_meta_save_state(struct io_async_rw *io) +{ + io->meta_state.seed = io->meta.seed; + iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta); +} + +static inline void io_meta_restore(struct io_async_rw *io) +{ + io->meta.seed = io->meta_state.seed; + iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta); +} + +static inline const void *io_uring_sqe_ext(const struct io_uring_sqe *sqe) +{ + return (sqe + 1); +} + +static int io_prep_rw_pi(struct io_kiocb *req, const struct io_uring_sqe *sqe, + struct io_rw *rw, int ddir) +{ + const struct io_uring_sqe_ext *sqe_ext; + const struct io_issue_def *def; + struct io_async_rw *io; + int ret; + + if (!(req->ctx->flags & IORING_SETUP_SQE128)) + return -EINVAL; + + sqe_ext = io_uring_sqe_ext(sqe); + if (READ_ONCE(sqe_ext->rsvd0[0]) || READ_ONCE(sqe_ext->rsvd0[1]) + || READ_ONCE(sqe_ext->rsvd0[2]) || READ_ONCE(sqe_ext->rsvd0[3])) + return -EINVAL; + if (READ_ONCE(sqe_ext->rw_pi.rsvd)) + return -EINVAL; + + def = &io_issue_defs[req->opcode]; + if (def->vectored) + return -EOPNOTSUPP; + + io = req->async_data; + io->meta.flags = READ_ONCE(sqe_ext->rw_pi.flags); + io->meta.app_tag = READ_ONCE(sqe_ext->rw_pi.app_tag); + io->meta.seed = READ_ONCE(sqe_ext->rw_pi.seed); + ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(sqe_ext->rw_pi.addr)), + READ_ONCE(sqe_ext->rw_pi.len), &io->meta.iter); + if (unlikely(ret < 0)) + return ret; + rw->kiocb.ki_flags |= IOCB_HAS_METADATA; + io_meta_save_state(io); + return ret; +} + static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, int ddir, bool do_import) { struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); unsigned ioprio; + u16 meta_type; int ret; rw->kiocb.ki_pos = READ_ONCE(sqe->off); @@ -279,11 +332,23 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, rw->kiocb.ki_ioprio = get_current_ioprio(); } rw->kiocb.dio_complete = NULL; + rw->kiocb.ki_flags = 0; rw->addr = READ_ONCE(sqe->addr); rw->len = READ_ONCE(sqe->len); rw->flags = READ_ONCE(sqe->rw_flags); - return io_prep_rw_setup(req, ddir, do_import); + ret = io_prep_rw_setup(req, ddir, do_import); + + if (unlikely(ret)) + return ret; + + meta_type = READ_ONCE(sqe->meta_type); + if (meta_type) { + if (READ_ONCE(sqe->__pad4[0]) || !(meta_type & META_TYPE_PI)) + return -EINVAL; + ret = io_prep_rw_pi(req, sqe, rw, ddir); + } + return ret; } int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -409,7 +474,10 @@ static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req) static void io_resubmit_prep(struct io_kiocb *req) { struct io_async_rw *io = req->async_data; + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); + if (rw->kiocb.ki_flags & IOCB_HAS_METADATA) + io_meta_restore(io); iov_iter_restore(&io->iter, &io->iter_state); } @@ -794,7 +862,7 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) if (!(req->flags & REQ_F_FIXED_FILE)) req->flags |= io_file_get_flags(file); - kiocb->ki_flags = file->f_iocb_flags; + kiocb->ki_flags |= file->f_iocb_flags; ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type); if (unlikely(ret)) return ret; @@ -823,6 +891,18 @@ static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type) kiocb->ki_complete = io_complete_rw; } + if (kiocb->ki_flags & IOCB_HAS_METADATA) { + struct io_async_rw *io = req->async_data; + + /* + * We have a union of meta fields with wpq used for buffered-io + * in io_async_rw, so fail it here. + */ + if (!(req->file->f_flags & O_DIRECT)) + return -EOPNOTSUPP; + kiocb->private = &io->meta; + } + return 0; } @@ -897,6 +977,8 @@ static int __io_read(struct io_kiocb *req, unsigned int issue_flags) * manually if we need to. */ iov_iter_restore(&io->iter, &io->iter_state); + if (kiocb->ki_flags & IOCB_HAS_METADATA) + io_meta_restore(io); do { /* @@ -1101,6 +1183,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags) } else { ret_eagain: iov_iter_restore(&io->iter, &io->iter_state); + if (kiocb->ki_flags & IOCB_HAS_METADATA) + io_meta_restore(io); if (kiocb->ki_flags & IOCB_WRITE) io_req_end_write(req); return -EAGAIN; diff --git a/io_uring/rw.h b/io_uring/rw.h index 3f432dc75441..2d7656bd268d 100644 --- a/io_uring/rw.h +++ b/io_uring/rw.h @@ -2,6 +2,11 @@ #include <linux/pagemap.h> +struct io_meta_state { + u32 seed; + struct iov_iter_state iter_meta; +}; + struct io_async_rw { size_t bytes_done; struct iov_iter iter; @@ -9,7 +14,14 @@ struct io_async_rw { struct iovec fast_iov; struct iovec *free_iovec; int free_iov_nr; - struct wait_page_queue wpq; + /* wpq is for buffered io, while meta fields are used with direct io */ + union { + struct wait_page_queue wpq; + struct { + struct uio_meta meta; + struct io_meta_state meta_state; + }; + }; }; int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);