diff mbox series

[v7,06/10] io_uring/rw: add support to send metadata along with read/write

Message ID 20241104140601.12239-7-anuj20.g@samsung.com (mailing list archive)
State New
Headers show
Series [v7,01/10] block: define set of integrity flags to be inherited by cloned bip | expand

Commit Message

Anuj Gupta Nov. 4, 2024, 2:05 p.m. UTC
This patch adds the capability of passing integrity metadata along with
read/write. A new meta_type field is introduced in SQE which indicates
the type of metadata being passed. A new 'struct io_uring_sqe_ext'
represents the secondary SQE space for read/write. The last 32 bytes of
secondary SQE is used to pass following PI related information:

- flags: integrity check flags namely
IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
- len: length of the pi/metadata buffer
- buf: address of the metadata buffer
- seed: seed value for reftag remapping
- app_tag: application defined 16b value

Application sets up a SQE128 ring, prepares PI information within the
second SQE. The patch processes this information to prepare uio_meta
descriptor and passes it down using kiocb->private.

Meta exchange is supported only for direct IO.
Also vectored read/write operations with meta are not supported
currently.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 include/uapi/linux/io_uring.h | 30 ++++++++++++
 io_uring/io_uring.c           |  8 ++++
 io_uring/rw.c                 | 88 ++++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 | 14 +++++-
 4 files changed, 137 insertions(+), 3 deletions(-)

Comments

Christoph Hellwig Nov. 5, 2024, 9:56 a.m. UTC | #1
On Mon, Nov 04, 2024 at 07:35:57PM +0530, Anuj Gupta wrote:
> read/write. A new meta_type field is introduced in SQE which indicates
> the type of metadata being passed.

I still object to this completely pointless and ill-defined field.
Anuj gupta Nov. 5, 2024, 1:04 p.m. UTC | #2
On Tue, Nov 5, 2024 at 3:26 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Mon, Nov 04, 2024 at 07:35:57PM +0530, Anuj Gupta wrote:
> > read/write. A new meta_type field is introduced in SQE which indicates
> > the type of metadata being passed.
>
> I still object to this completely pointless and ill-defined field.

The field is used only at io_uring level, and it helps there in using the
SQE space flexibly.
Overall, while all other pieces are sorted, we are only missing the consensus
on io_uring bits. This is also an attempt to gain that. We will have to see
in what form Jens/Pavel would like to see this part.
Christoph Hellwig Nov. 5, 2024, 1:56 p.m. UTC | #3
On Tue, Nov 05, 2024 at 06:34:29PM +0530, Anuj gupta wrote:
> The field is used only at io_uring level, and it helps there in using the
> SQE space flexibly.

How so?  There is absolutely no documentation for it in either the
code or commit log.  And if it is about sqe space management, meta_type
is about the most confusing possible name as well.  So someone please
needs to write down how it is supposed to work and come up with a name
that remotely makes sense for that.
Kanchan Joshi Nov. 5, 2024, 3:51 p.m. UTC | #4
On 11/5/2024 7:26 PM, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 06:34:29PM +0530, Anuj gupta wrote:
>> The field is used only at io_uring level, and it helps there in using the
>> SQE space flexibly.
> 
> How so?  There is absolutely no documentation for it in either the
> code or commit log.  And if it is about sqe space management, meta_type
> is about the most confusing possible name as well.  So someone please
> needs to write down how it is supposed to work and come up with a name
> that remotely makes sense for that.

Can add the documentation (if this version is palatable for Jens/Pavel), 
but this was discussed in previous iteration:

1. Each meta type may have different space requirement in SQE.

Only for PI, we need so much space that we can't fit that in first SQE. 
The SQE128 requirement is only for PI type.
Another different meta type may just fit into the first SQE. For that we 
don't have to mandate SQE128.

2. If two meta types are known not to co-exist, they can be kept in the 
same place within SQE. Since each meta-type is a flag, we can check what 
combinations are valid within io_uring and throw the error in case of 
incompatibility.

3. Previous version was relying on SQE128 flag. If user set the ring 
that way, it is assumed that PI information was sent.
This is more explicitly conveyed now - if user passed META_TYPE_PI flag, 
it has sent the PI. This comment in the code:

+       /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
+       union {

If this flag is not passed, parsing of second SQE is skipped, which is 
the current behavior as now also one can send regular (non pi) 
read/write on SQE128 ring.
Christoph Hellwig Nov. 5, 2024, 4 p.m. UTC | #5
On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote:
> Can add the documentation (if this version is palatable for Jens/Pavel), 
> but this was discussed in previous iteration:
> 
> 1. Each meta type may have different space requirement in SQE.
> 
> Only for PI, we need so much space that we can't fit that in first SQE. 
> The SQE128 requirement is only for PI type.
> Another different meta type may just fit into the first SQE. For that we 
> don't have to mandate SQE128.

Ok, I'm really confused now.  The way I understood Anuj was that this
is NOT about block level metadata, but about other uses of the big SQE.

Which version is right?  Or did I just completely misunderstand Anuj?

> 2. If two meta types are known not to co-exist, they can be kept in the 
> same place within SQE. Since each meta-type is a flag, we can check what 
> combinations are valid within io_uring and throw the error in case of 
> incompatibility.

And this sounds like what you refer to is not actually block metadata
as in this patchset or nvme, (or weirdly enough integrity in the block
layer code).

> 3. Previous version was relying on SQE128 flag. If user set the ring 
> that way, it is assumed that PI information was sent.
> This is more explicitly conveyed now - if user passed META_TYPE_PI flag, 
> it has sent the PI. This comment in the code:
> 
> +       /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
> +       union {
> 
> If this flag is not passed, parsing of second SQE is skipped, which is 
> the current behavior as now also one can send regular (non pi) 
> read/write on SQE128 ring.

And while I don't understand how this threads in with the previous
statements, this makes sense.  If you only want to send a pointer (+len)
to metadata you can use the normal 64-byte SQE.  If you want to send
a PI tuple you need SEQ128.  Is that what the various above statements
try to express?  If so the right API to me would be to have two flags:

 - a flag that a pointer to metadata is passed.  This can work with
   a 64-bit SQE.
 - another flag that a PI tuple is passed.  This requires a 128-byte
   and also the previous flag.


> 
> 
> 
> 
> 
---end quoted text---
Keith Busch Nov. 5, 2024, 4:23 p.m. UTC | #6
On Tue, Nov 05, 2024 at 05:00:51PM +0100, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote:
> > Can add the documentation (if this version is palatable for Jens/Pavel), 
> > but this was discussed in previous iteration:
> > 
> > 1. Each meta type may have different space requirement in SQE.
> > 
> > Only for PI, we need so much space that we can't fit that in first SQE. 
> > The SQE128 requirement is only for PI type.
> > Another different meta type may just fit into the first SQE. For that we 
> > don't have to mandate SQE128.
> 
> Ok, I'm really confused now.  The way I understood Anuj was that this
> is NOT about block level metadata, but about other uses of the big SQE.
> 
> Which version is right?  Or did I just completely misunderstand Anuj?

Let's not call this "meta_type". Can we use something that has a less
overloaded meaning, like "sqe_extended_capabilities", or "ecap", or
something like that.
 
> > 2. If two meta types are known not to co-exist, they can be kept in the 
> > same place within SQE. Since each meta-type is a flag, we can check what 
> > combinations are valid within io_uring and throw the error in case of 
> > incompatibility.
> 
> And this sounds like what you refer to is not actually block metadata
> as in this patchset or nvme, (or weirdly enough integrity in the block
> layer code).
> 
> > 3. Previous version was relying on SQE128 flag. If user set the ring 
> > that way, it is assumed that PI information was sent.
> > This is more explicitly conveyed now - if user passed META_TYPE_PI flag, 
> > it has sent the PI. This comment in the code:
> > 
> > +       /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
> > +       union {
> > 
> > If this flag is not passed, parsing of second SQE is skipped, which is 
> > the current behavior as now also one can send regular (non pi) 
> > read/write on SQE128 ring.
> 
> And while I don't understand how this threads in with the previous
> statements, this makes sense.  If you only want to send a pointer (+len)
> to metadata you can use the normal 64-byte SQE.  If you want to send
> a PI tuple you need SEQ128.  Is that what the various above statements
> try to express?  If so the right API to me would be to have two flags:
> 
>  - a flag that a pointer to metadata is passed.  This can work with
>    a 64-bit SQE.
>  - another flag that a PI tuple is passed.  This requires a 128-byte
>    and also the previous flag.

I don't think anything done so far aligns with what Pavel had in mind.
Let me try to lay out what I think he's going for. Just bare with me,
this is just a hypothetical example.

  This patch adds a PI extension.
  Later, let's say write streams needs another extenion.
  Then key per-IO wants another extention.
  Then someone else adds wizbang-awesome-feature extention.

Let's say you have device that can do all 4, or any combination of them.
Pavel wants a solution that is future proof to such a scenario. So not
just a single new "meta_type" with its structure, but a list of types in
no particular order, and their structures.

That list can exist either in the extended SQE, or in some other user
address that the kernel will need copy.
Kanchan Joshi Nov. 5, 2024, 4:38 p.m. UTC | #7
On 11/5/2024 9:30 PM, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote:
>> Can add the documentation (if this version is palatable for Jens/Pavel),
>> but this was discussed in previous iteration:
>>
>> 1. Each meta type may have different space requirement in SQE.
>>
>> Only for PI, we need so much space that we can't fit that in first SQE.
>> The SQE128 requirement is only for PI type.
>> Another different meta type may just fit into the first SQE. For that we
>> don't have to mandate SQE128.
> 
> Ok, I'm really confused now.  The way I understood Anuj was that this
> is NOT about block level metadata, but about other uses of the big SQE.
> 
> Which version is right?  Or did I just completely misunderstand Anuj?

We both mean the same. Currently read/write don't [need to] use big SQE 
as all the information is there in the first SQE.
Down the line there may be users fighting for space in SQE. The flag 
(meta_type) may help a bit when that happens.

>> 2. If two meta types are known not to co-exist, they can be kept in the
>> same place within SQE. Since each meta-type is a flag, we can check what
>> combinations are valid within io_uring and throw the error in case of
>> incompatibility.
> 
> And this sounds like what you refer to is not actually block metadata
> as in this patchset or nvme, (or weirdly enough integrity in the block
> layer code).

Right, not about block metadata/pi. But some extra information 
(different in size/semantics etc.) that user wants to pass into SQE 
along with read/write.

>> 3. Previous version was relying on SQE128 flag. If user set the ring
>> that way, it is assumed that PI information was sent.
>> This is more explicitly conveyed now - if user passed META_TYPE_PI flag,
>> it has sent the PI. This comment in the code:
>>
>> +       /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
>> +       union {
>>
>> If this flag is not passed, parsing of second SQE is skipped, which is
>> the current behavior as now also one can send regular (non pi)
>> read/write on SQE128 ring.
> 
> And while I don't understand how this threads in with the previous
> statements, this makes sense.  If you only want to send a pointer (+len)
> to metadata you can use the normal 64-byte SQE.  If you want to send
> a PI tuple you need SEQ128.  Is that what the various above statements
> try to express? 

Not exactly. You are talking about pi-type 0 (which only requires meta 
buffer/len) versus !0 pi-type. We thought about it, but decided to keep 
asking for SQE128 regardless of that (pi 0 or non-zero). In both cases 
user will set meta-buffer/len, and other type-specific flags are taken 
care by the low-level code. This keeps thing simple and at io_uring 
level we don't have to distinguish that case.

What I rather meant in this statement was - one can setup a ring with 
SQE128 today and send IORING_OP_READ/IORING_OP_WRITE. That goes fine 
without any processing/error as SQE128 is skipped completely. So relying 
only on SQE128 flag to detect the presence of PI is a bit fragile.
Kanchan Joshi Nov. 5, 2024, 4:50 p.m. UTC | #8
On 11/5/2024 9:53 PM, Keith Busch wrote:
> On Tue, Nov 05, 2024 at 05:00:51PM +0100, Christoph Hellwig wrote:
>> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote:
>>> Can add the documentation (if this version is palatable for Jens/Pavel),
>>> but this was discussed in previous iteration:
>>>
>>> 1. Each meta type may have different space requirement in SQE.
>>>
>>> Only for PI, we need so much space that we can't fit that in first SQE.
>>> The SQE128 requirement is only for PI type.
>>> Another different meta type may just fit into the first SQE. For that we
>>> don't have to mandate SQE128.
>>
>> Ok, I'm really confused now.  The way I understood Anuj was that this
>> is NOT about block level metadata, but about other uses of the big SQE.
>>
>> Which version is right?  Or did I just completely misunderstand Anuj?
> 
> Let's not call this "meta_type". Can we use something that has a less
> overloaded meaning, like "sqe_extended_capabilities", or "ecap", or
> something like that.
>   

Right, something like that. We need to change it.
Seems a useful thing is not being seen that way because of its name.

>>> 2. If two meta types are known not to co-exist, they can be kept in the
>>> same place within SQE. Since each meta-type is a flag, we can check what
>>> combinations are valid within io_uring and throw the error in case of
>>> incompatibility.
>>
>> And this sounds like what you refer to is not actually block metadata
>> as in this patchset or nvme, (or weirdly enough integrity in the block
>> layer code).
>>
>>> 3. Previous version was relying on SQE128 flag. If user set the ring
>>> that way, it is assumed that PI information was sent.
>>> This is more explicitly conveyed now - if user passed META_TYPE_PI flag,
>>> it has sent the PI. This comment in the code:
>>>
>>> +       /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
>>> +       union {
>>>
>>> If this flag is not passed, parsing of second SQE is skipped, which is
>>> the current behavior as now also one can send regular (non pi)
>>> read/write on SQE128 ring.
>>
>> And while I don't understand how this threads in with the previous
>> statements, this makes sense.  If you only want to send a pointer (+len)
>> to metadata you can use the normal 64-byte SQE.  If you want to send
>> a PI tuple you need SEQ128.  Is that what the various above statements
>> try to express?  If so the right API to me would be to have two flags:
>>
>>   - a flag that a pointer to metadata is passed.  This can work with
>>     a 64-bit SQE.
>>   - another flag that a PI tuple is passed.  This requires a 128-byte
>>     and also the previous flag.
> 
> I don't think anything done so far aligns with what Pavel had in mind.
> Let me try to lay out what I think he's going for. Just bare with me,
> this is just a hypothetical example.

I have the same example in mind.


>    This patch adds a PI extension.
>    Later, let's say write streams needs another extenion.
>    Then key per-IO wants another extention.
>    Then someone else adds wizbang-awesome-feature extention.
> 
> Let's say you have device that can do all 4, or any combination of them.
> Pavel wants a solution that is future proof to such a scenario. So not
> just a single new "meta_type" with its structure, but a list of types in
> no particular order, and their structures.
> 
> That list can exist either in the extended SQE, or in some other user
> address that the kernel will need copy.

That list is the meta_type bit-flags this series creates.

For some future meta_type there can be "META_TYPE_XYZ_INDIRECT" flag and 
that will mean extra-information needs to fetched via copy_from_user.
Christoph Hellwig Nov. 6, 2024, 5:29 a.m. UTC | #9
On Tue, Nov 05, 2024 at 09:23:19AM -0700, Keith Busch wrote:
> > > The SQE128 requirement is only for PI type.
> > > Another different meta type may just fit into the first SQE. For that we 
> > > don't have to mandate SQE128.
> > 
> > Ok, I'm really confused now.  The way I understood Anuj was that this
> > is NOT about block level metadata, but about other uses of the big SQE.
> > 
> > Which version is right?  Or did I just completely misunderstand Anuj?
> 
> Let's not call this "meta_type". Can we use something that has a less
> overloaded meaning, like "sqe_extended_capabilities", or "ecap", or
> something like that.

So it's just a flag that a 128-byte SQE is used?  Don't we know that
implicitly from the sq?

> >  - a flag that a pointer to metadata is passed.  This can work with
> >    a 64-bit SQE.
> >  - another flag that a PI tuple is passed.  This requires a 128-byte
> >    and also the previous flag.
> 
> I don't think anything done so far aligns with what Pavel had in mind.
> Let me try to lay out what I think he's going for. Just bare with me,
> this is just a hypothetical example.
> 
>   This patch adds a PI extension.
>   Later, let's say write streams needs another extenion.
>   Then key per-IO wants another extention.
>   Then someone else adds wizbang-awesome-feature extention.
> 
> Let's say you have device that can do all 4, or any combination of them.
> Pavel wants a solution that is future proof to such a scenario. So not
> just a single new "meta_type" with its structure, but a list of types in
> no particular order, and their structures.

But why do we need the type at all?  Each of them obvious needs two
things:

 1) some space to actually store the extra fields
 2) a flag that the additional values are passed

any single value is not going to help with supporting arbitrary
combinations, because well, you can can mix and match, and you need
space for all them even if you are not using all of them.
Christoph Hellwig Nov. 6, 2024, 5:33 a.m. UTC | #10
On Tue, Nov 05, 2024 at 10:08:46PM +0530, Kanchan Joshi wrote:
> We both mean the same. Currently read/write don't [need to] use big SQE 
> as all the information is there in the first SQE.
> Down the line there may be users fighting for space in SQE. The flag 
> (meta_type) may help a bit when that happens.

IFF we ever have a fight we need to split command or add an even bigger
SQE.`

> What I rather meant in this statement was - one can setup a ring with 
> SQE128 today and send IORING_OP_READ/IORING_OP_WRITE. That goes fine 
> without any processing/error as SQE128 is skipped completely. So relying 
> only on SQE128 flag to detect the presence of PI is a bit fragile.

Maybe the right answer is to add

READ_LARGE/WRITE_LARGE (better names welcome) commands that are defined
to the entire 128-byte SQE, and then we have a bitmap of what extra
features are supported in it, with descriptive names for each feature.
Not trying to have one command for 64 vs 128 byte SQE might also be
useful to have a more straight forward layout in general (although I
haven't checked for that, just speaking from experience in other
protocols).
Kanchan Joshi Nov. 6, 2024, 6 a.m. UTC | #11
On 11/6/2024 10:59 AM, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 09:23:19AM -0700, Keith Busch wrote:
>>>> The SQE128 requirement is only for PI type.
>>>> Another different meta type may just fit into the first SQE. For that we
>>>> don't have to mandate SQE128.
>>>
>>> Ok, I'm really confused now.  The way I understood Anuj was that this
>>> is NOT about block level metadata, but about other uses of the big SQE.
>>>
>>> Which version is right?  Or did I just completely misunderstand Anuj?
>>
>> Let's not call this "meta_type". Can we use something that has a less
>> overloaded meaning, like "sqe_extended_capabilities", or "ecap", or
>> something like that.
> 
> So it's just a flag that a 128-byte SQE is used?

No, this flag tells that user decided to send PI in SQE. And this flag 
is kept into first half of SQE (which always exists). This is just 
additional detail/requirement that PI fields are kept into SQE128 (which 
is opt in).

>  Don't we know that
> implicitly from the sq?

Yes, we have a separate ring-level flag for that.

#define IORING_SETUP_SQE128             (1U << 10) /* SQEs are 128 byte */

>>>   - a flag that a pointer to metadata is passed.  This can work with
>>>     a 64-bit SQE.
>>>   - another flag that a PI tuple is passed.  This requires a 128-byte
>>>     and also the previous flag.
>>
>> I don't think anything done so far aligns with what Pavel had in mind.
>> Let me try to lay out what I think he's going for. Just bare with me,
>> this is just a hypothetical example.
>>
>>    This patch adds a PI extension.
>>    Later, let's say write streams needs another extenion.
>>    Then key per-IO wants another extention.
>>    Then someone else adds wizbang-awesome-feature extention.
>>
>> Let's say you have device that can do all 4, or any combination of them.
>> Pavel wants a solution that is future proof to such a scenario. So not
>> just a single new "meta_type" with its structure, but a list of types in
>> no particular order, and their structures.
> 
> But why do we need the type at all?  Each of them obvious needs two
> things:
> 
>   1) some space to actually store the extra fields
>   2) a flag that the additional values are passed

Yes, this is exactly how the patch is implemented. 'meta-type' is the 
flag that tells additional values (representing PI info) are passed.

> any single value is not going to help with supporting arbitrary
> combinations,

Not a single value. It is a u16 field, so it can represent 16 possible 
flags.
This part in the patch:

+enum io_uring_sqe_meta_type_bits {
+       META_TYPE_PI_BIT,
+       /* not a real meta type; just to make sure that we don't overflow */
+       META_TYPE_LAST_BIT,
+};
+
+/* meta type flags */
+#define META_TYPE_PI   (1U << META_TYPE_PI_BIT)

For future users, one can add things like META_TYPE_KPIO_BIT or 
META_TYPE_WRITE_HINT_BIT if they needed to send extra information in SQE.

Note that these users may not require SQE128. It all depends on how much 
of extra information is required. We still have some free space in first 
SQE.

  because well, you can can mix and match, and you need
> space for all them even if you are not using all of them.

mix-and-match can be detected with the above flags.
And in case two types don't go well together, that also. And for such 
types we can reuse the space.
Christoph Hellwig Nov. 6, 2024, 6:12 a.m. UTC | #12
On Wed, Nov 06, 2024 at 11:30:45AM +0530, Kanchan Joshi wrote:
> >   1) some space to actually store the extra fields
> >   2) a flag that the additional values are passed
> 
> Yes, this is exactly how the patch is implemented. 'meta-type' is the 
> flag that tells additional values (representing PI info) are passed.
> 
> > any single value is not going to help with supporting arbitrary
> > combinations,
> 
> Not a single value. It is a u16 field, so it can represent 16 possible 
> flags.
> This part in the patch:
> 
> +enum io_uring_sqe_meta_type_bits {
> +       META_TYPE_PI_BIT,
> +       /* not a real meta type; just to make sure that we don't overflow */
> +       META_TYPE_LAST_BIT,
> +};

Well, then it's grossly misnamed and underdocumented.  For one the
meta name simply is wrong because it's about all extra features.
Second a type implies an enumeration of types, not a set of flags.

So if you actually name this extended_features or similar and clearly
document it might actually make sense.
diff mbox series

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024745283783..7f01124bedd5 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -92,6 +92,10 @@  struct io_uring_sqe {
 			__u16	addr_len;
 			__u16	__pad3[1];
 		};
+		struct {
+			__u16	meta_type;
+			__u16	__pad4[1];
+		};
 	};
 	union {
 		struct {
@@ -107,6 +111,32 @@  struct io_uring_sqe {
 	};
 };
 
+enum io_uring_sqe_meta_type_bits {
+	META_TYPE_PI_BIT,
+	/* not a real meta type; just to make sure that we don't overflow */
+	META_TYPE_LAST_BIT,
+};
+
+/* meta type flags */
+#define META_TYPE_PI	(1U << META_TYPE_PI_BIT)
+
+/* Second half of SQE128 for IORING_OP_READ/WRITE */
+struct io_uring_sqe_ext {
+	__u64	rsvd0[4];
+	/* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
+	union {
+		__u64	rsvd1[4];
+		struct {
+			__u16	flags;
+			__u16	app_tag;
+			__u32	len;
+			__u64	addr;
+			__u64	seed;
+			__u64	rsvd;
+		} rw_pi;
+	};
+};
+
 /*
  * If sqe->file_index is set to this for opcodes that instantiate a new
  * direct descriptor (like openat/openat2/accept), then io_uring will allocate
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 44a772013c09..116c93022985 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3875,7 +3875,9 @@  static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(44, __s32,  splice_fd_in);
 	BUILD_BUG_SQE_ELEM(44, __u32,  file_index);
 	BUILD_BUG_SQE_ELEM(44, __u16,  addr_len);
+	BUILD_BUG_SQE_ELEM(44, __u16,  meta_type);
 	BUILD_BUG_SQE_ELEM(46, __u16,  __pad3[0]);
+	BUILD_BUG_SQE_ELEM(46, __u16,  __pad4[0]);
 	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
 	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
@@ -3902,6 +3904,12 @@  static int __init io_uring_init(void)
 	/* top 8bits are for internal use */
 	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
 
+	BUILD_BUG_ON(sizeof(struct io_uring_sqe_ext) !=
+		     sizeof(struct io_uring_sqe));
+
+	BUILD_BUG_ON(META_TYPE_LAST_BIT >
+		     8 * sizeof_field(struct io_uring_sqe, meta_type));
+
 	io_uring_optable_init();
 
 	/*
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 30448f343c7f..eb19b033df24 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -257,11 +257,64 @@  static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import)
 	return 0;
 }
 
+static inline void io_meta_save_state(struct io_async_rw *io)
+{
+	io->meta_state.seed = io->meta.seed;
+	iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static inline void io_meta_restore(struct io_async_rw *io)
+{
+	io->meta.seed = io->meta_state.seed;
+	iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static inline const void *io_uring_sqe_ext(const struct io_uring_sqe *sqe)
+{
+	return (sqe + 1);
+}
+
+static int io_prep_rw_pi(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			   struct io_rw *rw, int ddir)
+{
+	const struct io_uring_sqe_ext *sqe_ext;
+	const struct io_issue_def *def;
+	struct io_async_rw *io;
+	int ret;
+
+	if (!(req->ctx->flags & IORING_SETUP_SQE128))
+		return -EINVAL;
+
+	sqe_ext = io_uring_sqe_ext(sqe);
+	if (READ_ONCE(sqe_ext->rsvd0[0]) || READ_ONCE(sqe_ext->rsvd0[1])
+	    || READ_ONCE(sqe_ext->rsvd0[2]) || READ_ONCE(sqe_ext->rsvd0[3]))
+		return -EINVAL;
+	if (READ_ONCE(sqe_ext->rw_pi.rsvd))
+		return -EINVAL;
+
+	def = &io_issue_defs[req->opcode];
+	if (def->vectored)
+		return -EOPNOTSUPP;
+
+	io = req->async_data;
+	io->meta.flags = READ_ONCE(sqe_ext->rw_pi.flags);
+	io->meta.app_tag = READ_ONCE(sqe_ext->rw_pi.app_tag);
+	io->meta.seed = READ_ONCE(sqe_ext->rw_pi.seed);
+	ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(sqe_ext->rw_pi.addr)),
+			  READ_ONCE(sqe_ext->rw_pi.len), &io->meta.iter);
+	if (unlikely(ret < 0))
+		return ret;
+	rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
+	io_meta_save_state(io);
+	return ret;
+}
+
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      int ddir, bool do_import)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 	unsigned ioprio;
+	u16 meta_type;
 	int ret;
 
 	rw->kiocb.ki_pos = READ_ONCE(sqe->off);
@@ -279,11 +332,23 @@  static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	rw->kiocb.ki_flags = 0;
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
-	return io_prep_rw_setup(req, ddir, do_import);
+	ret = io_prep_rw_setup(req, ddir, do_import);
+
+	if (unlikely(ret))
+		return ret;
+
+	meta_type = READ_ONCE(sqe->meta_type);
+	if (meta_type) {
+		if (READ_ONCE(sqe->__pad4[0]) || !(meta_type & META_TYPE_PI))
+			return -EINVAL;
+		ret = io_prep_rw_pi(req, sqe, rw, ddir);
+	}
+	return ret;
 }
 
 int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -409,7 +474,10 @@  static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
 static void io_resubmit_prep(struct io_kiocb *req)
 {
 	struct io_async_rw *io = req->async_data;
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
+	if (rw->kiocb.ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 	iov_iter_restore(&io->iter, &io->iter_state);
 }
 
@@ -794,7 +862,7 @@  static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 	if (!(req->flags & REQ_F_FIXED_FILE))
 		req->flags |= io_file_get_flags(file);
 
-	kiocb->ki_flags = file->f_iocb_flags;
+	kiocb->ki_flags |= file->f_iocb_flags;
 	ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type);
 	if (unlikely(ret))
 		return ret;
@@ -823,6 +891,18 @@  static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_complete = io_complete_rw;
 	}
 
+	if (kiocb->ki_flags & IOCB_HAS_METADATA) {
+		struct io_async_rw *io = req->async_data;
+
+		/*
+		 * We have a union of meta fields with wpq used for buffered-io
+		 * in io_async_rw, so fail it here.
+		 */
+		if (!(req->file->f_flags & O_DIRECT))
+			return -EOPNOTSUPP;
+		kiocb->private = &io->meta;
+	}
+
 	return 0;
 }
 
@@ -897,6 +977,8 @@  static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	 * manually if we need to.
 	 */
 	iov_iter_restore(&io->iter, &io->iter_state);
+	if (kiocb->ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 
 	do {
 		/*
@@ -1101,6 +1183,8 @@  int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	} else {
 ret_eagain:
 		iov_iter_restore(&io->iter, &io->iter_state);
+		if (kiocb->ki_flags & IOCB_HAS_METADATA)
+			io_meta_restore(io);
 		if (kiocb->ki_flags & IOCB_WRITE)
 			io_req_end_write(req);
 		return -EAGAIN;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..2d7656bd268d 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -2,6 +2,11 @@ 
 
 #include <linux/pagemap.h>
 
+struct io_meta_state {
+	u32			seed;
+	struct iov_iter_state	iter_meta;
+};
+
 struct io_async_rw {
 	size_t				bytes_done;
 	struct iov_iter			iter;
@@ -9,7 +14,14 @@  struct io_async_rw {
 	struct iovec			fast_iov;
 	struct iovec			*free_iovec;
 	int				free_iov_nr;
-	struct wait_page_queue		wpq;
+	/* wpq is for buffered io, while meta fields are used with direct io */
+	union {
+		struct wait_page_queue		wpq;
+		struct {
+			struct uio_meta			meta;
+			struct io_meta_state		meta_state;
+		};
+	};
 };
 
 int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);