
[v6,06/10] io_uring/rw: add support to send metadata along with read/write

Message ID: 20241030180112.4635-7-joshi.k@samsung.com
State: New
Series: [v6,01/10] block: define set of integrity flags to be inherited by cloned bip

Commit Message

Kanchan Joshi Oct. 30, 2024, 6:01 p.m. UTC
From: Anuj Gupta <anuj20.g@samsung.com>

This patch adds support for passing integrity metadata along with
read/write operations.

Introduce a new 'struct io_uring_meta_pi' that contains the following:
- pi_flags: integrity check flags, namely
IO_INTEGRITY_CHK_{GUARD,APPTAG,REFTAG}
- len: length of the pi/metadata buffer
- addr: address of the metadata buffer
- seed: seed value for reftag remapping
- app_tag: application-defined 16-bit value

The application sets up an SQE128 ring and prepares 'struct
io_uring_meta_pi' within the second half of the SQE. The kernel
processes this information to build a uio_meta descriptor and passes
it down using kiocb->private.

Metadata exchange is supported only for direct IO. Vectored read/write
operations with metadata are not currently supported.

Signed-off-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
---
 include/uapi/linux/io_uring.h | 16 ++++++++
 io_uring/io_uring.c           |  4 ++
 io_uring/rw.c                 | 71 ++++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 | 14 ++++++-
 4 files changed, 102 insertions(+), 3 deletions(-)
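
For illustration, a minimal userspace sketch of the flow described in
the commit message, assuming liburing and the uapi definitions added by
this series (IO_INTEGRITY_CHK_GUARD comes from patch 01; fd, buf,
md_buf and md_len are placeholder names, and error handling is
trimmed):

	#include <liburing.h>
	#include <stdint.h>
	#include <string.h>

	static int read_with_pi(int fd, void *buf, unsigned int len,
				void *md_buf, unsigned int md_len)
	{
		struct io_uring ring;
		struct io_uring_sqe *sqe;
		/* PI descriptor that goes into the second half of the SQE */
		struct io_uring_meta_pi pi = {
			.pi_flags = IO_INTEGRITY_CHK_GUARD,
			.len      = md_len,
			.addr     = (__u64)(uintptr_t)md_buf,
		};
		int ret;

		/* SQE128 ring: each SQE is 128 bytes, the extra 64 carry the meta */
		ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE128);
		if (ret)
			return ret;

		sqe = io_uring_get_sqe(&ring);	/* can't fail on a fresh ring */
		io_uring_prep_read(sqe, fd, buf, len, 0); /* fd opened with O_DIRECT */

		/* copy the descriptor into the extended 64 bytes of the SQE */
		memcpy((void *)(sqe + 1), &pi, sizeof(pi));

		return io_uring_submit(&ring);
	}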

Comments

Keith Busch Oct. 30, 2024, 9:09 p.m. UTC | #1
On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 024745283783..48dcca125db3 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>  		 */
>  		__u8	cmd[0];
>  	};
> +	/*
> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
> +	 * this field is starting offset for 64 bytes of data. For meta io
> +	 * this contains 'struct io_uring_meta_pi'
> +	 */
> +	__u8	big_sqe[0];
> +};
> +
> +/* this is placed in SQE128 */
> +struct io_uring_meta_pi {
> +	__u16		pi_flags;
> +	__u16		app_tag;
> +	__u32		len;
> +	__u64		addr;
> +	__u64		seed;
> +	__u64		rsvd[2];
>  };

On the previous version, I was mostly questioning whether this aligns
with what Pavel was trying to do. I didn't quite get it, so I was more
confused than insisting it should be this way.

But I personally think this path makes sense. I would just set it up a
little differently for extended SQEs, so that the PI fields overlay a
more generic struct that other opcodes might be able to use later.
Something like:

struct io_uring_sqe_ext {
	union {
		__u32	rsvd0[8];
		struct {
			__u16		pi_flags;
			__u16		app_tag;
			__u32		len;
			__u64		addr;
			__u64		seed;
		} rw_pi;
	};
	__u32	rsvd1[8];
};
  
> @@ -3902,6 +3903,9 @@ static int __init io_uring_init(void)
>  	/* top 8bits are for internal use */
>  	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
>  
> +	BUILD_BUG_ON(sizeof(struct io_uring_meta_pi) >
> +		     sizeof(struct io_uring_sqe));

Then this check would become:

	BUILD_BUG_ON(sizeof(struct io_uring_sqe_ext) != sizeof(struct io_uring_sqe));
Christoph Hellwig Oct. 31, 2024, 6:55 a.m. UTC | #2
Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>
Pavel Begunkov Oct. 31, 2024, 2:39 p.m. UTC | #3
On 10/30/24 21:09, Keith Busch wrote:
> On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 024745283783..48dcca125db3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -105,6 +105,22 @@ struct io_uring_sqe {
>>   		 */
>>   		__u8	cmd[0];
>>   	};
>> +	/*
>> +	 * If the ring is initialized with IORING_SETUP_SQE128, then
>> +	 * this field is starting offset for 64 bytes of data. For meta io
>> +	 * this contains 'struct io_uring_meta_pi'
>> +	 */
>> +	__u8	big_sqe[0];
>> +};

I don't think zero-sized arrays are good as a uAPI, regardless of the
cmd[0] above; let's just do

sqe = get_sqe();
big_sqe = (void *)(sqe + 1)

with an appropriate helper.
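For illustration, such a helper might look like this sketch (the name
is made up and it is not part of the patch):

	static inline void *io_uring_sqe_ext(struct io_uring_sqe *sqe)
	{
		/* with IORING_SETUP_SQE128, the extra 64 bytes follow the SQE */
		return sqe + 1;
	}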

>> +
>> +/* this is placed in SQE128 */
>> +struct io_uring_meta_pi {
>> +	__u16		pi_flags;
>> +	__u16		app_tag;
>> +	__u32		len;
>> +	__u64		addr;
>> +	__u64		seed;
>> +	__u64		rsvd[2];
>>   };
> 
> On the previous version, I was more questioning if it aligns with what

I missed that discussion, let me know if I need to look it up

> Pavel was trying to do here. I didn't quite get it, so I was more
> confused than saying it should be this way now.

The point is, SQEs don't have nearly enough space to accommodate all
such optional features, especially when one takes this much space and
is not applicable to all reads but rather to specific use cases and
files. Consider that there may be more similar extensions, and we might
even want to use them together.

1. SQE128 makes every SQE big; intermixing with requests that don't
need the additional space wastes space. SQE128 is fine to use, but we
should be mindful of it and try to avoid enabling it when feasible.

2. This API hard-codes io_uring_meta_pi into the extended part of the
SQE. If we want to add another feature, it would need to go after the
meta struct. SQE256? And what if the user doesn't need PI but only the
second feature?

In short, the uAPI needs a clear vision of how it can be used with,
and extended to, multiple optional features, not just PI.

One option I mentioned before is passing a user pointer to an array of
structures, each of which would have a type specifying what kind of
feature / meta information it carries, e.g. META_TYPE_PI (a rough
sketch follows below). It's not a complete solution, but a base idea to
extend upon. As I mentioned separately, if copy_from_user is expensive
we can optimise it by pre-registering memory. I think Jens even tried
something similar with the structures we pass as waiting parameters.

I didn't read through all iterations of the series, so if some other
approach is described that ticks the boxes and is flexible enough, I'd
be absolutely fine with it.

Patch

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 024745283783..48dcca125db3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -105,6 +105,22 @@  struct io_uring_sqe {
 		 */
 		__u8	cmd[0];
 	};
+	/*
+	 * If the ring is initialized with IORING_SETUP_SQE128, then
+	 * this field is starting offset for 64 bytes of data. For meta io
+	 * this contains 'struct io_uring_meta_pi'
+	 */
+	__u8	big_sqe[0];
+};
+
+/* this is placed in SQE128 */
+struct io_uring_meta_pi {
+	__u16		pi_flags;
+	__u16		app_tag;
+	__u32		len;
+	__u64		addr;
+	__u64		seed;
+	__u64		rsvd[2];
 };
 
 /*
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 44a772013c09..c5fd74e42c04 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3879,6 +3879,7 @@  static int __init io_uring_init(void)
 	BUILD_BUG_SQE_ELEM(48, __u64,  addr3);
 	BUILD_BUG_SQE_ELEM_SIZE(48, 0, cmd);
 	BUILD_BUG_SQE_ELEM(56, __u64,  __pad2);
+	BUILD_BUG_SQE_ELEM_SIZE(64, 0, big_sqe);
 
 	BUILD_BUG_ON(sizeof(struct io_uring_files_update) !=
 		     sizeof(struct io_uring_rsrc_update));
@@ -3902,6 +3903,9 @@  static int __init io_uring_init(void)
 	/* top 8bits are for internal use */
 	BUILD_BUG_ON((IORING_URING_CMD_MASK & 0xff000000) != 0);
 
+	BUILD_BUG_ON(sizeof(struct io_uring_meta_pi) >
+		     sizeof(struct io_uring_sqe));
+
 	io_uring_optable_init();
 
 	/*
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 30448f343c7f..cbb74fcfd0d1 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -257,6 +257,46 @@  static int io_prep_rw_setup(struct io_kiocb *req, int ddir, bool do_import)
 	return 0;
 }
 
+static inline void io_meta_save_state(struct io_async_rw *io)
+{
+	io->meta_state.seed = io->meta.seed;
+	iov_iter_save_state(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static inline void io_meta_restore(struct io_async_rw *io)
+{
+	io->meta.seed = io->meta_state.seed;
+	iov_iter_restore(&io->meta.iter, &io->meta_state.iter_meta);
+}
+
+static int io_prep_rw_meta(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+			   struct io_rw *rw, int ddir)
+{
+	const struct io_uring_meta_pi *md = (struct io_uring_meta_pi *)sqe->big_sqe;
+	const struct io_issue_def *def;
+	struct io_async_rw *io;
+	int ret;
+
+	if (READ_ONCE(md->rsvd[0]) || READ_ONCE(md->rsvd[1]))
+		return -EINVAL;
+
+	def = &io_issue_defs[req->opcode];
+	if (def->vectored)
+		return -EOPNOTSUPP;
+
+	io = req->async_data;
+	io->meta.flags = READ_ONCE(md->pi_flags);
+	io->meta.app_tag = READ_ONCE(md->app_tag);
+	io->meta.seed = READ_ONCE(md->seed);
+	ret = import_ubuf(ddir, u64_to_user_ptr(READ_ONCE(md->addr)),
+			  READ_ONCE(md->len), &io->meta.iter);
+	if (unlikely(ret < 0))
+		return ret;
+	rw->kiocb.ki_flags |= IOCB_HAS_METADATA;
+	io_meta_save_state(io);
+	return ret;
+}
+
 static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		      int ddir, bool do_import)
 {
@@ -279,11 +319,19 @@  static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 		rw->kiocb.ki_ioprio = get_current_ioprio();
 	}
 	rw->kiocb.dio_complete = NULL;
+	rw->kiocb.ki_flags = 0;
 
 	rw->addr = READ_ONCE(sqe->addr);
 	rw->len = READ_ONCE(sqe->len);
 	rw->flags = READ_ONCE(sqe->rw_flags);
-	return io_prep_rw_setup(req, ddir, do_import);
+	ret = io_prep_rw_setup(req, ddir, do_import);
+
+	if (unlikely(ret))
+		return ret;
+
+	if (req->ctx->flags & IORING_SETUP_SQE128)
+		ret = io_prep_rw_meta(req, sqe, rw, ddir);
+	return ret;
 }
 
 int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -409,7 +457,10 @@  static inline loff_t *io_kiocb_update_pos(struct io_kiocb *req)
 static void io_resubmit_prep(struct io_kiocb *req)
 {
 	struct io_async_rw *io = req->async_data;
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
 
+	if (rw->kiocb.ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 	iov_iter_restore(&io->iter, &io->iter_state);
 }
 
@@ -794,7 +845,7 @@  static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 	if (!(req->flags & REQ_F_FIXED_FILE))
 		req->flags |= io_file_get_flags(file);
 
-	kiocb->ki_flags = file->f_iocb_flags;
+	kiocb->ki_flags |= file->f_iocb_flags;
 	ret = kiocb_set_rw_flags(kiocb, rw->flags, rw_type);
 	if (unlikely(ret))
 		return ret;
@@ -823,6 +874,18 @@  static int io_rw_init_file(struct io_kiocb *req, fmode_t mode, int rw_type)
 		kiocb->ki_complete = io_complete_rw;
 	}
 
+	if (kiocb->ki_flags & IOCB_HAS_METADATA) {
+		struct io_async_rw *io = req->async_data;
+
+		/*
+		 * We have a union of meta fields with wpq used for buffered-io
+		 * in io_async_rw, so fail it here.
+		 */
+		if (!(req->file->f_flags & O_DIRECT))
+			return -EOPNOTSUPP;
+		kiocb->private = &io->meta;
+	}
+
 	return 0;
 }
 
@@ -897,6 +960,8 @@  static int __io_read(struct io_kiocb *req, unsigned int issue_flags)
 	 * manually if we need to.
 	 */
 	iov_iter_restore(&io->iter, &io->iter_state);
+	if (kiocb->ki_flags & IOCB_HAS_METADATA)
+		io_meta_restore(io);
 
 	do {
 		/*
@@ -1101,6 +1166,8 @@  int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	} else {
 ret_eagain:
 		iov_iter_restore(&io->iter, &io->iter_state);
+		if (kiocb->ki_flags & IOCB_HAS_METADATA)
+			io_meta_restore(io);
 		if (kiocb->ki_flags & IOCB_WRITE)
 			io_req_end_write(req);
 		return -EAGAIN;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3f432dc75441..2d7656bd268d 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -2,6 +2,11 @@ 
 
 #include <linux/pagemap.h>
 
+struct io_meta_state {
+	u32			seed;
+	struct iov_iter_state	iter_meta;
+};
+
 struct io_async_rw {
 	size_t				bytes_done;
 	struct iov_iter			iter;
@@ -9,7 +14,14 @@  struct io_async_rw {
 	struct iovec			fast_iov;
 	struct iovec			*free_iovec;
 	int				free_iov_nr;
-	struct wait_page_queue		wpq;
+	/* wpq is for buffered io, while meta fields are used with direct io */
+	union {
+		struct wait_page_queue		wpq;
+		struct {
+			struct uio_meta			meta;
+			struct io_meta_state		meta_state;
+		};
+	};
 };
 
 int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe);