io_uring/rw: transform single vector readv/writev into ubuf

Message ID 43cb1fb7-b30b-8df1-bba6-e50797d680c6@kernel.dk
State New

Commit Message

Jens Axboe March 24, 2023, 2:35 p.m. UTC
It's very common to have applications that use vectored reads or writes,
even if they only pass in a single segment. Obviously they should be
using read/write at that point, but...

Vectored IO comes with the downside of needing to retain iovec state,
and hence requires an allocation and a state copy if the request ends
up getting deferred. Additionally, it also requires extra cleanup when
completed, as the allocated state memory has to be freed.
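
For context, this is what the pattern looks like from userspace; a
minimal liburing sketch (io_uring_prep_readv() and io_uring_prep_read()
are the standard liburing helpers, SQE exhaustion handling omitted):

	#include <liburing.h>
	#include <sys/uio.h>

	/*
	 * The common anti-pattern this patch targets: a vectored read
	 * with exactly one segment, functionally identical to a plain
	 * read. The iovec must stay valid until the SQE is submitted.
	 */
	static void queue_readv_one_segment(struct io_uring *ring, int fd,
					    const struct iovec *iov)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_readv(sqe, fd, iov, 1, 0);
	}

	/* What such an application should be doing instead: */
	static void queue_read(struct io_uring *ring, int fd, void *buf,
			       unsigned len)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_read(sqe, fd, buf, len, 0);
	}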

Automatically transform single segment IORING_OP_{READV,WRITEV} into
IORING_OP_{READ,WRITE}, and hence into an ITER_UBUF. Outside of being
more efficient when deferral is needed, ITER_UBUF is also more efficient
for normal processing compared to ITER_IOVEC, as it doesn't require
iteration. The latter is apparent when running peak testing, where
using IORING_OP_READV to randomly read 24 drives previously scored:

IOPS=72.54M, BW=35.42GiB/s, IOS/call=32/31
IOPS=73.35M, BW=35.81GiB/s, IOS/call=32/31
IOPS=72.71M, BW=35.50GiB/s, IOS/call=32/31
IOPS=73.29M, BW=35.78GiB/s, IOS/call=32/32
IOPS=73.45M, BW=35.86GiB/s, IOS/call=32/32
IOPS=73.19M, BW=35.74GiB/s, IOS/call=31/32
IOPS=72.89M, BW=35.59GiB/s, IOS/call=32/31
IOPS=73.07M, BW=35.68GiB/s, IOS/call=32/32

and after the change we get:

IOPS=77.31M, BW=37.75GiB/s, IOS/call=32/31
IOPS=77.32M, BW=37.75GiB/s, IOS/call=32/32
IOPS=77.45M, BW=37.81GiB/s, IOS/call=31/31
IOPS=77.47M, BW=37.83GiB/s, IOS/call=32/32
IOPS=77.14M, BW=37.67GiB/s, IOS/call=32/32
IOPS=77.14M, BW=37.66GiB/s, IOS/call=31/31
IOPS=77.37M, BW=37.78GiB/s, IOS/call=32/32
IOPS=77.25M, BW=37.72GiB/s, IOS/call=32/32

which is a nice win as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

---

Comments

Ming Lei March 24, 2023, 10:41 p.m. UTC | #1
On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> It's very common to have applications that use vectored reads or writes,
> even if they only pass in a single segment. Obviously they should be
> using read/write at that point, but...

Yeah, it is like fixing an application issue on the kernel side, :-)

> 
> Vectored IO comes with the downside of needing to retain iovec state,
> and hence requires an allocation and a state copy if the request ends
> up getting deferred. Additionally, it also requires extra cleanup when
> completed, as the allocated state memory has to be freed.
> 
> Automatically transform single segment IORING_OP_{READV,WRITEV} into
> IORING_OP_{READ,WRITE}, and hence into an ITER_UBUF. Outside of being
> more efficient when deferral is needed, ITER_UBUF is also more efficient
> for normal processing compared to ITER_IOVEC, as it doesn't require
> iteration. The latter is apparent when running peak testing, where
> using IORING_OP_READV to randomly read 24 drives previously scored:
> 
> IOPS=72.54M, BW=35.42GiB/s, IOS/call=32/31
> IOPS=73.35M, BW=35.81GiB/s, IOS/call=32/31
> IOPS=72.71M, BW=35.50GiB/s, IOS/call=32/31
> IOPS=73.29M, BW=35.78GiB/s, IOS/call=32/32
> IOPS=73.45M, BW=35.86GiB/s, IOS/call=32/32
> IOPS=73.19M, BW=35.74GiB/s, IOS/call=31/32
> IOPS=72.89M, BW=35.59GiB/s, IOS/call=32/31
> IOPS=73.07M, BW=35.68GiB/s, IOS/call=32/32
> 
> and after the change we get:
> 
> IOPS=77.31M, BW=37.75GiB/s, IOS/call=32/31
> IOPS=77.32M, BW=37.75GiB/s, IOS/call=32/32
> IOPS=77.45M, BW=37.81GiB/s, IOS/call=31/31
> IOPS=77.47M, BW=37.83GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.67GiB/s, IOS/call=32/32
> IOPS=77.14M, BW=37.66GiB/s, IOS/call=31/31
> IOPS=77.37M, BW=37.78GiB/s, IOS/call=32/32
> IOPS=77.25M, BW=37.72GiB/s, IOS/call=32/32
> 
> which is a nice win as well.
> 
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> 
> ---
> 
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 4c233910e200..5c998754cb17 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
>  			      req->ctx->compat);
>  	if (unlikely(ret < 0))
>  		return ERR_PTR(ret);
> -	return iovec;
> +	if (iter->nr_segs != 1)
> +		return iovec;
> +	/*
> +	 * Convert to non-vectored request if we have a single segment. If we
> +	 * need to defer the request, then we no longer have to allocate and
> +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> +	 * to do at completion time
> +	 */
> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> +	rw->len = iter->iov[0].iov_len;
> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> +	/* readv -> read distance is the same as writev -> write */
> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +			(IORING_OP_WRITE - IORING_OP_WRITEV));
> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);

It is a bit fragile to change ->opcode, which may need matched callbacks for
the two OPs, and it also causes inconsistent opcodes in traces.

I am wondering why not play this magic in io_prep_rw() from the beginning?


Thanks,
Ming
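
For reference, the opcode arithmetic being discussed works because the
IORING_OP_* values are fixed UAPI constants, and the readv -> read
distance happens to equal the writev -> write distance (21 with the
current <linux/io_uring.h> values: IORING_OP_READV = 1, IORING_OP_WRITEV
= 2, IORING_OP_READ = 22, IORING_OP_WRITE = 23). A userspace-compilable
equivalent of the patch's BUILD_BUG_ON:

	#include <linux/io_uring.h>

	/*
	 * If the UAPI opcode spacing ever changed, the
	 * "req->opcode += ..." rewrite would map readv/writev onto the
	 * wrong opcodes, so assert the distances match at compile time.
	 */
	_Static_assert((IORING_OP_READ - IORING_OP_READV) ==
		       (IORING_OP_WRITE - IORING_OP_WRITEV),
		       "readv->read distance must equal writev->write");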
Jens Axboe March 24, 2023, 11:06 p.m. UTC | #2
On 3/24/23 4:41 PM, Ming Lei wrote:
> On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
>> It's very common to have applications that use vectored reads or writes,
>> even if they only pass in a single segment. Obviously they should be
>> using read/write at that point, but...
> 
> Yeah, it is like fixing an application issue on the kernel side, :-)

It totally is, the same thing happens all of the time for readv as well.
No amount of informing or documenting will ever fix that...

Also see:

https://lore.kernel.org/linux-fsdevel/20230324204443.45950-1-axboe@kernel.dk/

with which I think I'll change this one to just be:

	if (iter->iter_type == ITER_UBUF) {
		rw->addr = (unsigned long) iter->ubuf;
		rw->len = iter->count;
		/* readv -> read distance is the same as writev -> write */
		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
			     (IORING_OP_WRITE - IORING_OP_WRITEV));
		req->opcode += (IORING_OP_READ - IORING_OP_READV);
	}

instead.

We could also just skip it completely and just have liburing do the
right thing if io_uring_prep_readv/writev is called with nr_segs == 1.
Just turn it into a READ/WRITE at that point. If we do that, and with
the above generic change, it's probably Good Enough. If you use
READV/WRITEV and you're using the raw interface, then you're on your
own.
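
A minimal sketch of what that liburing-side fix-up could look like; this
is a hypothetical wrapper, not what io_uring_prep_readv() actually does:

	#include <liburing.h>
	#include <sys/uio.h>

	/*
	 * Hypothetical prep helper: collapse a single-segment vectored
	 * read into a plain IORING_OP_READ, otherwise fall through to
	 * the normal vectored prep.
	 */
	static inline void prep_readv_collapsed(struct io_uring_sqe *sqe,
						int fd,
						const struct iovec *iovecs,
						unsigned nr_vecs,
						__u64 offset)
	{
		if (nr_vecs == 1) {
			io_uring_prep_read(sqe, fd, iovecs[0].iov_base,
					   iovecs[0].iov_len, offset);
			return;
		}
		io_uring_prep_readv(sqe, fd, iovecs, nr_vecs, offset);
	}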

>> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
>> +	rw->len = iter->iov[0].iov_len;
>> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
>> +	/* readv -> read distance is the same as writev -> write */
>> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
>> +			(IORING_OP_WRITE - IORING_OP_WRITEV));
>> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
> 
> It is a bit fragile to change ->opcode, which may need matched
> callbacks for the two OPs, and it also causes inconsistent opcodes in
> traces.
> 
> I am wondering why not play this magic in io_prep_rw() from the beginning?

It has to be done when importing the vec, we cannot really do it in
prep... Well we could, but that'd be adding a bunch more code and
duplicating part of the vec import.
Keith Busch March 24, 2023, 11:54 p.m. UTC | #3
On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
>  			      req->ctx->compat);
>  	if (unlikely(ret < 0))
>  		return ERR_PTR(ret);
> -	return iovec;
> +	if (iter->nr_segs != 1)
> +		return iovec;
> +	/*
> +	 * Convert to non-vectored request if we have a single segment. If we
> +	 * need to defer the request, then we no longer have to allocate and
> +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> +	 * to do at completion time
> +	 */
> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> +	rw->len = iter->iov[0].iov_len;
> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> +	/* readv -> read distance is the same as writev -> write */
> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +			(IORING_OP_WRITE - IORING_OP_WRITEV));
> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
> +	return NULL;
>  }

This may break anyone using io_uring with those bizarre drivers that have
entirely different readv semantics from normal read. I think we can safely say
no one's using io_uring for such interfaces, so it's probably a moot point. If
you wanted to be extra cautious though, you may want to skip this transformation
if the file->f_op implements both .read+.read_iter and .write+.write_iter.
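
For illustration, that extra-cautious guard might look like the sketch
below against struct file_operations; a hypothetical helper, not part of
the patch (and see #5 below, where the concern is withdrawn):

	#include <linux/fs.h>

	/*
	 * Skip the readv -> read transformation if the file implements
	 * both the legacy ->read/->write hooks and the iter-based ones,
	 * since such a driver could in theory give them different
	 * semantics.
	 */
	static bool io_file_has_dual_rw_hooks(struct file *file)
	{
		const struct file_operations *fop = file->f_op;

		return (fop->read && fop->read_iter) ||
		       (fop->write && fop->write_iter);
	}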
Ming Lei March 25, 2023, 12:24 a.m. UTC | #4
On Fri, Mar 24, 2023 at 05:06:00PM -0600, Jens Axboe wrote:
> On 3/24/23 4:41 PM, Ming Lei wrote:
> > On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> >> It's very common to have applications that use vectored reads or writes,
> >> even if they only pass in a single segment. Obviously they should be
> >> using read/write at that point, but...
> > 
> > Yeah, it is like fixing an application issue on the kernel side, :-)
> 
> It totally is, the same thing happens all of the time for readv as well.
> No amount of informing or documenting will ever fix that...
> 
> Also see:
> 
> https://lore.kernel.org/linux-fsdevel/20230324204443.45950-1-axboe@kernel.dk/
> 
> with which I think I'll change this one to just be:
> 
> 	if (iter->iter_type == ITER_UBUF) {
> 		rw->addr = (unsigned long) iter->ubuf;
> 		rw->len = iter->count;
> 		/* readv -> read distance is the same as writev -> write */
> 		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> 			     (IORING_OP_WRITE - IORING_OP_WRITEV));
> 		req->opcode += (IORING_OP_READ - IORING_OP_READV);
> 	}
> 
> instead.
> 
> We could also just skip it completely and just have liburing do the
> right thing if io_uring_prep_readv/writev is called with nr_segs == 1.
> Just turn it into a READ/WRITE at that point. If we do that, and with
> the above generic change, it's probably Good Enough. If you use
> READV/WRITEV and you're using the raw interface, then you're on your
> own.
> 
> >> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> >> +	rw->len = iter->iov[0].iov_len;
> >> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> >> +	/* readv -> read distance is the same as writev -> write */
> >> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> >> +			(IORING_OP_WRITE - IORING_OP_WRITEV));
> >> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
> > 
> > It is a bit fragile to change ->opcode, which may need matched
> > callbacks for the two OPs, and it also causes inconsistent opcodes in
> > traces.
> >
> > I am wondering why not play this magic in io_prep_rw() from the beginning?
> 
> It has to be done when importing the vec, we cannot really do it in
> prep... Well we could, but that'd be adding a bunch more code and
> duplicating part of the vec import.

I meant something like the following (untested), which at least
guarantees that the opcode and rw->addr/len are finalized from ->prep().

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 0c292ef9a40f..4bf4c3effdac 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -120,6 +120,25 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 			return ret;
 	}
 
+	if ((req->opcode == IORING_OP_READV || req->opcode == IORING_OP_WRITEV) &&
+			rw->len == 1) {
+		struct iovec iov;
+		struct iovec *iovp;
+
+		iovp = iovec_from_user(u64_to_user_ptr(rw->addr), 1, 1, &iov,
+					req->ctx->compat);
+		if (IS_ERR(iovp))
+			return PTR_ERR(iovp);
+
+		rw->addr = (unsigned long) iovp->iov_base;
+		rw->len = iovp->iov_len;
+
+		/* readv -> read distance is the same as writev -> write */
+		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
+				(IORING_OP_WRITE - IORING_OP_WRITEV));
+		req->opcode += (IORING_OP_READ - IORING_OP_READV);
+	}
+
 	return 0;
 }
 


Thanks,
Ming
Keith Busch March 25, 2023, 1:06 a.m. UTC | #5
On Fri, Mar 24, 2023 at 05:54:36PM -0600, Keith Busch wrote:
> On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
> > @@ -402,7 +402,22 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
> >  			      req->ctx->compat);
> >  	if (unlikely(ret < 0))
> >  		return ERR_PTR(ret);
> > -	return iovec;
> > +	if (iter->nr_segs != 1)
> > +		return iovec;
> > +	/*
> > +	 * Convert to non-vectored request if we have a single segment. If we
> > +	 * need to defer the request, then we no longer have to allocate and
> > +	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
> > +	 * to do at completion time
> > +	 */
> > +	rw->addr = (unsigned long) iter->iov[0].iov_base;
> > +	rw->len = iter->iov[0].iov_len;
> > +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
> > +	/* readv -> read distance is the same as writev -> write */
> > +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> > +			(IORING_OP_WRITE - IORING_OP_WRITEV));
> > +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
> > +	return NULL;
> >  }
> 
> This may break anyone using io_uring with those bizarre drivers that have
> entirely different readv semantics from normal read. I think we can safely say
> no one's using io_uring for such interfaces, so it's probably a moot point. If
> you wanted to be extra cautious though, you may want to skip this transformation
> if the file->f_op implements both .read+.read_iter and .write+.write_iter.

Sorry, scratch that; those drivers are already broken if they implement both
with different semantics, so we can definitely say none of those are using
io_uring. This new patch should be safe then.
Pavel Begunkov March 27, 2023, 11:45 a.m. UTC | #6
On 3/25/23 00:24, Ming Lei wrote:
> On Fri, Mar 24, 2023 at 05:06:00PM -0600, Jens Axboe wrote:
>> On 3/24/23 4:41 PM, Ming Lei wrote:
>>> On Fri, Mar 24, 2023 at 08:35:38AM -0600, Jens Axboe wrote:
>>>> It's very common to have applications that use vectored reads or writes,
>>>> even if they only pass in a single segment. Obviously they should be
>>>> using read/write at that point, but...
>>>
>>> Yeah, it is like fixing an application issue on the kernel side, :-)
>>
>> It totally is, the same thing happens all of the time for readv as well.
>> No amount of informing or documenting will ever fix that...
>>
>> Also see:
>>
>> https://lore.kernel.org/linux-fsdevel/20230324204443.45950-1-axboe@kernel.dk/
>>
>> with which I think I'll change this one to just be:
>>
>> 	if (iter->iter_type == ITER_UBUF) {
>> 		rw->addr = (unsigned long) iter->ubuf;
>> 		rw->len = iter->count;
>> 		/* readv -> read distance is the same as writev -> write */
>> 		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
>> 			     (IORING_OP_WRITE - IORING_OP_WRITEV));
>> 		req->opcode += (IORING_OP_READ - IORING_OP_READV);
>> 	}
>>
>> instead.
>>
>> We could also just skip it completely and just have liburing do the
>> right thing if io_uring_prep_readv/writev is called with nr_segs == 1.
>> Just turn it into a READ/WRITE at that point. If we do that, and with
>> the above generic change, it's probably Good Enough. If you use
>> READV/WRITEV and you're using the raw interface, then you're on your
>> own.

I like this option, but sendmsg and recvmsg probably do need the same
fix-up, which is more justified as they can't be converted to
send/recv like this.

Another option is to internally detangle opcodes from iter types.

import() {
     if (req->op == READV)
         import_iovec();
     else
         import_buf();
}

would get replaced with:

prep() {
     if (req->op == READV)
         req->flags = REQ_F_IOVEC;
}

import() {
     if (req->flags & REQ_F_IOVEC)
         import_iovec();
     else
         import_buf();
}
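
A slightly fuller rendering of that detangling, with REQ_F_IOVEC as a
hypothetical flag bit and the helper names assumed rather than taken
from the tree:

	/* Hypothetical flag; real REQ_F_* bits live in the io_uring core */
	#define REQ_F_IOVEC	(1U << 30)

	static int rw_prep(struct io_kiocb *req)
	{
		/* decide the iter type once, at prep time */
		if (req->opcode == IORING_OP_READV ||
		    req->opcode == IORING_OP_WRITEV)
			req->flags |= REQ_F_IOVEC;
		return 0;
	}

	static int rw_import(struct io_kiocb *req, int ddir,
			     struct iov_iter *iter)
	{
		/* import keys off the flag, not the (mutable) opcode */
		if (req->flags & REQ_F_IOVEC)
			return import_req_iovec(req, ddir, iter); /* assumed */
		return import_req_ubuf(req, ddir, iter);	  /* assumed */
	}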


>>>> +	rw->addr = (unsigned long) iter->iov[0].iov_base;
>>>> +	rw->len = iter->iov[0].iov_len;
>>>> +	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
>>>> +	/* readv -> read distance is the same as writev -> write */
>>>> +	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
>>>> +			(IORING_OP_WRITE - IORING_OP_WRITEV));
>>>> +	req->opcode += (IORING_OP_READ - IORING_OP_READV);
>>>
>>> It is a bit fragile to change ->opcode, which may need matched
>>> callbacks for the two OPs, and it also causes inconsistent opcodes in
>>> traces.
>>>
>>> I am wondering why not play this magic in io_prep_rw() from the beginning?
>>
>> It has to be done when importing the vec, we cannot really do it in
>> prep... Well we could, but that'd be adding a bunch more code and
>> duplicating part of the vec import.
> 
> I meant something like the following (untested), which at least
> guarantees that the opcode and rw->addr/len are finalized from ->prep().

It sounds like a better approach. With opcode machinations it's easy
to forget about some piece of state, which could be fatal.

Take IOSQE_ASYNC for example. The core code will allocate
async_data and do io_readv_prep_async() -> import_iovec(), which
inside changes the opcode. It'll be a problem if io_readv_prep_async()
forgets that it might be a different opcode with a slightly different req
layout, or if the now non-vectored read does something weird with
->async_data or mishandles REQ_F_NEED_CLEANUP.

fwiw, needs compat handling, i.e. leave as iovec if compat
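
In the prep-time version quoted below, that could be a simple early-out
ahead of the conversion; a minimal sketch:

	/*
	 * Per the compat note above: skip the transformation for compat
	 * tasks and leave the request as a vectored (iovec) import.
	 */
	if (req->ctx->compat)
		return 0;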

> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 0c292ef9a40f..4bf4c3effdac 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -120,6 +120,25 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>   			return ret;
>   	}
>   
> +	if ((req->opcode == IORING_OP_READV || req->opcode == IORING_OP_WRITEV) &&
> +			rw->len == 1) {
> +		struct iovec iov;
> +		struct iovec *iovp;
> +
> +		iovp = iovec_from_user(u64_to_user_ptr(rw->addr), 1, 1, &iov,
> +					req->ctx->compat);
> +		if (IS_ERR(iovp))
> +			return PTR_ERR(iovp);
> +
> +		rw->addr = (unsigned long) iovp->iov_base;
> +		rw->len = iovp->iov_len;
> +
> +		/* readv -> read distance is the same as writev -> write */
> +		BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
> +				(IORING_OP_WRITE - IORING_OP_WRITEV));
> +		req->opcode += (IORING_OP_READ - IORING_OP_READV);
> +	}
> +
>   	return 0;
>   }

Patch

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 4c233910e200..5c998754cb17 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -402,7 +402,22 @@  static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 			      req->ctx->compat);
 	if (unlikely(ret < 0))
 		return ERR_PTR(ret);
-	return iovec;
+	if (iter->nr_segs != 1)
+		return iovec;
+	/*
+	 * Convert to non-vectored request if we have a single segment. If we
+	 * need to defer the request, then we no longer have to allocate and
+	 * maintain a struct io_async_rw. Additionally, we won't have cleanup
+	 * to do at completion time
+	 */
+	rw->addr = (unsigned long) iter->iov[0].iov_base;
+	rw->len = iter->iov[0].iov_len;
+	iov_iter_ubuf(iter, ddir, iter->iov[0].iov_base, rw->len);
+	/* readv -> read distance is the same as writev -> write */
+	BUILD_BUG_ON((IORING_OP_READ - IORING_OP_READV) !=
+			(IORING_OP_WRITE - IORING_OP_WRITEV));
+	req->opcode += (IORING_OP_READ - IORING_OP_READV);
+	return NULL;
 }
 
 static inline int io_import_iovec(int rw, struct io_kiocb *req,