
[15/17] io_uring/uring_cmd: defer SQE copying until we need it

Message ID: 20240320225750.1769647-16-axboe@kernel.dk (mailing list archive)
State: New
Series: Improve async state handling

Commit Message

Jens Axboe March 20, 2024, 10:55 p.m. UTC
The previous commit turned on async data for uring_cmd, and did the
basic conversion of setting everything up on the prep side. However, for
a lot of use cases, we'll get -EIOCBQUEUED on issue, which means we do
not need a persistent copy of the big SQE.

Unless we're going async immediately, defer copying the big SQE until
we know we have to.

This greatly reduces the overhead of such commands, as evidenced by
a perf diff from before and after this change:

    10.60%     -8.58%  [kernel.vmlinux]  [k] io_uring_cmd_prep

where the prep side drops from 10.60% to ~2%, which is more in line
with expectations. Performance also rises from ~113M IOPS to ~122M IOPS,
bringing us back to where it was before the async command prep.

Signed-off-by: Jens Axboe <axboe@kernel.dk>

---
 io_uring/uring_cmd.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)
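
As background for readers less familiar with this path, below is a minimal
userspace sketch (using liburing) of the kind of submission the patch
optimizes. The file descriptor, command opcode and payload are purely
illustrative; the real payload layout is defined by whatever driver implements
->uring_cmd() (e.g. struct nvme_uring_cmd for NVMe passthrough). The point is
that the driver-specific command lives in the extended area of a 128-byte SQE
(sqe->cmd), and that extended SQE is what prep would otherwise have to copy up
front:

    #include <liburing.h>
    #include <string.h>

    /*
     * Queue one IORING_OP_URING_CMD on a ring that was created with
     * IORING_SETUP_SQE128, e.g.:
     *
     *         io_uring_queue_init(depth, &ring, IORING_SETUP_SQE128);
     *
     * 'payload' is the driver-defined command blob; it must fit in the
     * big SQE's command area.
     */
    static int submit_uring_cmd(struct io_uring *ring, int fd,
                                unsigned int cmd_op,
                                const void *payload, size_t len)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -EBUSY;
            memset(sqe, 0, 2 * sizeof(*sqe));   /* clear both 64-byte halves */
            sqe->opcode = IORING_OP_URING_CMD;
            sqe->fd = fd;
            sqe->cmd_op = cmd_op;               /* driver-defined sub-opcode */
            memcpy(sqe->cmd, payload, len);     /* fills the big-SQE area */
            return io_uring_submit(ring);
    }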

Comments

Anuj Gupta March 25, 2024, 12:41 p.m. UTC | #1
On Thu, Mar 21, 2024 at 4:28 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> The previous commit turned on async data for uring_cmd, and did the
> basic conversion of setting everything up on the prep side. However, for
> a lot of use cases, we'll get -EIOCBQUEUED on issue, which means we do
> not need a persistent copy of the big SQE.
>
> Unless we're going async immediately, defer copying the big SQE until
> we know we have to.
>
> This greatly reduces the overhead of such commands, as evidenced by
> a perf diff from before and after this change:
>
>     10.60%     -8.58%  [kernel.vmlinux]  [k] io_uring_cmd_prep
>
> where the prep side drops from 10.60% to ~2%, which is more in line
> with expectations. Performance also rises from ~113M IOPS to ~122M IOPS,
> bringing us back to where it was before the async command prep.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> ---
>  io_uring/uring_cmd.c | 25 +++++++++++++++++++------
>  1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index 9bd0ba87553f..92346b5d9f5b 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -182,12 +182,18 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
>         struct uring_cache *cache;
>
>         cache = io_uring_async_get(req);
> -       if (cache) {
> -               memcpy(cache->sqes, sqe, uring_sqe_size(req->ctx));
> -               ioucmd->sqe = req->async_data;
> +       if (unlikely(!cache))
> +               return -ENOMEM;
> +
> +       if (!(req->flags & REQ_F_FORCE_ASYNC)) {
> +               /* defer memcpy until we need it */
> +               ioucmd->sqe = sqe;
>                 return 0;
>         }
> -       return -ENOMEM;
> +
> +       memcpy(req->async_data, sqe, uring_sqe_size(req->ctx));
> +       ioucmd->sqe = req->async_data;
> +       return 0;
>  }
>
>  int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> @@ -245,8 +251,15 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>         }
>
>         ret = file->f_op->uring_cmd(ioucmd, issue_flags);
> -       if (ret == -EAGAIN || ret == -EIOCBQUEUED)
> -               return ret;
> +       if (ret == -EAGAIN) {
> +               struct uring_cache *cache = req->async_data;
> +
> +               if (ioucmd->sqe != (void *) cache)
> +                       memcpy(cache, ioucmd->sqe, uring_sqe_size(req->ctx));
> +               return -EAGAIN;
> +       } else if (ret == -EIOCBQUEUED) {
> +               return -EIOCBQUEUED;
> +       }
>
>         if (ret < 0)
>                 req_set_fail(req);
> --
> 2.43.0
>
>

The io_uring_cmd plumbing part of this series looks good to me.
I tested it with io_uring nvme-passthrough on my setup with two
Optanes, and there is no drop in performance either [1].
For this and the previous patch,

Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>

[1]
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -O0 -F1 -B1 -u1 -n2 -r4 /dev/ng0n1 /dev/ng2n1
submitter=1, tid=7166, file=/dev/ng2n1, nfiles=1, node=-1
submitter=0, tid=7165, file=/dev/ng0n1, nfiles=1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=1, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M
--
Anuj Gupta
Jens Axboe March 25, 2024, 2:55 p.m. UTC | #2
On 3/25/24 6:41 AM, Anuj Gupta wrote:
> On Thu, Mar 21, 2024 at 4:28 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> The previous commit turned on async data for uring_cmd, and did the
>> basic conversion of setting everything up on the prep side. However, for
>> a lot of use cases, we'll get -EIOCBQUEUED on issue, which means we do
>> not need a persistent copy of the big SQE.
>>
>> Unless we're going async immediately, defer copying the big SQE until
>> we know we have to.
>>
>> This greatly reduces the overhead of such commands, as evidenced by
>> a perf diff from before and after this change:
>>
>>     10.60%     -8.58%  [kernel.vmlinux]  [k] io_uring_cmd_prep
>>
>> where the prep side drops from 10.60% to ~2%, which is more in line
>> with expectations. Performance also rises from ~113M IOPS to ~122M IOPS,
>> bringing us back to where it was before the async command prep.
>>
>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>>
>> ---
>>  io_uring/uring_cmd.c | 25 +++++++++++++++++++------
>>  1 file changed, 19 insertions(+), 6 deletions(-)
>>
>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>> index 9bd0ba87553f..92346b5d9f5b 100644
>> --- a/io_uring/uring_cmd.c
>> +++ b/io_uring/uring_cmd.c
>> @@ -182,12 +182,18 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
>>         struct uring_cache *cache;
>>
>>         cache = io_uring_async_get(req);
>> -       if (cache) {
>> -               memcpy(cache->sqes, sqe, uring_sqe_size(req->ctx));
>> -               ioucmd->sqe = req->async_data;
>> +       if (unlikely(!cache))
>> +               return -ENOMEM;
>> +
>> +       if (!(req->flags & REQ_F_FORCE_ASYNC)) {
>> +               /* defer memcpy until we need it */
>> +               ioucmd->sqe = sqe;
>>                 return 0;
>>         }
>> -       return -ENOMEM;
>> +
>> +       memcpy(req->async_data, sqe, uring_sqe_size(req->ctx));
>> +       ioucmd->sqe = req->async_data;
>> +       return 0;
>>  }
>>
>>  int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> @@ -245,8 +251,15 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
>>         }
>>
>>         ret = file->f_op->uring_cmd(ioucmd, issue_flags);
>> -       if (ret == -EAGAIN || ret == -EIOCBQUEUED)
>> -               return ret;
>> +       if (ret == -EAGAIN) {
>> +               struct uring_cache *cache = req->async_data;
>> +
>> +               if (ioucmd->sqe != (void *) cache)
>> +                       memcpy(cache, ioucmd->sqe, uring_sqe_size(req->ctx));
>> +               return -EAGAIN;
>> +       } else if (ret == -EIOCBQUEUED) {
>> +               return -EIOCBQUEUED;
>> +       }
>>
>>         if (ret < 0)
>>                 req_set_fail(req);
>> --
>> 2.43.0
>>
>>
> 
> The io_uring_cmd plumbing part of this series looks good to me.
> I tested it with io_uring nvme-passthrough on my setup with two
> Optanes, and there is no drop in performance either [1].
> For this and the previous patch,
> 
> Tested-by: Anuj Gupta <anuj20.g@samsung.com>
> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>

Thanks for reviewing and testing, will add.

Patch

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 9bd0ba87553f..92346b5d9f5b 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -182,12 +182,18 @@  static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 	struct uring_cache *cache;
 
 	cache = io_uring_async_get(req);
-	if (cache) {
-		memcpy(cache->sqes, sqe, uring_sqe_size(req->ctx));
-		ioucmd->sqe = req->async_data;
+	if (unlikely(!cache))
+		return -ENOMEM;
+
+	if (!(req->flags & REQ_F_FORCE_ASYNC)) {
+		/* defer memcpy until we need it */
+		ioucmd->sqe = sqe;
 		return 0;
 	}
-	return -ENOMEM;
+
+	memcpy(req->async_data, sqe, uring_sqe_size(req->ctx));
+	ioucmd->sqe = req->async_data;
+	return 0;
 }
 
 int io_uring_cmd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
@@ -245,8 +251,15 @@  int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 	}
 
 	ret = file->f_op->uring_cmd(ioucmd, issue_flags);
-	if (ret == -EAGAIN || ret == -EIOCBQUEUED)
-		return ret;
+	if (ret == -EAGAIN) {
+		struct uring_cache *cache = req->async_data;
+
+		if (ioucmd->sqe != (void *) cache)
+			memcpy(cache, ioucmd->sqe, uring_sqe_size(req->ctx));
+		return -EAGAIN;
+	} else if (ret == -EIOCBQUEUED) {
+		return -EIOCBQUEUED;
+	}
 
 	if (ret < 0)
 		req_set_fail(req);
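
The reason the copy can be deferred at all is the contract with the driver's
->uring_cmd() handler: a return of -EIOCBQUEUED means the handler has already
consumed what it needs from the SQE, so the ring-resident SQE no longer has to
stay valid, while -EAGAIN from a nonblocking attempt means the command will be
retried later, which is exactly the case the -EAGAIN branch above now handles
by copying the big SQE into the cached async data. A minimal handler sketch is
below; everything prefixed demo_ is hypothetical, and only the f_op hook shape,
io_uring_sqe_cmd() and the IO_URING_F_NONBLOCK/-EIOCBQUEUED conventions are
taken from the kernel:

    #include <linux/io_uring/cmd.h>

    struct demo_cmd;    /* hypothetical, driver-defined command layout */
    bool demo_would_block(const struct demo_cmd *cmd);
    void demo_queue_to_hw(const struct demo_cmd *cmd,
                          struct io_uring_cmd *ioucmd);

    static int demo_uring_cmd(struct io_uring_cmd *ioucmd,
                              unsigned int issue_flags)
    {
            const struct demo_cmd *cmd = io_uring_sqe_cmd(ioucmd->sqe);

            /*
             * Cannot make progress without blocking: bail with -EAGAIN
             * and let io_uring retry from io-wq. With this patch, the
             * core only copies the big SQE at that point (see the
             * -EAGAIN branch in io_uring_cmd() above).
             */
            if (demo_would_block(cmd) && (issue_flags & IO_URING_F_NONBLOCK))
                    return -EAGAIN;

            /* Command handed off; the SQE contents are fully consumed. */
            demo_queue_to_hw(cmd, ioucmd);
            return -EIOCBQUEUED;
    }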