mbox series

[RFC,0/4] Read/Write with meta buffer

Message ID 20240322185023.131697-1-joshi.k@samsung.com (mailing list archive)
Headers show
Series Read/Write with meta buffer | expand

Message

Kanchan Joshi March 22, 2024, 6:50 p.m. UTC
This patchset is aimed at getting the feedback on a new io_uring
interface that userspace can use to exchange meta buffer along with
read/write.

Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
The leftover space in the SQE is used to send meta buffer pointer
and its length. Patch #2 for this.

The interface is supported for block direct IO. Patch #4 for this.
Other two are prep patches.

It has been tried not to touch the hot read/write path, as much as
possible. Performance for non-meta IO is same after the patches [2].
There is some code in the cold path (worker-based async)
though.

Moderately tested by modifying the fio [1] to use this interface
(only for NVMe block devices)

[1] https://github.com/OpenMPDK/fio/tree/feat/test-meta

[2]
without this:

taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

With this:
taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

Anuj Gupta (3):
  io_uring/rw: Get rid of flags field in struct io_rw
  io_uring/rw: support read/write with metadata
  block: modify bio_integrity_map_user to accept iov_iter as argument

Kanchan Joshi (1):
  block: add support to pass the meta buffer

 block/bio-integrity.c         |  27 ++++++---
 block/fops.c                  |   9 +++
 block/t10-pi.c                |   6 ++
 drivers/nvme/host/ioctl.c     |  11 +++-
 include/linux/bio.h           |  13 +++-
 include/linux/fs.h            |   1 +
 include/uapi/linux/io_uring.h |   6 ++
 io_uring/io_uring.c           |   2 +
 io_uring/opdef.c              |  29 +++++++++
 io_uring/rw.c                 | 108 +++++++++++++++++++++++++++++-----
 io_uring/rw.h                 |   8 +++
 11 files changed, 193 insertions(+), 27 deletions(-)


base-commit: 6f0974eccbf78baead1735722c4f1ee3eb9422cd

Comments

Jens Axboe March 27, 2024, 11:38 p.m. UTC | #1
On 3/22/24 12:50 PM, Kanchan Joshi wrote:
> This patchset is aimed at getting the feedback on a new io_uring
> interface that userspace can use to exchange meta buffer along with
> read/write.
> 
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send meta buffer pointer
> and its length. Patch #2 for this.
> 
> The interface is supported for block direct IO. Patch #4 for this.
> Other two are prep patches.
> 
> It has been tried not to touch the hot read/write path, as much as
> possible. Performance for non-meta IO is same after the patches [2].
> There is some code in the cold path (worker-based async)
> though.

This patchset should look cleaner if you rebase it on top of the current
for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
that'll need doing anyway, could you repost a v2 where it's rebased on
top of that?

Also in terms of the cover letter, would be good with a bit more of a
description of what this enables. It's a bit scant on detail on what
exactly this gives you.

> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
> 
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31

Not that I don't believe you, but that looks like you pasted the same
stuff in there twice? It's the exact same perf and pids.
Kanchan Joshi March 28, 2024, 12:03 p.m. UTC | #2
On 3/28/2024 5:08 AM, Jens Axboe wrote:

> This patchset should look cleaner if you rebase it on top of the current
> for-6.10/io_uring branch, as it gets rid of the async nastiness. Since
> that'll need doing anyway, could you repost a v2 where it's rebased on
> top of that?

Yes, next iteration will use that as the base.

> Also in terms of the cover letter, would be good with a bit more of a
> description of what this enables. It's a bit scant on detail on what
> exactly this gives you.

Will fix that.
But currently the only thing it gives is - pass meta buffer to/from the 
block-device.

It keeps things simple, and fine for PI type 0 (normal unprotected IO).
For other PI types, exposing few knobs may help. Using "sqe->rw_flags" 
if there is no other way.

>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
>>
>> With this:
>> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
>> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
>> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
>> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
>> Engine=io_uring, sq_ring=128, cq_ring=128
>> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
>> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
> 
> Not that I don't believe you, but that looks like you pasted the same
> stuff in there twice? It's the exact same perf and pids.

Indeed :-(
Made a goof-up while pasting stuff [1] to the cover letter.

[1]
Before the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 
/dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M

After the patch:
# taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 
/dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=2412, file=/dev/nvme1n1, node=-1
submitter=0, tid=2411, file=/dev/nvme0n1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
IOPS=10.03M, BW=4.90GiB/s, IOS/call=31/31
IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.04M
Pavel Begunkov April 6, 2024, 9:30 p.m. UTC | #3
On 3/22/24 18:50, Kanchan Joshi wrote:
> This patchset is aimed at getting the feedback on a new io_uring
> interface that userspace can use to exchange meta buffer along with
> read/write.
> 
> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
> The leftover space in the SQE is used to send meta buffer pointer
> and its length. Patch #2 for this.

I do remember there were back and forth design discussions about that
back when some other guy attempted to implement it, but have you tried
to do it not as a separate opcode?

It reads like all read/write opcodes might benefit from it, and it'd
be unfortunate to then be adding IORING_OP_READ_META_FIXED and
multiplicatively all other variants.

> The interface is supported for block direct IO. Patch #4 for this.
> Other two are prep patches.
> 
> It has been tried not to touch the hot read/write path, as much as
> possible. Performance for non-meta IO is same after the patches [2].
> There is some code in the cold path (worker-based async)
> though.
> 
> Moderately tested by modifying the fio [1] to use this interface
> (only for NVMe block devices)
> 
> [1] https://github.com/OpenMPDK/fio/tree/feat/test-meta
> 
> [2]
> without this:
> 
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
> 
> With this:
> taskset -c 2,5 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r4 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=2453, file=/dev/nvme1n1, node=-1
> submitter=0, tid=2452, file=/dev/nvme0n1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=10.02M, BW=4.89GiB/s, IOS/call=31/31
> IOPS=10.04M, BW=4.90GiB/s, IOS/call=31/31
> 
> Anuj Gupta (3):
>    io_uring/rw: Get rid of flags field in struct io_rw
>    io_uring/rw: support read/write with metadata
>    block: modify bio_integrity_map_user to accept iov_iter as argument
> 
> Kanchan Joshi (1):
>    block: add support to pass the meta buffer
> 
>   block/bio-integrity.c         |  27 ++++++---
>   block/fops.c                  |   9 +++
>   block/t10-pi.c                |   6 ++
>   drivers/nvme/host/ioctl.c     |  11 +++-
>   include/linux/bio.h           |  13 +++-
>   include/linux/fs.h            |   1 +
>   include/uapi/linux/io_uring.h |   6 ++
>   io_uring/io_uring.c           |   2 +
>   io_uring/opdef.c              |  29 +++++++++
>   io_uring/rw.c                 | 108 +++++++++++++++++++++++++++++-----
>   io_uring/rw.h                 |   8 +++
>   11 files changed, 193 insertions(+), 27 deletions(-)
> 
> 
> base-commit: 6f0974eccbf78baead1735722c4f1ee3eb9422cd
Kanchan Joshi April 25, 2024, 7:05 p.m. UTC | #4
On 4/7/2024 3:00 AM, Pavel Begunkov wrote:
> On 3/22/24 18:50, Kanchan Joshi wrote:
>> This patchset is aimed at getting the feedback on a new io_uring
>> interface that userspace can use to exchange meta buffer along with
>> read/write.
>>
>> Two new opcodes for that: IORING_OP_READ_META and IORING_OP_WRITE_META.
>> The leftover space in the SQE is used to send meta buffer pointer
>> and its length. Patch #2 for this.
> 
> I do remember there were back and forth design discussions about that
> back when some other guy attempted to implement it, but have you tried
> to do it not as a separate opcode?

Did not try that in the first cut, thinking it would help in not 
touching the hot (non-meta) io path. But open to this.

> It reads like all read/write opcodes might benefit from it, and it'd
> be unfortunate to then be adding IORING_OP_READ_META_FIXED and
> multiplicatively all other variants.

Right, that's a good point.