[RFC,00/12] io_uring attached nvme queue

Message ID 20230429093925.133327-1-joshi.k@samsung.com (mailing list archive)

Kanchan Joshi April 29, 2023, 9:39 a.m. UTC
This series shows one way to do what the title says.
It puts up a more direct/lean path that enables
 - submission from io_uring SQE to NVMe SQE
 - completion from NVMe CQE to io_uring CQE
essentially cutting out the hoops (involving request/bio) in the nvme io path.

Also, the io_uring ring is not to be shared among application threads.
The application is responsible for building any sharing (if it feels the
need). This means a ring-associated exclusive queue can do away with some
of the synchronization costs that a shared queue incurs.

The primary objective is to further amp up the efficiency of the kernel
io path (towards PCIe gen N, N+1 hardware).
And we are seeing some asks for this too [1].

Building-blocks
===============
At a high level, the series can be divided into the following parts -

1. The nvme driver starts exposing some queue-pairs (SQ+CQ) that can
be attached on demand to other in-kernel users (not just to the
block-layer, which is the only option at the moment).

Example:
insmod nvme.ko poll_queues=1 raw_queues=2

nvme0: 24/0/1/2 default/read/poll queues/raw queues

While the driver registers the other queues with the block-layer,
raw-queues are instead reserved for exclusive attachment by other
in-kernel users. At this point, each raw-queue is interrupt-disabled
(similar to poll_queues). Maybe we need a better name for these
(e.g. app/user queues).
[Refer: patch 2]
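
For a rough idea of the wiring, here is a sketch of how such a parameter
could be declared, mirroring how poll_queues is handled in
drivers/nvme/host/pci.c (the actual patch may differ, e.g. by using the
callback-based io_queue_count_ops):

static unsigned int raw_queues;
module_param(raw_queues, uint, 0444);
MODULE_PARM_DESC(raw_queues,
	"Number of interrupt-disabled queues reserved for in-kernel users.");

static unsigned int nvme_max_io_queues(struct nvme_dev *dev)
{
	/* raw queues come out of the same hardware queue budget */
	return num_possible_cpus() + dev->nr_write_queues +
	       dev->nr_poll_queues + raw_queues;
}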

2. register/unregister queue interface
(a) one for io_uring application to ask for device-queue and register
with the ring. [Refer: patch 4]
(b) another at nvme so that other in-kernel users (io_uring for now) can
ask for a raw-queue. [Refer: patch 3, 5, 6]

The latter returns a qid, which io_uring stores internally (it is not
exposed to user-space) in the ring ctx. At most one queue per ring is
enabled. The ring has no other special properties except the fact that
it stores a qid that it can use exclusively. So the application can
very well use the ring for things other than nvme io.
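
As a rough userspace illustration of (a); the opcode name and argument
layout here are hypothetical, the real uapi is whatever patch 4 defines:

#include <unistd.h>
#include <sys/syscall.h>
#include <liburing.h>

#define IORING_REGISTER_QUEUE	99	/* hypothetical opcode value */

static int register_nvme_queue(struct io_uring *ring, int ng_fd)
{
	/*
	 * Ask the kernel to pick one raw queue of the device behind
	 * ng_fd (e.g. /dev/ng0n1) and bind it to this ring. The qid
	 * stays inside the ring ctx; user-space never sees it.
	 */
	return syscall(__NR_io_uring_register, ring->ring_fd,
		       IORING_REGISTER_QUEUE, &ng_fd, 1);
}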

3. user-interface to send commands down this way
(a) uring-cmd is extended to support a new flag "IORING_URING_CMD_DIRECT"
that the application passes in the SQE. That is all.
(b) the flag goes down to the provider of ->uring_cmd, which may choose
to do things differently based on it (or ignore it).
[Refer: patch 7]
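
For illustration, application-side submission could look roughly like
this; only IORING_URING_CMD_DIRECT is from this series (patch 7), the
rest is the existing nvme char-device uring_cmd ABI. The sketch assumes
a ring created with IORING_SETUP_SQE128 | IORING_SETUP_IOPOLL, nsid 1,
and a 512-byte LBA format:

#include <stdint.h>
#include <string.h>
#include <liburing.h>
#include <linux/nvme_ioctl.h>

static void queue_direct_read(struct io_uring *ring, int ng_fd,
			      void *buf, uint64_t slba, uint32_t nlb)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct nvme_uring_cmd *cmd = (struct nvme_uring_cmd *)sqe->cmd;

	io_uring_prep_rw(IORING_OP_URING_CMD, sqe, ng_fd, NULL, 0, 0);
	sqe->cmd_op = NVME_URING_CMD_IO;
	sqe->uring_cmd_flags = IORING_URING_CMD_DIRECT;	/* the new flag */

	memset(cmd, 0, sizeof(*cmd));
	cmd->opcode   = 0x02;			/* nvme read */
	cmd->nsid     = 1;			/* assumption: namespace 1 */
	cmd->addr     = (uint64_t)(uintptr_t)buf;
	cmd->data_len = nlb * 512;		/* assumption: 512b LBAs */
	cmd->cdw10    = slba & 0xffffffff;	/* starting LBA, low 32 bits */
	cmd->cdw11    = slba >> 32;		/* starting LBA, high 32 bits */
	cmd->cdw12    = nlb - 1;		/* 0-based block count */
}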

4. nvme uring-cmd understands the above flag. It submits the command
into the known pre-registered queue, and completes it (polled-completion)
from there. The transformation from "struct io_uring_cmd" to nvme
command is done directly, without building other intermediate constructs.
[Refer: patch 8, 10, 12]
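
In (heavily simplified) code, the submission side of that is roughly
the below; nvme_submit_to_raw_queue() is an illustrative stand-in for
the SQ-write/doorbell helpers the patches actually add:

static int nvme_uring_cmd_direct(struct io_uring_cmd *ioucmd, int qid)
{
	const struct nvme_uring_cmd *cmd = ioucmd->cmd;
	struct nvme_command c = { };

	c.common.opcode = cmd->opcode;
	c.common.nsid   = cpu_to_le32(cmd->nsid);
	c.common.cdw10  = cpu_to_le32(cmd->cdw10);
	c.common.cdw11  = cpu_to_le32(cmd->cdw11);
	c.common.cdw12  = cpu_to_le32(cmd->cdw12);
	/* ... map the user buffer into PRPs, assign a command id ... */

	nvme_submit_to_raw_queue(qid, &c);	/* write SQE, ring doorbell */
	return -EIOCBQUEUED;	/* completion is reaped later by polling */
}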

Testing and Performance
=======================
fio and t/io_uring are modified to exercise this path.
- fio: new "registerqueues" option
- t/io_uring: new "k" option
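
A fio invocation on the modified branch might look like the below
(option spelling per the feat/rawq-v2 branch linked under Source;
registerqueues is the only new option, cmd_type=nvme is what routes IO
through the /dev/ng char device):

# fio --name=rawq --ioengine=io_uring_cmd --cmd_type=nvme \
      --filename=/dev/ng0n1 --rw=randread --bs=512 --iodepth=64 \
      --hipri=1 --fixedbufs=1 --registerqueues=1 \
      --time_based=1 --runtime=30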

Good part:
2.96M -> 5.02M IOPS

nvme io (without this):
# t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k0 /dev/ng0n1
submitter=0, tid=2922, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=0 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=2.89M, BW=1412MiB/s, IOS/call=2/1
IOPS=2.92M, BW=1426MiB/s, IOS/call=2/2
IOPS=2.96M, BW=1444MiB/s, IOS/call=2/1
Exiting on timeout
Maximum IOPS=2.96M

nvme io (with this):
# t/io_uring -b512 -d64 -c2 -s2 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
submitter=0, tid=2927, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=4.99M, BW=2.43GiB/s, IOS/call=2/1
IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
IOPS=5.02M, BW=2.45GiB/s, IOS/call=2/1
Exiting on timeout
Maximum IOPS=5.02M

Not so good part:
While a single IO is fast this way, we do not have batching abilities
for the multi-io scenario. Plugging, submission batching, and completion
batching are all tied to block-layer constructs. Things should look
better if we could do something about that.
In particular, something is off with the completion-batching.

With -s32 and -c32, the numbers decline:

# t/io_uring -b512 -d64 -c32 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
submitter=0, tid=3674, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=3.70M, BW=1806MiB/s, IOS/call=32/31
IOPS=3.71M, BW=1812MiB/s, IOS/call=32/31
IOPS=3.71M, BW=1812MiB/s, IOS/call=32/32
Exiting on timeout
Maximum IOPS=3.71M

And perf gets restored if we go back to -c2:

# t/io_uring -b512 -d64 -c2 -s32 -p1 -F1 -B1 -O0 -n1 -u1 -r4 -k1 /dev/ng0n1
submitter=0, tid=3677, file=/dev/ng0n1, node=-1
polled=1, fixedbufs=1/0, register_files=1, buffered=1, register_queues=1 QD=64
Engine=io_uring, sq_ring=64, cq_ring=64
IOPS=4.99M, BW=2.44GiB/s, IOS/call=5/5
IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
IOPS=5.02M, BW=2.45GiB/s, IOS/call=5/5
Exiting on timeout
Maximum IOPS=5.02M
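
For context, below is roughly the completion-batching machinery
(struct io_comp_batch) that the blk-mq polled path uses and the direct
path currently bypasses; it is abridged from mainline
drivers/nvme/host/pci.c, with reap_one_cqe() standing in for the CQE
walk in nvme_poll_cq():

struct io_comp_batch iob = { };
struct request *req;

while ((req = reap_one_cqe(nvmeq)) != NULL) {
	/*
	 * Do not complete yet; just link the request into the batch.
	 * Fall back to a one-off completion if batching is not possible.
	 */
	if (!blk_mq_add_to_batch(req, &iob, nvme_req(req)->status,
				 nvme_pci_complete_batch))
		nvme_pci_complete_rq(req);
}
if (iob.complete)
	iob.complete(&iob);	/* one call completes the whole batch */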

Source
======
Kernel: https://github.com/OpenMPDK/linux/tree/feat/directq-v1
fio: https://github.com/OpenMPDK/fio/commits/feat/rawq-v2

Please take a look.

[1]
https://lore.kernel.org/io-uring/24179a47-ab37-fa32-d177-1086668fbd3d@linux.alibaba.com/

Anuj Gupta (5):
  fs, block: interface to register/unregister the raw-queue
  io_uring, fs: plumb support to register/unregister raw-queue
  nvme: wire-up register/unregister queue f_op callback
  block: add mq_ops to submit and complete commands from raw-queue
  pci: modify nvme_setup_prp_simple parameters

Kanchan Joshi (7):
  nvme: refactor nvme_alloc_io_tag_set
  pci: enable "raw_queues = N" module parameter
  pci: implement register/unregister functionality
  io_uring: support for using registered queue in uring-cmd
  nvme: carve out a helper to prepare nvme_command from ioucmd->cmd
  nvme: submission/completion of uring_cmd to/from the registered queue
  pci: implement submission/completion for rawq commands

 drivers/nvme/host/core.c       |  31 ++-
 drivers/nvme/host/fc.c         |   3 +-
 drivers/nvme/host/ioctl.c      | 234 +++++++++++++++----
 drivers/nvme/host/multipath.c  |   2 +
 drivers/nvme/host/nvme.h       |  19 +-
 drivers/nvme/host/pci.c        | 409 +++++++++++++++++++++++++++++++--
 drivers/nvme/host/rdma.c       |   2 +-
 drivers/nvme/host/tcp.c        |   3 +-
 drivers/nvme/target/loop.c     |   3 +-
 fs/file.c                      |  14 ++
 include/linux/blk-mq.h         |   5 +
 include/linux/fs.h             |   4 +
 include/linux/io_uring.h       |   6 +
 include/linux/io_uring_types.h |   3 +
 include/uapi/linux/io_uring.h  |   6 +
 io_uring/io_uring.c            |  60 +++++
 io_uring/uring_cmd.c           |  14 +-
 17 files changed, 739 insertions(+), 79 deletions(-)

Comments

Jens Axboe April 29, 2023, 5:17 p.m. UTC | #1
On 4/29/23 3:39 AM, Kanchan Joshi wrote:
> This series shows one way to do what the title says.
> It puts up a more direct/lean path that enables
>  - submission from io_uring SQE to NVMe SQE
>  - completion from NVMe CQE to io_uring CQE
>
> [...]
>
> Please take a look.

This looks like a great starting point! Unfortunately I won't be at
LSFMM this year to discuss it in person, but I'll be taking a closer
look at this. Some quick initial reactions:

- I'd call them "user" queues rather than raw or whatever, I think that
  more accurately describes what they are for.

- I guess there's no way around needing to pre-allocate these user
  queues, just like we do for polled_queues right now? In terms of user
  API, it'd be nicer if you could just do IORING_REGISTER_QUEUE (insert
  right name here...) and it'd allocate and return you an ID.

- Need to take a look at the uring_cmd stuff again, but would be nice if
  we did not have to add more stuff to fops for this. Maybe we can set
  aside a range of "ioctl" type commands through uring_cmd for this
  instead, and go that way for registering/unregistering queues.

We do have some users that are CPU constrained, and while my testing
easily maxes out a gen2 optane (actually 2 or 3) with the generic IO
path, that's also with all the fat that adds overhead removed. Most
people don't have this luxury, necessarily, or actually need some of
this fat for their monitoring, for example. This would provide a nice
way to have pretty consistent and efficient performance across distro
type configs, which would be great, while still retaining the fattier
bits for "normal" IO.
Kanchan Joshi May 1, 2023, 11:36 a.m. UTC | #2
On Sat, Apr 29, 2023 at 10:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 4/29/23 3:39 AM, Kanchan Joshi wrote:
> > [...]
>
> This looks like a great starting point! Unfortunately I won't be at
> LSFMM this year to discuss it in person, but I'll be taking a closer
> look at this.

That will help, thanks.

> Some quick initial reactions:
>
> - I'd call them "user" queues rather than raw or whatever, I think that
>   more accurately describes what they are for.

Right, that is better.

> - I guess there's no way around needing to pre-allocate these user
>   queues, just like we do for polled_queues right now?

Right, we would need to allocate the nvme sq/cq at the outset.
Changing the count at run-time is a bit murky. I will have another look though.

> In terms of user
>   API, it'd be nicer if you could just do IORING_REGISTER_QUEUE (insert
>   right name here...) and it'd allocate and return you an ID.

But this is the implemented API (new register code in io_uring) in the
patchset at the moment.
So it seems I am missing your point?

> - Need to take a look at the uring_cmd stuff again, but would be nice if
>   we did not have to add more stuff to fops for this. Maybe we can set
>   aside a range of "ioctl" type commands through uring_cmd for this
>   instead, and go that way for registering/unregistering queues.

Yes, I see your point about not having to add new fops.
But a new uring_cmd opcode would live only at the nvme level.
It is a good way to allocate/deallocate an nvme queue, but it cannot
attach that queue to io_uring's ring.
Or do you have a different view? This seems connected to the previous point.

> We do have some users that are CPU constrained, and while my testing
> easily maxes out a gen2 optane (actually 2 or 3) with the generic IO
> path, that's also with all the fat that adds overhead removed. Most
> people don't have this luxury, necessarily, or actually need some of
> this fat for their monitoring, for example. This would provide a nice
> way to have pretty consistent and efficient performance across distro
> type configs, which would be great, while still retaining the fattier
> bits for "normal" IO.
Makes total sense.