
[0/6] ublk zero-copy support

Message ID 20250203154517.937623-1-kbusch@meta.com (mailing list archive)
State New

Commit Message

Keith Busch Feb. 3, 2025, 3:45 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

This is a new look at supporting zero copy with ublk.

The previous version from Ming can be viewed here:

  https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/

Based on the feedback from that thread, the desired io_uring interfaces
needed to be simpler, and the kernel registered resources need to behave
more similarly to user registered buffers.

This series introduces a new resource node type, KBUF, which, like the
BUFFER resource, needs to be installed into an io_uring buf_node table
in order for the user to access it in a fixed buffer command. The
new io_uring kernel API provides a way for a user to register a struct
request's bvec to a specific index, and a way to unregister it.
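
For the driver side, the shape of that API is roughly the following
sketch. The names and signatures here are illustrative assumptions;
the real interface is what patch 2, "io_uring: add support for kernel
registered bvecs", defines:

  /* Install rq's bio_vec array as a KBUF node at buf_table[index];
   * release() is invoked once the node's last reference is dropped. */
  int io_buffer_register_bvec(struct io_ring_ctx *ctx, struct request *rq,
			      void (*release)(void *), unsigned int index);

  /* Clear the slot; the node is freed once in-flight users drop it. */
  void io_buffer_unregister_bvec(struct io_ring_ctx *ctx, unsigned int index);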

When the ublk server receives notification of a new command, it must
first select an index and register the zero copy buffer. It may use that
index for any number of fixed buffer commands, then it must unregister
the index when it's done. This can all be done in a single io_uring_enter
if desired, or it can be split into multiple enters if needed.
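
In code, a single IO ends up looking roughly like the sketch below,
condensed from the ublksrv patch at the end of this letter (the prep
helpers and the fixed-file indexes are the ones that patch uses):

  struct io_uring_sqe *reg, *rw, *unreg;

  ublk_get_sqe_three(q->ring_ptr, &reg, &rw, &unreg);

  /* 1) Register the request's kernel buffer at buf_table[tag]. */
  io_uring_prep_buf_register(reg, 0 /* ublk char dev */, tag, q->q_id, tag);

  /* 2) Issue any number of fixed-buffer commands against that index. */
  io_uring_prep_read_fixed(rw, 1 /* backing file */, 0,
		iod->nr_sectors << 9, iod->start_sector << 9, tag);
  io_uring_sqe_set_flags(rw, IOSQE_FIXED_FILE);

  /* 3) Unregister; the buffer lives until in-flight users release it. */
  io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, tag);

All three SQEs can be submitted with one io_uring_enter() call.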

The io_uring instance that gets the zero copy registration doesn't
necessarily need to be the same ring that is receiving notifications from
the ublk_drv module. This allows you to split frontend and backend rings
if desired.

At the end of this cover letter, I've provided a patch to ublksrv to
demonstrate how to use this.

Jens Axboe (1):
  io_uring: use node for import

Keith Busch (5):
  block: const blk_rq_nr_phys_segments request
  io_uring: add support for kernel registered bvecs
  ublk: zc register/unregister bvec
  io_uring: add abstraction for buf_table rsrc data
  io_uring: cache nodes and mapped buffers

 drivers/block/ublk_drv.c       | 139 +++++++++++++-----
 include/linux/blk-mq.h         |   2 +-
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |  25 +++-
 include/uapi/linux/ublk_cmd.h  |   4 +
 io_uring/fdinfo.c              |   8 +-
 io_uring/filetable.c           |   2 +-
 io_uring/net.c                 |   5 +-
 io_uring/nop.c                 |   2 +-
 io_uring/register.c            |   2 +-
 io_uring/rsrc.c                | 259 ++++++++++++++++++++++++++-------
 io_uring/rsrc.h                |   8 +-
 io_uring/rw.c                  |   4 +-
 io_uring/uring_cmd.c           |   4 +-
 14 files changed, 351 insertions(+), 114 deletions(-)

Comments

Keith Busch Feb. 6, 2025, 3:28 p.m. UTC | #1
On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> This is a new look at supporting zero copy with ublk.

Just to give some numbers behind this, since I didn't include any in the
original cover letter. The numbers are from a 1.6GHz Xeon platform.

Using the ublksrv patch I provided in the cover letter, I created two ublk
block devices with null_blk backings:

  ublk add -t loop -f /dev/nullb0
  ublk add -t loop -f /dev/nullb1 -z

Using t/io_uring, comparing the ublk device without zero-copy vs the one
with zero-copy (-z) enabled:

4k read:
 Legacy:
  IOPS=387.78K, BW=1514MiB/s, IOS/call=32/32
  IOPS=395.14K, BW=1543MiB/s, IOS/call=32/32
  IOPS=395.68K, BW=1545MiB/s, IOS/call=32/31

 Zero-copy:
  IOPS=482.69K, BW=1885MiB/s, IOS/call=32/31
  IOPS=481.34K, BW=1880MiB/s, IOS/call=32/32
  IOPS=481.66K, BW=1881MiB/s, IOS/call=32/32

64k read:
 Legacy:
  IOPS=73248, BW=4.58GiB/s, IOS/call=32/32
  IOPS=73664, BW=4.60GiB/s, IOS/call=32/32
  IOPS=72288, BW=4.52GiB/s, IOS/call=32/32

 Zero-copy:
  IOPS=381.76K, BW=23.86GiB/s, IOS/call=32/31
  IOPS=378.18K, BW=23.64GiB/s, IOS/call=32/32
  IOPS=379.52K, BW=23.72GiB/s, IOS/call=32/32

The register/unregister overhead is low enough to show a decent
improvement even at 4k IO. And it's using half the memory with lower CPU
utilization per IO, so all good wins.
Ming Lei Feb. 7, 2025, 3:51 a.m. UTC | #2
On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> This is a new look at supporting zero copy with ublk.
> 
> The previous version from Ming can be viewed here:
> 
>   https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
> 
> Based on the feedback from that thread, the desired io_uring interfaces
> needed to be simpler, and the kernel registered resources need to behave
> more similarly to user registered buffers.
> 
> This series introduces a new resource node type, KBUF, which, like the
> BUFFER resource, needs to be installed into an io_uring buf_node table
> in order for the user to access it in a fixed buffer command. The
> new io_uring kernel API provides a way for a user to register a struct
> request's bvec to a specific index, and a way to unregister it.
> 
> When the ublk server receives notification of a new command, it must
> first select an index and register the zero copy buffer. It may use that
> index for any number of fixed buffer commands, then it must unregister
> the index when it's done. This can all be done in a single io_uring_enter
> if desired, or it can be split into multiple enters if needed.

I suspect it may not be done in a single io_uring_enter() because there
is a strict dependency among the three OPs (register buffer, read/write,
unregister buffer).
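
For reference, the strictly ordered chain I mean would look something
like the sketch below, reusing the prep helpers from the appended
ublksrv patch; the IOSQE_IO_LINK flags are the point:

  io_uring_prep_buf_register(reg, 0, tag, q_id, tag);
  reg->flags |= IOSQE_IO_LINK;	/* read must wait for registration */

  io_uring_prep_read_fixed(read, 1, 0, iod->nr_sectors << 9,
		iod->start_sector << 9, tag);
  io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE | IOSQE_IO_LINK);

  /* runs only after the read completes */
  io_uring_prep_buf_unregister(ureg, 0, tag, q_id, tag);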

> 
> The io_uring instance that gets the zero copy registration doesn't
> necessarily need to be the same ring that is receiving notifications from
> the ublk_drv module. This allows you to split frontend and backend rings
> if desired.
> 
> At the end of this cover letter, I've provided a patch to ublksrv to
> demonstrate how to use this.
> 
> Jens Axboe (1):
>   io_uring: use node for import
> 
> Keith Busch (5):
>   block: const blk_rq_nr_phys_segments request
>   io_uring: add support for kernel registered bvecs
>   ublk: zc register/unregister bvec
>   io_uring: add abstraction for buf_table rsrc data
>   io_uring: cache nodes and mapped buffers
> 
>  drivers/block/ublk_drv.c       | 139 +++++++++++++-----
>  include/linux/blk-mq.h         |   2 +-
>  include/linux/io_uring.h       |   1 +
>  include/linux/io_uring_types.h |  25 +++-
>  include/uapi/linux/ublk_cmd.h  |   4 +
>  io_uring/fdinfo.c              |   8 +-
>  io_uring/filetable.c           |   2 +-
>  io_uring/net.c                 |   5 +-
>  io_uring/nop.c                 |   2 +-
>  io_uring/register.c            |   2 +-
>  io_uring/rsrc.c                | 259 ++++++++++++++++++++++++++-------
>  io_uring/rsrc.h                |   8 +-
>  io_uring/rw.c                  |   4 +-
>  io_uring/uring_cmd.c           |   4 +-
>  14 files changed, 351 insertions(+), 114 deletions(-)
> 
> -- 
> 2.43.5
> 
> ublksrv:
> 
> https://github.com/ublk-org/ublksrv
> 
> ---
>  include/ublk_cmd.h    |  4 +++
>  include/ublksrv_tgt.h | 13 ++++++++
>  lib/ublksrv.c         |  9 ++++++
>  tgt_loop.cpp          | 74 +++++++++++++++++++++++++++++++++++++++++--
>  ublksrv_tgt.cpp       |  2 +-
>  5 files changed, 99 insertions(+), 3 deletions(-)
> 
> diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
> index 0150003..07439be 100644
> --- a/include/ublk_cmd.h
> +++ b/include/ublk_cmd.h
> @@ -94,6 +94,10 @@
>  	_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
>  #define	UBLK_U_IO_NEED_GET_DATA		\
>  	_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
> +#define UBLK_U_IO_REGISTER_IO_BUF	\
> +	_IOWR('u', 0x23, struct ublksrv_io_cmd)
> +#define UBLK_U_IO_UNREGISTER_IO_BUF	\
> +	_IOWR('u', 0x24, struct ublksrv_io_cmd)
>  
>  /* only ABORT means that no re-fetch */
>  #define UBLK_IO_RES_OK			0
> diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
> index 1deee2b..6291531 100644
> --- a/include/ublksrv_tgt.h
> +++ b/include/ublksrv_tgt.h
> @@ -189,4 +189,17 @@ static inline void ublk_get_sqe_pair(struct io_uring *r,
>  		*sqe2 = io_uring_get_sqe(r);
>  }
>  
> +static inline void ublk_get_sqe_three(struct io_uring *r,
> +		struct io_uring_sqe **sqe1, struct io_uring_sqe **sqe2,
> +		struct io_uring_sqe **sqe3)
> +{
> +	unsigned left = io_uring_sq_space_left(r);
> +
> +	if (left < 3)
> +		io_uring_submit(r);
> +
> +	*sqe1 = io_uring_get_sqe(r);
> +	*sqe2 = io_uring_get_sqe(r);
> +	*sqe3 = io_uring_get_sqe(r);
> +}
>  #endif
> diff --git a/lib/ublksrv.c b/lib/ublksrv.c
> index 16a9e13..7205247 100644
> --- a/lib/ublksrv.c
> +++ b/lib/ublksrv.c
> @@ -619,6 +619,15 @@ skip_alloc_buf:
>  		goto fail;
>  	}
>  
> +	if (ctrl_dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
> +		ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
> +		if (ret) {
> 			ublk_err("ublk dev %d queue %d register sparse buffers failed %d",
> +					q->dev->ctrl_dev->dev_info.dev_id, q->q_id, ret);
> +			goto fail;
> +		}
> +	}
> +
>  	io_uring_register_ring_fd(&q->ring);
>  
>  	/*
> diff --git a/tgt_loop.cpp b/tgt_loop.cpp
> index 0f16676..ce44c7d 100644
> --- a/tgt_loop.cpp
> +++ b/tgt_loop.cpp
> @@ -246,12 +246,62 @@ static inline int loop_fallocate_mode(const struct ublksrv_io_desc *iod)
>         return mode;
>  }
>  
> +static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
> +		int dev_fd, int tag, int q_id, __u64 index)
> +{
> +	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> +
> +	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> +	sqe->opcode		= IORING_OP_URING_CMD;
> +	sqe->flags		|= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;

IOSQE_IO_LINK is missing, because the following buffer consumer OP
has to be issued after this buffer register OP is completed.

> +	sqe->cmd_op		= UBLK_U_IO_REGISTER_IO_BUF;
> +
> +	cmd->tag		= tag;
> +	cmd->addr		= index;
> +	cmd->q_id		= q_id;
> +}
> +
> +static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
> +		int dev_fd, int tag, int q_id, __u64 index)
> +{
> +	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> +
> +	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> +	sqe->opcode		= IORING_OP_URING_CMD;
> +	sqe->flags		|= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
> +	sqe->cmd_op		= UBLK_U_IO_UNREGISTER_IO_BUF;

IOSQE_IO_LINK is missing, because the buffer un-register OP has to be
issued after the previous buffer consumer OP is completed.

> +
> +	cmd->tag		= tag;
> +	cmd->addr		= index;
> +	cmd->q_id		= q_id;
> +}
> +
>  static void loop_queue_tgt_read(const struct ublksrv_queue *q,
>  		const struct ublksrv_io_desc *iod, int tag)
>  {
> +	const struct ublksrv_ctrl_dev_info *info =
> +		ublksrv_ctrl_get_dev_info(ublksrv_get_ctrl_dev(q->dev));
>  	unsigned ublk_op = ublksrv_get_op(iod);
>  
> -	if (user_copy) {
> +	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) {
> +		struct io_uring_sqe *reg;
> +		struct io_uring_sqe *read;
> +		struct io_uring_sqe *ureg;
> +
> +		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
> +
> +		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> +
> +		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> +			0,
> +			iod->nr_sectors << 9,
> +			iod->start_sector << 9,
> +			tag);
> +		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> +		read->user_data = build_user_data(tag, ublk_op, 0, 1);

Does this interface support reading into a partial buffer? That is
useful for stacking device cases.

Also, does this interface support consuming the buffer from multiple
OPs concurrently?


Thanks, 
Ming
Keith Busch Feb. 7, 2025, 2:06 p.m. UTC | #3
On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> > 
> > The previous version from Ming can be viewed here:
> > 
> >   https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
> > 
> > Based on the feedback from that thread, the desired io_uring interfaces
> > needed to be simpler, and the kernel registered resources need to behave
> > more similarly to user registered buffers.
> > 
> > This series introduces a new resource node type, KBUF, which, like the
> > BUFFER resource, needs to be installed into an io_uring buf_node table
> > in order for the user to access it in a fixed buffer command. The
> > new io_uring kernel API provides a way for a user to register a struct
> > request's bvec to a specific index, and a way to unregister it.
> > 
> > When the ublk server receives notification of a new command, it must
> > first select an index and register the zero copy buffer. It may use that
> > index for any number of fixed buffer commands, then it must unregister
> > the index when it's done. This can all be done in a single io_uring_enter
> > if desired, or it can be split into multiple enters if needed.
> 
> I suspect it may not be done in a single io_uring_enter() because there
> is a strict dependency among the three OPs (register buffer, read/write,
> unregister buffer).

The registration is synchronous. io_uring completes the SQE entirely
before it even looks at the read command in the next SQE.

The read or write is asynchronous, but its prep takes a reference on
the node before moving on to the next SQE.

The unregister is synchronous, and clears the index node, but the
possibly inflight read or write has a reference on that node, so all
good.
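
As a simplified model of the node lifetime (this is the pattern, not
the actual io_uring code):

  /* register (synchronous): the table slot takes the initial ref */
  table[index] = node;			/* refs == 1 */

  /* read/write prep: grabs a reference before going async */
  refcount_inc(&node->refs);		/* refs == 2 */

  /* unregister (synchronous): clear the slot, drop its reference */
  table[index] = NULL;
  refcount_dec(&node->refs);		/* refs == 1, still alive */

  /* async read/write completion: the last reference frees the node */
  if (refcount_dec_and_test(&node->refs))
	free_node(node);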

> > +		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
> > +
> > +		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> > +
> > +		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> > +			0,
> > +			iod->nr_sectors << 9,
> > +			iod->start_sector << 9,
> > +			tag);
> > +		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> > +		read->user_data = build_user_data(tag, ublk_op, 0, 1);
> 
> Does this interface support reading into a partial buffer? That is
> useful for stacking device cases.

Are you wanting to read into this buffer in parts, without copying? As
in, provide an offset and/or a smaller length across multiple commands?
If that's what you mean, then yes, you can do that here.
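
Something like the sketch below. One caveat: treating the addr argument
as a byte offset into the kernel-registered buffer is shorthand on my
part; the appended ublksrv patch only ever passes 0 to read from the
start of the buffer.

  unsigned int len = iod->nr_sectors << 9;
  unsigned int half = len / 2;

  /* first half of the registered kernel buffer */
  io_uring_prep_read_fixed(sqe1, 1, (void *)0, half,
		iod->start_sector << 9, tag);
  /* second half, as a separate command against the same index */
  io_uring_prep_read_fixed(sqe2, 1, (void *)(uintptr_t)half, len - half,
		(iod->start_sector << 9) + half, tag);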
 
> Also, does this interface support consuming the buffer from multiple
> OPs concurrently?

You can register as many kernel buffers from as many OPs as you have
space for in your table, and you can use them all concurrently. Pretty
much the same as user registered fixed buffers. The main difference from
user buffers is how you register them.
Bernd Schubert Feb. 8, 2025, 12:51 a.m. UTC | #4
Hi Keith,

On 2/3/25 16:45, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
> 
> This is a new look at supporting zero copy with ublk.

I will try to look at it over the weekend. Could you please keep me in the
loop for future versions?


Thanks,
Bernd
Ming Lei Feb. 8, 2025, 5:44 a.m. UTC | #5
On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> > > 
> > > The previous version from Ming can be viewed here:
> > > 
> > >   https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
> > > 
> > > Based on the feedback from that thread, the desired io_uring interfaces
> > > needed to be simpler, and the kernel registered resources need to behave
> > > more similarly to user registered buffers.
> > > 
> > > This series introduces a new resource node type, KBUF, which, like the
> > > BUFFER resource, needs to be installed into an io_uring buf_node table
> > > in order for the user to access it in a fixed buffer command. The
> > > new io_uring kernel API provides a way for a user to register a struct
> > > request's bvec to a specific index, and a way to unregister it.
> > > 
> > > When the ublk server receives notification of a new command, it must
> > > first select an index and register the zero copy buffer. It may use that
> > > index for any number of fixed buffer commands, then it must unregister
> > > the index when it's done. This can all be done in a single io_uring_enter
> > > if desired, or it can be split into multiple enters if needed.
> > 
> > I suspect it may not be done in a single io_uring_enter() because there
> > is a strict dependency among the three OPs (register buffer, read/write,
> > unregister buffer).
> 
> The registration is synchronous. io_uring completes the SQE entirely
> before it even looks at the read command in the next SQE.

Can you explain a bit what "synchronous" means here?

In patch 4, two ublk uring_cmds (UBLK_U_IO_REGISTER_IO_BUF/UBLK_U_IO_UNREGISTER_IO_BUF)
are added, and their handlers are called from uring_cmd's ->issue().

> 
> The read or write is asynchronous, but its prep takes a reference on
> the node before moving on to the next SQE.

The buffer is registered in ->issue() of UBLK_U_IO_REGISTER_IO_BUF,
and that isn't done yet when ->prep() of read_fixed/write_fixed is
called, which is where the buffer is looked up.

> 
> The unregister is synchronous, and clears the index node, but the
> possibly inflight read or write has a reference on that node, so all
> good.

UBLK_U_IO_UNREGISTER_IO_BUF tells ublk that the buffer isn't used any
more, but it may still be in use by the async read/write.

It might work, but it looks a bit fragile. For example:

A buggy application may panic the kernel if the IO command is completed
before the read/write is done.

> 
> > > +		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
> > > +
> > > +		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> > > +
> > > +		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> > > +			0,
> > > +			iod->nr_sectors << 9,
> > > +			iod->start_sector << 9,
> > > +			tag);
> > > +		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> > > +		read->user_data = build_user_data(tag, ublk_op, 0, 1);
> > 
> > Does this interface support reading into a partial buffer? That is
> > useful for stacking device cases.
> 
> Are you wanting to read into this buffer in parts, without copying? As
> in, provide an offset and/or a smaller length across multiple commands?
> If that's what you mean, then yes, you can do that here.

OK.

>  
> > Also, does this interface support consuming the buffer from multiple
> > OPs concurrently?
> 
> You can register as many kernel buffers from as many OPs as you have
> space for in your table, and you can use them all concurrently. Pretty
> much the same as user registered fixed buffers. The main difference from
> user buffers is how you register them.

Here it depends on whether a LINK between buffer register and read/write
is required. If it is, multiple OPs consuming the buffer have to be
linked one by one, and then they can't be issued concurrently.


Thanks,
Ming
Ming Lei Feb. 8, 2025, 7:52 a.m. UTC | #6
On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:

...

> > Does this interface support reading into a partial buffer? That is
> > useful for stacking device cases.
> 
> Are you wanting to read into this buffer in parts, without copying? As
> in, provide an offset and/or a smaller length across multiple commands?
> If that's what you mean, then yes, you can do that here.

Sorry, I forgot another problem here.

For a short read, we need to zero the remaining bytes in each part of
the buffer. How do we deal with that?

Please see io_req_zero_remained() in my patch "[PATCH V10 10/12] io_uring:
support leased group buffer with REQ_F_GROUP_BUF":

https://lore.kernel.org/linux-block/20241107110149.890530-11-ming.lei@redhat.com/#r
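
The idea, sketched with the generic iov_iter helpers (illustrative
only, not code from either patch set):

  struct iov_iter iter;

  /* walk the same bvec array the buffer was registered from */
  iov_iter_bvec(&iter, ITER_DEST, bvec, nr_bvecs, total_len);
  iov_iter_advance(&iter, bytes_read);		/* skip what was filled */
  iov_iter_zero(total_len - bytes_read, &iter);	/* zero the remainder */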



Thanks,
Ming
Pavel Begunkov Feb. 8, 2025, 2:16 p.m. UTC | #7
On 2/8/25 05:44, Ming Lei wrote:
> On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
>> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
>>> On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
>>>>
>>>> The previous version from Ming can be viewed here:
>>>>
>>>>    https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
>>>>
>>>> Based on the feedback from that thread, the desired io_uring interfaces
>>>> needed to be simpler, and the kernel registered resources need to behave
>>>> more similarly to user registered buffers.
>>>>
>>>> This series introduces a new resource node type, KBUF, which, like the
>>>> BUFFER resource, needs to be installed into an io_uring buf_node table
>>>> in order for the user to access it in a fixed buffer command. The
>>>> new io_uring kernel API provides a way for a user to register a struct
>>>> request's bvec to a specific index, and a way to unregister it.
>>>>
>>>> When the ublk server receives notification of a new command, it must
>>>> first select an index and register the zero copy buffer. It may use that
>>>> index for any number of fixed buffer commands, then it must unregister
>>>> the index when it's done. This can all be done in a single io_uring_enter
>>>> if desired, or it can be split into multiple enters if needed.
>>>
>>> I suspect it may not be done in a single io_uring_enter() because there
>>> is a strict dependency among the three OPs (register buffer, read/write,
>>> unregister buffer).
>>
>> The registration is synchronous. io_uring completes the SQE entirely
>> before it even looks at the read command in the next SQE.
> 
> Can you explain a bit what "synchronous" means here?

I'd take "synchronous" here to mean "executed during submission, from
the submit syscall path". And I agree that we can't rely on that.
That's an implementation detail that io_uring doesn't promise, and even
now it relies on not using certain features like drain and the async
flag.
  
> In patch 4, two ublk uring_cmds (UBLK_U_IO_REGISTER_IO_BUF/UBLK_U_IO_UNREGISTER_IO_BUF)
> are added, and their handlers are called from uring_cmd's ->issue().
> 
>>
>> The read or write is asynchronous, but its prep takes a reference on
>> the node before moving on to the next SQE.
> 
> The buffer is registered in ->issue() of UBLK_U_IO_REGISTER_IO_BUF,
> and that isn't done yet when ->prep() of read_fixed/write_fixed is
> called, which is where the buffer is looked up.

I believe we should eventually move all such binding to ->issue
to be consistent with file handling. Not super happy about either
of those, but those are the kinds of problems that come from
supporting links.
Keith Busch Feb. 8, 2025, 8:13 p.m. UTC | #8
On Sat, Feb 08, 2025 at 02:16:15PM +0000, Pavel Begunkov wrote:
> On 2/8/25 05:44, Ming Lei wrote:
> > On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> > > On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > > > On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> > > > > 
> > > > > The previous version from Ming can be viewed here:
> > > > > 
> > > > >    https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
> > > > > 
> > > > > Based on the feedback from that thread, the desired io_uring interfaces
> > > > > needed to be simpler, and the kernel registered resources need to behave
> > > > > more similarly to user registered buffers.
> > > > > 
> > > > > This series introduces a new resource node type, KBUF, which, like the
> > > > > BUFFER resource, needs to be installed into an io_uring buf_node table
> > > > > in order for the user to access it in a fixed buffer command. The
> > > > > new io_uring kernel API provides a way for a user to register a struct
> > > > > request's bvec to a specific index, and a way to unregister it.
> > > > > 
> > > > > When the ublk server receives notification of a new command, it must
> > > > > first select an index and register the zero copy buffer. It may use that
> > > > > index for any number of fixed buffer commands, then it must unregister
> > > > > the index when it's done. This can all be done in a single io_uring_enter
> > > > > if desired, or it can be split into multiple enters if needed.
> > > > 
> > > > I suspect it may not be done in a single io_uring_enter() because there
> > > > is a strict dependency among the three OPs (register buffer, read/write,
> > > > unregister buffer).
> > > 
> > > The registration is synchronous. io_uring completes the SQE entirely
> > > before it even looks at the read command in the next SQE.
> > 
> > Can you explain a bit what "synchronous" means here?
> 
>> I'd take "synchronous" here to mean "executed during submission, from
>> the submit syscall path". And I agree that we can't rely on that.
>> That's an implementation detail that io_uring doesn't promise, and

The commands are processed in order under the ctx's uring_lock. What are
you thinking you might do to make this happen in any different order?
Pavel Begunkov Feb. 8, 2025, 9:40 p.m. UTC | #9
On 2/8/25 20:13, Keith Busch wrote:
> On Sat, Feb 08, 2025 at 02:16:15PM +0000, Pavel Begunkov wrote:
>> On 2/8/25 05:44, Ming Lei wrote:
>>> On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
>>>> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
>>>>> On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
>>>>>>
>>>>>> The previous version from Ming can be viewed here:
>>>>>>
>>>>>>     https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
>>>>>>
>>>>>> Based on the feedback from that thread, the desired io_uring interfaces
>>>>>> needed to be simpler, and the kernel registered resources need to behave
>>>>>> more similiar to user registered buffers.
>>>>>>
>>>>>> This series introduces a new resource node type, KBUF, which, like the
>>>>>> BUFFER resource, needs to be installed into an io_uring buf_node table
>>>>>> in order for the user to access it in a fixed buffer command. The
>>>>>> new io_uring kernel API provides a way for a user to register a struct
>>>>>> request's bvec to a specific index, and a way to unregister it.
>>>>>>
>>>>>> When the ublk server receives notification of a new command, it must
>>>>>> first select an index and register the zero copy buffer. It may use that
>>>>>> index for any number of fixed buffer commands, then it must unregister
>>>>>> the index when it's done. This can all be done in a single io_uring_enter
>>>>>> if desired, or it can be split into multiple enters if needed.
>>>>>
>>>>> I suspect it may not be done in a single io_uring_enter() because there
>>>>> is a strict dependency among the three OPs (register buffer, read/write,
>>>>> unregister buffer).
>>>>
>>>> The registration is synchronous. io_uring completes the SQE entirely
>>>> before it even looks at the read command in the next SQE.
>>>
>>> Can you explain a bit what "synchronous" means here?
>>
>> I'd take "synchronous" here to mean "executed during submission, from
>> the submit syscall path". And I agree that we can't rely on that.
>> That's an implementation detail that io_uring doesn't promise, and
> 
> The commands are processed in order under the ctx's uring_lock. What are
> you thinking you might do to make this happen in any different order?

A bunch of stuff. IOSQE_ASYNC will reorder them. Drain can push it to
a different path with no guarantees about what happens there, even when
you only used drain for some past requests. Or it can get reordered by
racing with draining. Someone floated the (not merged) idea of hybrid
task / sqpoll execution before; things like that might be needed at
some point, and they will reorder requests. Or you might want to
offload more aggressively, e.g. to already waiting tasks or the
thread pool.

Patch

diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
index 0150003..07439be 100644
--- a/include/ublk_cmd.h
+++ b/include/ublk_cmd.h
@@ -94,6 +94,10 @@ 
 	_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
 #define	UBLK_U_IO_NEED_GET_DATA		\
 	_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define UBLK_U_IO_REGISTER_IO_BUF	\
+	_IOWR('u', 0x23, struct ublksrv_io_cmd)
+#define UBLK_U_IO_UNREGISTER_IO_BUF	\
+	_IOWR('u', 0x24, struct ublksrv_io_cmd)
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
index 1deee2b..6291531 100644
--- a/include/ublksrv_tgt.h
+++ b/include/ublksrv_tgt.h
@@ -189,4 +189,17 @@  static inline void ublk_get_sqe_pair(struct io_uring *r,
 		*sqe2 = io_uring_get_sqe(r);
 }
 
+static inline void ublk_get_sqe_three(struct io_uring *r,
+		struct io_uring_sqe **sqe1, struct io_uring_sqe **sqe2,
+		struct io_uring_sqe **sqe3)
+{
+	unsigned left = io_uring_sq_space_left(r);
+
+	if (left < 3)
+		io_uring_submit(r);
+
+	*sqe1 = io_uring_get_sqe(r);
+	*sqe2 = io_uring_get_sqe(r);
+	*sqe3 = io_uring_get_sqe(r);
+}
 #endif
diff --git a/lib/ublksrv.c b/lib/ublksrv.c
index 16a9e13..7205247 100644
--- a/lib/ublksrv.c
+++ b/lib/ublksrv.c
@@ -619,6 +619,15 @@  skip_alloc_buf:
 		goto fail;
 	}
 
+	if (ctrl_dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
+		if (ret) {
+			ublk_err("ublk dev %d queue %d register spare buffers failed %d",
+					q->dev->ctrl_dev->dev_info.dev_id, q->q_id, ret);
+			goto fail;
+		}
+	}
+
 	io_uring_register_ring_fd(&q->ring);
 
 	/*
diff --git a/tgt_loop.cpp b/tgt_loop.cpp
index 0f16676..ce44c7d 100644
--- a/tgt_loop.cpp
+++ b/tgt_loop.cpp
@@ -246,12 +246,62 @@  static inline int loop_fallocate_mode(const struct ublksrv_io_desc *iod)
        return mode;
 }
 
+static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
+		int dev_fd, int tag, int q_id, __u64 index)
+{
+	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
+
+	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
+	sqe->opcode		= IORING_OP_URING_CMD;
+	sqe->flags		|= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
+	sqe->cmd_op		= UBLK_U_IO_REGISTER_IO_BUF;
+
+	cmd->tag		= tag;
+	cmd->addr		= index;
+	cmd->q_id		= q_id;
+}
+
+static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
+		int dev_fd, int tag, int q_id, __u64 index)
+{
+	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
+
+	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
+	sqe->opcode		= IORING_OP_URING_CMD;
+	sqe->flags		|= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
+	sqe->cmd_op		= UBLK_U_IO_UNREGISTER_IO_BUF;
+
+	cmd->tag		= tag;
+	cmd->addr		= index;
+	cmd->q_id		= q_id;
+}
+
 static void loop_queue_tgt_read(const struct ublksrv_queue *q,
 		const struct ublksrv_io_desc *iod, int tag)
 {
+	const struct ublksrv_ctrl_dev_info *info =
+		ublksrv_ctrl_get_dev_info(ublksrv_get_ctrl_dev(q->dev));
 	unsigned ublk_op = ublksrv_get_op(iod);
 
-	if (user_copy) {
+	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		struct io_uring_sqe *reg;
+		struct io_uring_sqe *read;
+		struct io_uring_sqe *ureg;
+
+		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
+
+		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
+
+		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
+			0,
+			iod->nr_sectors << 9,
+			iod->start_sector << 9,
+			tag);
+		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
+		read->user_data = build_user_data(tag, ublk_op, 0, 1);
+
+		io_uring_prep_buf_unregister(ureg, 0, tag, q->q_id, tag);
+	} else if (user_copy) {
 		struct io_uring_sqe *sqe, *sqe2;
 		__u64 pos = ublk_pos(q->q_id, tag, 0);
 		void *buf = ublksrv_queue_get_io_buf(q, tag);
@@ -286,9 +336,29 @@  static void loop_queue_tgt_read(const struct ublksrv_queue *q,
 static void loop_queue_tgt_write(const struct ublksrv_queue *q,
 		const struct ublksrv_io_desc *iod, int tag)
 {
+	const struct ublksrv_ctrl_dev_info *info =
+		ublksrv_ctrl_get_dev_info(ublksrv_get_ctrl_dev(q->dev));
 	unsigned ublk_op = ublksrv_get_op(iod);
 
-	if (user_copy) {
+	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		struct io_uring_sqe *reg;
+		struct io_uring_sqe *write;
+		struct io_uring_sqe *ureg;
+
+		ublk_get_sqe_three(q->ring_ptr, &reg, &write, &ureg);
+
+		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
+
+		io_uring_prep_write_fixed(write, 1 /*fds[1]*/,
+			0,
+			iod->nr_sectors << 9,
+			iod->start_sector << 9,
+			tag);
+		io_uring_sqe_set_flags(write, IOSQE_FIXED_FILE);
+		write->user_data = build_user_data(tag, ublk_op, 0, 1);
+
+		io_uring_prep_buf_unregister(ureg, 0, tag, q->q_id, tag);
+	} else if (user_copy) {
 		struct io_uring_sqe *sqe, *sqe2;
 		__u64 pos = ublk_pos(q->q_id, tag, 0);
 		void *buf = ublksrv_queue_get_io_buf(q, tag);
diff --git a/ublksrv_tgt.cpp b/ublksrv_tgt.cpp
index 8f9cf28..f3ebe14 100644
--- a/ublksrv_tgt.cpp
+++ b/ublksrv_tgt.cpp
@@ -723,7 +723,7 @@  static int cmd_dev_add(int argc, char *argv[])
 			data.tgt_type = optarg;
 			break;
 		case 'z':
-			data.flags |= UBLK_F_SUPPORT_ZERO_COPY;
+			data.flags |= UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_USER_COPY;
 			break;
 		case 'q':
 			data.nr_hw_queues = strtol(optarg, NULL, 10);