Message ID | 20250203154517.937623-1-kbusch@meta.com (mailing list archive)
---|---
State | New
On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> This is a new look at supporting zero copy with ublk.

Just to give some numbers behind this since I didn't in the original cover
letter. The numbers are from a 1.6GHz Xeon platform.

Using the ublksrv patch I provided in the cover letter, I created two ublk
block devices with null_blk backings:

  ublk add -t loop -f /dev/nullb0
  ublk add -t loop -f /dev/nullb1 -z

Using t/io_uring, comparing the ublk device without zero-copy vs the one
with zero-copy (-z) enabled:

4k read:

  Legacy:
    IOPS=387.78K, BW=1514MiB/s, IOS/call=32/32
    IOPS=395.14K, BW=1543MiB/s, IOS/call=32/32
    IOPS=395.68K, BW=1545MiB/s, IOS/call=32/31

  Zero-copy:
    IOPS=482.69K, BW=1885MiB/s, IOS/call=32/31
    IOPS=481.34K, BW=1880MiB/s, IOS/call=32/32
    IOPS=481.66K, BW=1881MiB/s, IOS/call=32/32

64k read:

  Legacy:
    IOPS=73248, BW=4.58GiB/s, IOS/call=32/32
    IOPS=73664, BW=4.60GiB/s, IOS/call=32/32
    IOPS=72288, BW=4.52GiB/s, IOS/call=32/32

  Zero-copy:
    IOPS=381.76K, BW=23.86GiB/s, IOS/call=32/31
    IOPS=378.18K, BW=23.64GiB/s, IOS/call=32/32
    IOPS=379.52K, BW=23.72GiB/s, IOS/call=32/32

The register/unregister overhead is low enough to show a decent improvement
even at 4k IO. And it's using half the memory with lower CPU utilization
per IO, so all good wins.
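The improvement implied by the runs above can be cross-checked directly by averaging the three samples per configuration (a quick illustrative calculation, not part of the original message):

```python
# Average the three quoted IOPS samples per configuration (in KIOPS)
# and compute the zero-copy vs. legacy speedup ratios.
legacy_4k = (387.78 + 395.14 + 395.68) / 3
zc_4k = (482.69 + 481.34 + 481.66) / 3

legacy_64k = (73.248 + 73.664 + 72.288) / 3   # 73248 IOPS etc., as KIOPS
zc_64k = (381.76 + 378.18 + 379.52) / 3

print(f"4k:  {zc_4k / legacy_4k:.2f}x")    # roughly 1.23x
print(f"64k: {zc_64k / legacy_64k:.2f}x")  # roughly 5.2x
```

So the register/unregister cost is amortized even at 4k, while at 64k, where the copy dominates, zero-copy is several times faster.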
On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> This is a new look at supporting zero copy with ublk.
>
> The previous version from Ming can be viewed here:
>
> https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
>
> Based on the feedback from that thread, the desired io_uring interfaces
> needed to be simpler, and the kernel registered resources need to behave
> more similar to user registered buffers.
>
> This series introduces a new resource node type, KBUF, which, like the
> BUFFER resource, needs to be installed into an io_uring buf_node table
> in order for the user to access it in a fixed buffer command. The
> new io_uring kernel API provides a way for a user to register a struct
> request's bvec to a specific index, and a way to unregister it.
>
> When the ublk server receives notification of a new command, it must
> first select an index and register the zero copy buffer. It may use that
> index for any number of fixed buffer commands, then it must unregister
> the index when it's done. This can all be done in a single io_uring_enter
> if desired, or it can be split into multiple enters if needed.

I suspect it may not be done in a single io_uring_enter() because there is
a strict dependency among the three OPs (register buffer, read/write,
unregister buffer).

> The io_uring instance that gets the zero copy registration doesn't
> necessarily need to be the same ring that is receiving notifications from
> the ublk_drv module. This allows you to split frontend and backend rings
> if desired.
>
> At the end of this cover letter, I've provided a patch to the ublksrv to
> demonstrate how to use this.
> Jens Axboe (1):
>   io_uring: use node for import
>
> Keith Busch (5):
>   block: const blk_rq_nr_phys_segments request
>   io_uring: add support for kernel registered bvecs
>   ublk: zc register/unregister bvec
>   io_uring: add abstraction for buf_table rsrc data
>   io_uring: cache nodes and mapped buffers
>
>  drivers/block/ublk_drv.c       | 139 +++++++++++++-----
>  include/linux/blk-mq.h         |   2 +-
>  include/linux/io_uring.h       |   1 +
>  include/linux/io_uring_types.h |  25 +++-
>  include/uapi/linux/ublk_cmd.h  |   4 +
>  io_uring/fdinfo.c              |   8 +-
>  io_uring/filetable.c           |   2 +-
>  io_uring/net.c                 |   5 +-
>  io_uring/nop.c                 |   2 +-
>  io_uring/register.c            |   2 +-
>  io_uring/rsrc.c                | 259 ++++++++++++++++++++++++++-------
>  io_uring/rsrc.h                |   8 +-
>  io_uring/rw.c                  |   4 +-
>  io_uring/uring_cmd.c           |   4 +-
>  14 files changed, 351 insertions(+), 114 deletions(-)
>
> --
> 2.43.5
>
> ublksrv:
>
> https://github.com/ublk-org/ublksrv
>
> ---
>  include/ublk_cmd.h    |  4 +++
>  include/ublksrv_tgt.h | 13 ++++++++
>  lib/ublksrv.c         |  9 ++++++
>  tgt_loop.cpp          | 74 +++++++++++++++++++++++++++++++++++++++++--
>  ublksrv_tgt.cpp       |  2 +-
>  5 files changed, 99 insertions(+), 3 deletions(-)
>
> diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
> index 0150003..07439be 100644
> --- a/include/ublk_cmd.h
> +++ b/include/ublk_cmd.h
> @@ -94,6 +94,10 @@
>  	_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
>  #define UBLK_U_IO_NEED_GET_DATA \
>  	_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
> +#define UBLK_U_IO_REGISTER_IO_BUF \
> +	_IOWR('u', 0x23, struct ublksrv_io_cmd)
> +#define UBLK_U_IO_UNREGISTER_IO_BUF \
> +	_IOWR('u', 0x24, struct ublksrv_io_cmd)
>
>  /* only ABORT means that no re-fetch */
>  #define UBLK_IO_RES_OK 0
> diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
> index 1deee2b..6291531 100644
> --- a/include/ublksrv_tgt.h
> +++ b/include/ublksrv_tgt.h
> @@ -189,4 +189,17 @@ static inline void ublk_get_sqe_pair(struct io_uring *r,
>  	*sqe2 = io_uring_get_sqe(r);
>  }
>
> +static inline void
> +ublk_get_sqe_three(struct io_uring *r,
> +	struct io_uring_sqe **sqe1, struct io_uring_sqe **sqe2,
> +	struct io_uring_sqe **sqe3)
> +{
> +	unsigned left = io_uring_sq_space_left(r);
> +
> +	if (left < 3)
> +		io_uring_submit(r);
> +
> +	*sqe1 = io_uring_get_sqe(r);
> +	*sqe2 = io_uring_get_sqe(r);
> +	*sqe3 = io_uring_get_sqe(r);
> +}
>  #endif
> diff --git a/lib/ublksrv.c b/lib/ublksrv.c
> index 16a9e13..7205247 100644
> --- a/lib/ublksrv.c
> +++ b/lib/ublksrv.c
> @@ -619,6 +619,15 @@ skip_alloc_buf:
>  		goto fail;
>  	}
>
> +	if (ctrl_dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
> +		ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
> +		if (ret) {
> +			ublk_err("ublk dev %d queue %d register spare buffers failed %d",
> +				q->dev->ctrl_dev->dev_info.dev_id, q->q_id, ret);
> +			goto fail;
> +		}
> +	}
> +
>  	io_uring_register_ring_fd(&q->ring);
>
>  	/*
> diff --git a/tgt_loop.cpp b/tgt_loop.cpp
> index 0f16676..ce44c7d 100644
> --- a/tgt_loop.cpp
> +++ b/tgt_loop.cpp
> @@ -246,12 +246,62 @@ static inline int loop_fallocate_mode(const struct ublksrv_io_desc *iod)
>  	return mode;
>  }
>
> +static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
> +		int dev_fd, int tag, int q_id, __u64 index)
> +{
> +	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> +
> +	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> +	sqe->opcode = IORING_OP_URING_CMD;
> +	sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;

IOSQE_IO_LINK is missing here, because the following buffer consumer OP has
to be issued after this buffer register OP is completed.
> +	sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
> +
> +	cmd->tag = tag;
> +	cmd->addr = index;
> +	cmd->q_id = q_id;
> +}
> +
> +static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
> +		int dev_fd, int tag, int q_id, __u64 index)
> +{
> +	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> +
> +	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> +	sqe->opcode = IORING_OP_URING_CMD;
> +	sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
> +	sqe->cmd_op = UBLK_U_IO_UNREGISTER_IO_BUF;

IOSQE_IO_LINK is missing here too, because the buffer unregister OP has to
be issued after the previous buffer consumer OP is completed.

> +
> +	cmd->tag = tag;
> +	cmd->addr = index;
> +	cmd->q_id = q_id;
> +}
> +
>  static void loop_queue_tgt_read(const struct ublksrv_queue *q,
>  		const struct ublksrv_io_desc *iod, int tag)
>  {
> +	const struct ublksrv_ctrl_dev_info *info =
> +		ublksrv_ctrl_get_dev_info(ublksrv_get_ctrl_dev(q->dev));
>  	unsigned ublk_op = ublksrv_get_op(iod);
>
> -	if (user_copy) {
> +	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) {
> +		struct io_uring_sqe *reg;
> +		struct io_uring_sqe *read;
> +		struct io_uring_sqe *ureg;
> +
> +		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
> +
> +		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> +
> +		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> +			0,
> +			iod->nr_sectors << 9,
> +			iod->start_sector << 9,
> +			tag);
> +		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> +		read->user_data = build_user_data(tag, ublk_op, 0, 1);

Does this interface support reading into part of the buffer? That is useful
for stacking device cases.

Also, does this interface support consuming the buffer from multiple OPs
concurrently?

Thanks,
Ming
On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:
> > [...]
> >
> > When the ublk server receives notification of a new command, it must
> > first select an index and register the zero copy buffer. It may use that
> > index for any number of fixed buffer commands, then it must unregister
> > the index when it's done. This can all be done in a single io_uring_enter
> > if desired, or it can be split into multiple enters if needed.
>
> I suspect it may not be done in a single io_uring_enter() because there
> is a strict dependency among the three OPs (register buffer, read/write,
> unregister buffer).

The registration is synchronous. io_uring completes the SQE entirely before
it even looks at the read command in the next SQE.

The read or write is asynchronous, but its prep takes a reference on the
node before moving on to the next SQE.

The unregister is synchronous, and clears the index node, but the possibly
inflight read or write has a reference on that node, so all good.
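The lifecycle Keith describes can be sketched as a small illustrative model (plain Python, not kernel code; the class and method names here are invented for the illustration): registration pins a node in the buffer table, a read's prep takes its own reference, and unregister only drops the table's reference, so an inflight read keeps the buffer alive until it completes.

```python
class Node:
    """Models an io_uring rsrc node holding a kernel-registered buffer."""
    def __init__(self, buf):
        self.buf = buf
        self.refs = 1          # the buf_table's reference
        self.freed = False

    def get(self):
        self.refs += 1
        return self

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True  # buffer handed back to the kernel

class BufTable:
    def __init__(self, slots):
        self.slots = [None] * slots

    def register(self, index, buf):   # synchronous: done before the next SQE
        self.slots[index] = Node(buf)

    def unregister(self, index):      # clears the index, drops the table ref
        node, self.slots[index] = self.slots[index], None
        node.put()

    def start_read(self, index):      # prep grabs a ref before moving on
        return self.slots[index].get()

table = BufTable(4)
table.register(0, b"payload")
inflight = table.start_read(0)   # async read holds its own reference
table.unregister(0)              # index is cleared...
assert not inflight.freed        # ...but the inflight read keeps the buffer
inflight.put()                   # read completes
assert inflight.freed            # only now is the buffer released
```

This is why the three SQEs can sit in one io_uring_enter without links, as long as they are processed in submission order.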
> > +		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
> > +
> > +		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> > +
> > +		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> > +			0,
> > +			iod->nr_sectors << 9,
> > +			iod->start_sector << 9,
> > +			tag);
> > +		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> > +		read->user_data = build_user_data(tag, ublk_op, 0, 1);
>
> Does this interface support reading into part of the buffer? That is
> useful for stacking device cases.

Are you wanting to read into this buffer without copying in parts? As in
provide an offset and/or smaller length across multiple commands? If that's
what you mean, then yes, you can do that here.

> Also, does this interface support consuming the buffer from multiple OPs
> concurrently?

You can register as many kernel buffers from as many OPs as you have space
for in your table, and you can use them all concurrently. Pretty much the
same as user registered fixed buffers. The main difference from user
buffers is how you register them.
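The concurrent-consumer case Keith confirms can be sketched with another small illustrative model (not kernel code; names invented for the illustration): one registered index, several fixed-buffer ops at different offsets and lengths, each holding its own reference while inflight.

```python
# One 8-sector request buffer registered at a single table index.
buf = bytearray(8 << 9)
refs = 1                         # the table's reference from registration

def start_fixed_read(offset, length):
    """Models a fixed-buffer op pinning the node for its own lifetime."""
    global refs
    refs += 1                    # each op takes its own reference
    return (offset, length)

def complete(op, data):
    """Models op completion: fill its slice of the buffer, drop its ref."""
    global refs
    off, length = op
    buf[off:off + length] = data[:length]
    refs -= 1

# Two concurrent partial reads into disjoint halves of the same index:
first = start_fixed_read(0, 4 << 9)
second = start_fixed_read(4 << 9, 4 << 9)
assert refs == 3                 # table + two inflight ops

complete(first, b"a" * (4 << 9))
complete(second, b"b" * (4 << 9))
refs -= 1                        # unregister drops the table's reference
assert refs == 0
assert bytes(buf[:1]) == b"a" and bytes(buf[-1:]) == b"b"
```

Without a LINK requirement between register and consumer ops, the two partial reads can be issued and completed in any order.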
Hi Keith,

On 2/3/25 16:45, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> This is a new look at supporting zero copy with ublk.

Will try to look at it over the weekend. Could you please keep me in the
loop for future versions?

Thanks,
Bernd
On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > [...]
> >
> > I suspect it may not be done in a single io_uring_enter() because there
> > is a strict dependency among the three OPs (register buffer, read/write,
> > unregister buffer).
>
> The registration is synchronous. io_uring completes the SQE entirely
> before it even looks at the read command in the next SQE.

Can you explain a bit what "synchronous" means here?

In patch 4, two ublk uring_cmds
(UBLK_U_IO_REGISTER_IO_BUF/UBLK_U_IO_UNREGISTER_IO_BUF) are added, and
their handlers are called from uring_cmd's ->issue().

> The read or write is asynchronous, but its prep takes a reference on
> the node before moving on to the next SQE.
The buffer is registered in ->issue() of UBLK_U_IO_REGISTER_IO_BUF, and
that isn't done yet when calling ->prep() of read_fixed/write_fixed, in
which the buffer is looked up.

> The unregister is synchronous, and clears the index node, but the
> possibly inflight read or write has a reference on that node, so all
> good.

UBLK_U_IO_UNREGISTER_IO_BUF tells ublk that the buffer isn't used any more,
but it is still being used by the async read/write. It might work, but it
looks a bit fragile; for example, a buggy application may panic the kernel
if the IO command is completed before the read/write is done.

> > > +		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
> > > +
> > > +		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
> > > +
> > > +		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
> > > +			0,
> > > +			iod->nr_sectors << 9,
> > > +			iod->start_sector << 9,
> > > +			tag);
> > > +		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
> > > +		read->user_data = build_user_data(tag, ublk_op, 0, 1);
> >
> > Does this interface support reading into part of the buffer? That is
> > useful for stacking device cases.
>
> Are you wanting to read into this buffer without copying in parts? As in
> provide an offset and/or smaller length across multiple commands? If
> that's what you mean, then yes, you can do that here.

OK.

> > Also, does this interface support consuming the buffer from multiple
> > OPs concurrently?
>
> You can register as many kernel buffers from as many OPs as you have
> space for in your table, and you can use them all concurrently. Pretty
> much the same as user registered fixed buffers. The main difference from
> user buffers is how you register them.

Here it depends on whether a LINK between buffer register and read/write is
required. If it is required, multiple OPs consuming the buffer have to be
linked one by one, and then they can't be issued concurrently.

Thanks,
Ming
On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
> > On Mon, Feb 03, 2025 at 07:45:11AM -0800, Keith Busch wrote:

...

> > Does this interface support reading into part of the buffer? That is
> > useful for stacking device cases.
>
> Are you wanting to read into this buffer without copying in parts? As in
> provide an offset and/or smaller length across multiple commands? If
> that's what you mean, then yes, you can do that here.

Sorry, I forgot another problem here.

For a short read, we need to zero the remaining bytes in each part of the
buffer. How do we deal with that? Please see io_req_zero_remained() in my
patch "[PATCH V10 10/12] io_uring: support leased group buffer with
REQ_F_GROUP_BUF":

https://lore.kernel.org/linux-block/20241107110149.890530-11-ming.lei@redhat.com/#r

Thanks,
Ming
On 2/8/25 05:44, Ming Lei wrote:
> On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
>> On Fri, Feb 07, 2025 at 11:51:49AM +0800, Ming Lei wrote:
>>> [...]
>>>
>>> I suspect it may not be done in a single io_uring_enter() because there
>>> is a strict dependency among the three OPs (register buffer, read/write,
>>> unregister buffer).
>>
>> The registration is synchronous. io_uring completes the SQE entirely
>> before it even looks at the read command in the next SQE.
>
> Can you explain a bit what "synchronous" means here?

I'd believe synchronous here means "executed during submission from the
submit syscall path". And I agree that we can't rely on that. That's an
implementation detail and io_uring doesn't promise that, but even now it
relies on not using certain features like drain and the async flag.
> In patch 4, two ublk uring_cmds
> (UBLK_U_IO_REGISTER_IO_BUF/UBLK_U_IO_UNREGISTER_IO_BUF) are added, and
> their handlers are called from uring_cmd's ->issue().
>
>> The read or write is asynchronous, but its prep takes a reference on
>> the node before moving on to the next SQE.
>
> The buffer is registered in ->issue() of UBLK_U_IO_REGISTER_IO_BUF,
> and it isn't done yet when calling ->prep() of read_fixed/write_fixed,
> in which the buffer is looked up.

I believe we should eventually move all such binding to ->issue to be
consistent with file handling. Not super happy about either of those, but
that's the kind of problem that comes from supporting links.
On Sat, Feb 08, 2025 at 02:16:15PM +0000, Pavel Begunkov wrote:
> On 2/8/25 05:44, Ming Lei wrote:
> > On Fri, Feb 07, 2025 at 07:06:54AM -0700, Keith Busch wrote:
> > > [...]
> > >
> > > The registration is synchronous. io_uring completes the SQE entirely
> > > before it even looks at the read command in the next SQE.
> >
> > Can you explain a bit what "synchronous" means here?
>
> I'd believe synchronous here means "executed during submission from
> the submit syscall path". And I agree that we can't rely on that.
> That's an implementation detail and io_uring doesn't promise that,

The commands are processed in order under the ctx's uring_lock. What are
you thinking you might do to make this happen in any different order?
On 2/8/25 20:13, Keith Busch wrote:
> On Sat, Feb 08, 2025 at 02:16:15PM +0000, Pavel Begunkov wrote:
>> On 2/8/25 05:44, Ming Lei wrote:
>>> [...]
>>>
>>> Can you explain a bit what "synchronous" means here?
>>
>> I'd believe synchronous here means "executed during submission from
>> the submit syscall path". And I agree that we can't rely on that.
>> That's an implementation detail and io_uring doesn't promise that,
>
> The commands are processed in order under the ctx's uring_lock. What are
> you thinking you might do to make this happen in any different order?

Bunch of stuff. IOSQE_ASYNC will reorder them. Drain can push it to a
different path with no guarantees about what happens there, even when you
only used drain for some past requests. Or it can get reordered by racing
with draining.

Someone floated the (not merged) idea before of hybrid task / sqpoll
execution; things like that might be needed at some point, and that will
reorder requests. Or you might want to offload more aggressively, e.g. to
already waiting tasks or the thread pool.
diff --git a/include/ublk_cmd.h b/include/ublk_cmd.h
index 0150003..07439be 100644
--- a/include/ublk_cmd.h
+++ b/include/ublk_cmd.h
@@ -94,6 +94,10 @@
 	_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
 #define UBLK_U_IO_NEED_GET_DATA \
 	_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define UBLK_U_IO_REGISTER_IO_BUF \
+	_IOWR('u', 0x23, struct ublksrv_io_cmd)
+#define UBLK_U_IO_UNREGISTER_IO_BUF \
+	_IOWR('u', 0x24, struct ublksrv_io_cmd)
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK 0
diff --git a/include/ublksrv_tgt.h b/include/ublksrv_tgt.h
index 1deee2b..6291531 100644
--- a/include/ublksrv_tgt.h
+++ b/include/ublksrv_tgt.h
@@ -189,4 +189,17 @@ static inline void ublk_get_sqe_pair(struct io_uring *r,
 	*sqe2 = io_uring_get_sqe(r);
 }
 
+static inline void ublk_get_sqe_three(struct io_uring *r,
+	struct io_uring_sqe **sqe1, struct io_uring_sqe **sqe2,
+	struct io_uring_sqe **sqe3)
+{
+	unsigned left = io_uring_sq_space_left(r);
+
+	if (left < 3)
+		io_uring_submit(r);
+
+	*sqe1 = io_uring_get_sqe(r);
+	*sqe2 = io_uring_get_sqe(r);
+	*sqe3 = io_uring_get_sqe(r);
+}
 #endif
diff --git a/lib/ublksrv.c b/lib/ublksrv.c
index 16a9e13..7205247 100644
--- a/lib/ublksrv.c
+++ b/lib/ublksrv.c
@@ -619,6 +619,15 @@ skip_alloc_buf:
 		goto fail;
 	}
 
+	if (ctrl_dev->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		ret = io_uring_register_buffers_sparse(&q->ring, q->q_depth);
+		if (ret) {
+			ublk_err("ublk dev %d queue %d register spare buffers failed %d",
+				q->dev->ctrl_dev->dev_info.dev_id, q->q_id, ret);
+			goto fail;
+		}
+	}
+
 	io_uring_register_ring_fd(&q->ring);
 
 	/*
diff --git a/tgt_loop.cpp b/tgt_loop.cpp
index 0f16676..ce44c7d 100644
--- a/tgt_loop.cpp
+++ b/tgt_loop.cpp
@@ -246,12 +246,62 @@ static inline int loop_fallocate_mode(const struct ublksrv_io_desc *iod)
 	return mode;
 }
 
+static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
+		int dev_fd, int tag, int q_id, __u64 index)
+{
+	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
+
+	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
+	sqe->opcode = IORING_OP_URING_CMD;
+	sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
+	sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
+
+	cmd->tag = tag;
+	cmd->addr = index;
+	cmd->q_id = q_id;
+}
+
+static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
+		int dev_fd, int tag, int q_id, __u64 index)
+{
+	struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
+
+	io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
+	sqe->opcode = IORING_OP_URING_CMD;
+	sqe->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_FIXED_FILE;
+	sqe->cmd_op = UBLK_U_IO_UNREGISTER_IO_BUF;
+
+	cmd->tag = tag;
+	cmd->addr = index;
+	cmd->q_id = q_id;
+}
+
 static void loop_queue_tgt_read(const struct ublksrv_queue *q,
 		const struct ublksrv_io_desc *iod, int tag)
 {
+	const struct ublksrv_ctrl_dev_info *info =
+		ublksrv_ctrl_get_dev_info(ublksrv_get_ctrl_dev(q->dev));
 	unsigned ublk_op = ublksrv_get_op(iod);
 
-	if (user_copy) {
+	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		struct io_uring_sqe *reg;
+		struct io_uring_sqe *read;
+		struct io_uring_sqe *ureg;
+
+		ublk_get_sqe_three(q->ring_ptr, &reg, &read, &ureg);
+
+		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
+
+		io_uring_prep_read_fixed(read, 1 /*fds[1]*/,
+			0,
+			iod->nr_sectors << 9,
+			iod->start_sector << 9,
+			tag);
+		io_uring_sqe_set_flags(read, IOSQE_FIXED_FILE);
+		read->user_data = build_user_data(tag, ublk_op, 0, 1);
+
+		io_uring_prep_buf_unregister(ureg, 0, tag, q->q_id, tag);
+	} else if (user_copy) {
 		struct io_uring_sqe *sqe, *sqe2;
 		__u64 pos = ublk_pos(q->q_id, tag, 0);
 		void *buf = ublksrv_queue_get_io_buf(q, tag);
@@ -286,9 +336,29 @@ static void loop_queue_tgt_read(const struct ublksrv_queue *q,
 static void loop_queue_tgt_write(const struct ublksrv_queue *q,
 		const struct ublksrv_io_desc *iod, int tag)
 {
+	const struct ublksrv_ctrl_dev_info *info =
+		ublksrv_ctrl_get_dev_info(ublksrv_get_ctrl_dev(q->dev));
 	unsigned ublk_op = ublksrv_get_op(iod);
 
-	if (user_copy) {
+	if (info->flags & UBLK_F_SUPPORT_ZERO_COPY) {
+		struct io_uring_sqe *reg;
+		struct io_uring_sqe *write;
+		struct io_uring_sqe *ureg;
+
+		ublk_get_sqe_three(q->ring_ptr, &reg, &write, &ureg);
+
+		io_uring_prep_buf_register(reg, 0, tag, q->q_id, tag);
+
+		io_uring_prep_write_fixed(write, 1 /*fds[1]*/,
+			0,
+			iod->nr_sectors << 9,
+			iod->start_sector << 9,
+			tag);
+		io_uring_sqe_set_flags(write, IOSQE_FIXED_FILE);
+		write->user_data = build_user_data(tag, ublk_op, 0, 1);
+
+		io_uring_prep_buf_unregister(ureg, 0, tag, q->q_id, tag);
+	} else if (user_copy) {
 		struct io_uring_sqe *sqe, *sqe2;
 		__u64 pos = ublk_pos(q->q_id, tag, 0);
 		void *buf = ublksrv_queue_get_io_buf(q, tag);
diff --git a/ublksrv_tgt.cpp b/ublksrv_tgt.cpp
index 8f9cf28..f3ebe14 100644
--- a/ublksrv_tgt.cpp
+++ b/ublksrv_tgt.cpp
@@ -723,7 +723,7 @@ static int cmd_dev_add(int argc, char *argv[])
 		data.tgt_type = optarg;
 		break;
 	case 'z':
-		data.flags |= UBLK_F_SUPPORT_ZERO_COPY;
+		data.flags |= UBLK_F_SUPPORT_ZERO_COPY | UBLK_F_USER_COPY;
 		break;
 	case 'q':
 		data.nr_hw_queues = strtol(optarg, NULL, 10);
From: Keith Busch <kbusch@kernel.org>

This is a new look at supporting zero copy with ublk.

The previous version from Ming can be viewed here:

https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/

Based on the feedback from that thread, the desired io_uring interfaces
needed to be simpler, and the kernel registered resources need to behave
more similar to user registered buffers.

This series introduces a new resource node type, KBUF, which, like the
BUFFER resource, needs to be installed into an io_uring buf_node table in
order for the user to access it in a fixed buffer command. The new io_uring
kernel API provides a way for a user to register a struct request's bvec to
a specific index, and a way to unregister it.

When the ublk server receives notification of a new command, it must first
select an index and register the zero copy buffer. It may use that index
for any number of fixed buffer commands, then it must unregister the index
when it's done. This can all be done in a single io_uring_enter if desired,
or it can be split into multiple enters if needed.

The io_uring instance that gets the zero copy registration doesn't
necessarily need to be the same ring that is receiving notifications from
the ublk_drv module. This allows you to split frontend and backend rings if
desired.

At the end of this cover letter, I've provided a patch to the ublksrv to
demonstrate how to use this.
Jens Axboe (1):
  io_uring: use node for import

Keith Busch (5):
  block: const blk_rq_nr_phys_segments request
  io_uring: add support for kernel registered bvecs
  ublk: zc register/unregister bvec
  io_uring: add abstraction for buf_table rsrc data
  io_uring: cache nodes and mapped buffers

 drivers/block/ublk_drv.c       | 139 +++++++++++++-----
 include/linux/blk-mq.h         |   2 +-
 include/linux/io_uring.h       |   1 +
 include/linux/io_uring_types.h |  25 +++-
 include/uapi/linux/ublk_cmd.h  |   4 +
 io_uring/fdinfo.c              |   8 +-
 io_uring/filetable.c           |   2 +-
 io_uring/net.c                 |   5 +-
 io_uring/nop.c                 |   2 +-
 io_uring/register.c            |   2 +-
 io_uring/rsrc.c                | 259 ++++++++++++++++++++++++++-------
 io_uring/rsrc.h                |   8 +-
 io_uring/rw.c                  |   4 +-
 io_uring/uring_cmd.c           |   4 +-
 14 files changed, 351 insertions(+), 114 deletions(-)