[V5,16/16] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy

Message ID 20230328150958.1253547-17-ming.lei@redhat.com (mailing list archive)
State New, archived
Series io_uring/ublk: add IORING_OP_FUSED_CMD

Commit Message

Ming Lei March 28, 2023, 3:09 p.m. UTC
Apply the io_uring fused command for supporting zero copy:

1) init the fused cmd buffer (io_mapped_buf) in ublk_map_io(), and deinit
it in ublk_unmap_io(); this buffer is immutable, so it is fine to retrieve
it from concurrent fused commands

2) add the sub-command opcode UBLK_IO_FUSED_SUBMIT_IO for retrieving this
fused cmd (zero copy) buffer

3) call io_fused_cmd_start_secondary_req() to provide the buffer to the
secondary request and submit that request; meanwhile set up the completion
callback via this API, so that once the secondary request is completed, the
callback is called to free the buffer and complete the fused command

Also a request reference is held during the fused command's lifetime, which
guarantees that the request buffer won't be freed until all inflight fused
commands are completed.

userspace (only the sqe128 fused command is implemented):

	https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v5

liburing test (only the normal-SQE fused command is implemented: two 64-byte SQEs):

	https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v5
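
A rough sketch of how a ublk server might wire this up. This is hypothetical:
IORING_OP_FUSED_CMD and the primary/secondary SQE pairing are defined by this
series (see the branches above), not by upstream liburing, and the encoding of
the secondary request's sub-buffer offset/length, sqe flags and user_data are
elided; only ublksrv_io_cmd and UBLK_IO_FUSED_SUBMIT_IO come straight from
ublk_cmd.h:

	struct io_uring_sqe *primary = io_uring_get_sqe(&ring);
	struct io_uring_sqe *secondary = io_uring_get_sqe(&ring);
	struct ublksrv_io_cmd *io_cmd = (struct ublksrv_io_cmd *)&primary->cmd;

	/* primary: fused uring command providing the ublk request buffer */
	primary->opcode = IORING_OP_FUSED_CMD;	/* added by this series */
	primary->fd = cdev_fd;			/* /dev/ublkcN char device */
	primary->cmd_op = UBLK_IO_FUSED_SUBMIT_IO;
	io_cmd->q_id = q_id;
	io_cmd->tag = tag;

	/*
	 * secondary: generic request consuming the provided buffer, here a
	 * write of the ublk WRITE request's data to the backing file
	 */
	io_uring_prep_write(secondary, backing_fd, NULL, io_bytes, io_off);

	io_uring_submit(&ring);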

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
 drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/ublk_cmd.h |   6 +-
 3 files changed, 303 insertions(+), 21 deletions(-)

Comments

Ziyang Zhang March 29, 2023, 2:57 a.m. UTC | #1
On 2023/3/28 23:09, Ming Lei wrote:
> Apply io_uring fused command for supporting zero copy:
> 

[...]

>  
> @@ -1374,7 +1533,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>  	if (!ubq || ub_cmd->q_id != ubq->q_id)
>  		goto out;
>  
> -	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
> +	/*
> +	 * The fused command reads the io buffer data structure only, so it
> +	 * is fine to be issued from other context.
> +	 */
> +	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
> +			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
>  		goto out;
>  

Hi Ming,

What is your use case that fused io_uring cmd is issued from another thread?
I think it is good practice to operate one io_uring instance in one thread
only.

Regards,
Zhang
Ming Lei March 29, 2023, 9 a.m. UTC | #2
On Wed, Mar 29, 2023 at 10:57:53AM +0800, Ziyang Zhang wrote:
> On 2023/3/28 23:09, Ming Lei wrote:
> > Apply io_uring fused command for supporting zero copy:
> > 
> 
> [...]
> 
> >  
> > @@ -1374,7 +1533,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> >  	if (!ubq || ub_cmd->q_id != ubq->q_id)
> >  		goto out;
> >  
> > -	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
> > +	/*
> > +	 * The fused command reads the io buffer data structure only, so it
> > +	 * is fine to be issued from other context.
> > +	 */
> > +	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
> > +			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
> >  		goto out;
> >  
> 
> Hi Ming,
> 
> What is your use case that fused io_uring cmd is issued from another thread?
> I think it is good practice to operate one io_uring instance in one thread
> only.

So far we require that io commands be issued from the queue context,
which is still not friendly from the userspace viewpoint. The reason is
that we can't get an io_uring exit notification, and ublk's use case is
very special since a queued io command may never be completed,
see:

https://lore.kernel.org/linux-fsdevel/ZBxTdCj60+s1aZqA@ovpn-8-16.pek2.redhat.com/

I remember that people raised concern about this implementation.

But for normal IO, it could be issued from io wq simply because of
link(dependency) or whatever, and userspace is still allowed to submit
io from another pthread via same io_uring ctx.


thanks,
Ming
Ziyang Zhang March 29, 2023, 10:01 a.m. UTC | #3
On 2023/3/29 17:00, Ming Lei wrote:
> On Wed, Mar 29, 2023 at 10:57:53AM +0800, Ziyang Zhang wrote:
>> On 2023/3/28 23:09, Ming Lei wrote:
>>> Apply io_uring fused command for supporting zero copy:
>>>
>>
>> [...]
>>
>>>  
>>> @@ -1374,7 +1533,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>>>  	if (!ubq || ub_cmd->q_id != ubq->q_id)
>>>  		goto out;
>>>  
>>> -	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
>>> +	/*
>>> +	 * The fused command reads the io buffer data structure only, so it
>>> +	 * is fine to be issued from other context.
>>> +	 */
>>> +	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
>>> +			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
>>>  		goto out;
>>>  
>>
>> Hi Ming,
>>
>> What is your use case that fused io_uring cmd is issued from another thread?
>> I think it is good practice to operate one io_uring instance in one thread
>> only.
> 
> So far we require that io commands be issued from the queue context,
> which is still not friendly from the userspace viewpoint. The reason is
> that we can't get an io_uring exit notification, and ublk's use case is
> very special since a queued io command may never be completed,

OK, so UBLK_IO_FUSED_SUBMIT_IO is guaranteed to be completed because it is
not queued. FETCH_REQ and COMMIT_AND_FETCH are queued io commands and could
not be completed forever so they have to be issued from ubq_daemon. Right?

BTW, maybe NEED_GET_DATA can be issued from other context...

> see:
> 
> https://lore.kernel.org/linux-fsdevel/ZBxTdCj60+s1aZqA@ovpn-8-16.pek2.redhat.com/
> 
> I remember that people raised concern about this implementation.
> 
> But for normal IO, it could be issued from io wq simply because of
> link(dependency) or whatever, and userspace is still allowed to submit
> io from another pthread via same io_uring ctx.

Yes, we can submit to the same ctx from different pthread but lock may be required.
IMO, users may only choose ubq_daemon as the only submitter.


Regards,
Zhang
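
As a side note on the locking mentioned above: liburing's submission path
(io_uring_get_sqe()/io_uring_submit()) is not thread-safe, so a second
pthread sharing the ring has to serialize against the daemon. A minimal
sketch with standard liburing calls:

	#include <errno.h>
	#include <pthread.h>
	#include <liburing.h>

	static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

	static int submit_from_other_pthread(struct io_uring *ring)
	{
		struct io_uring_sqe *sqe;
		int ret;

		pthread_mutex_lock(&ring_lock);
		sqe = io_uring_get_sqe(ring);
		if (!sqe) {
			pthread_mutex_unlock(&ring_lock);
			return -EAGAIN;
		}
		io_uring_prep_nop(sqe);	/* stand-in for a real request */
		ret = io_uring_submit(ring);
		pthread_mutex_unlock(&ring_lock);
		return ret;
	}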
Ming Lei March 29, 2023, 10:52 a.m. UTC | #4
On Wed, Mar 29, 2023 at 06:01:16PM +0800, Ziyang Zhang wrote:
> On 2023/3/29 17:00, Ming Lei wrote:
> > On Wed, Mar 29, 2023 at 10:57:53AM +0800, Ziyang Zhang wrote:
> >> On 2023/3/28 23:09, Ming Lei wrote:
> >>> Apply io_uring fused command for supporting zero copy:
> >>>
> >>
> >> [...]
> >>
> >>>  
> >>> @@ -1374,7 +1533,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> >>>  	if (!ubq || ub_cmd->q_id != ubq->q_id)
> >>>  		goto out;
> >>>  
> >>> -	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
> >>> +	/*
> >>> +	 * The fused command reads the io buffer data structure only, so it
> >>> +	 * is fine to be issued from other context.
> >>> +	 */
> >>> +	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
> >>> +			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
> >>>  		goto out;
> >>>  
> >>
> >> Hi Ming,
> >>
> >> What is your use case that fused io_uring cmd is issued from another thread?
> >> I think it is good practice to operate one io_uring instance in one thread
> >> only.
> > 
> > So far we require that io commands be issued from the queue context,
> > which is still not friendly from the userspace viewpoint. The reason is
> > that we can't get an io_uring exit notification, and ublk's use case is
> > very special since a queued io command may never be completed,
> 
> OK, so UBLK_IO_FUSED_SUBMIT_IO is guaranteed to be completed because it is
> not queued. FETCH_REQ and COMMIT_AND_FETCH are queued io commands and could
> not be completed forever so they have to be issued from ubq_daemon. Right?

Yeah, any io command should be issued from ubq daemon context.

> 
> BTW, maybe NEED_GET_DATA can be issued from other context...

So far it won't be supported.

As I mentioned in the link, if io_uring can provide io_uring exit
callback, we may relax this limit.

> 
> > see:
> > 
> > https://lore.kernel.org/linux-fsdevel/ZBxTdCj60+s1aZqA@ovpn-8-16.pek2.redhat.com/
> > 
> > I remember that people raised concern about this implementation.
> > 
> > But for normal IO, it could be issued from io wq simply because of
> > link(dependency) or whatever, and userspace is still allowed to submit
> > io from another pthread via same io_uring ctx.
> 
> Yes, we can submit to the same ctx from different pthread but lock may be required.

Right.

> IMO, users may only choose ubq_daemon as the only submitter.

At least any io command should be issued from ubq daemon now, but normal
io can be issued from any context.


Thanks,
Ming
Ziyang Zhang April 3, 2023, 8:38 a.m. UTC | #5
On 2023/3/29 18:52, Ming Lei wrote:
> On Wed, Mar 29, 2023 at 06:01:16PM +0800, Ziyang Zhang wrote:
>> On 2023/3/29 17:00, Ming Lei wrote:
>>> On Wed, Mar 29, 2023 at 10:57:53AM +0800, Ziyang Zhang wrote:
>>>> On 2023/3/28 23:09, Ming Lei wrote:
>>>>> Apply io_uring fused command for supporting zero copy:
>>>>>
>>>>
>>>> [...]
>>>>
>>>>>  
>>>>> @@ -1374,7 +1533,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
>>>>>  	if (!ubq || ub_cmd->q_id != ubq->q_id)
>>>>>  		goto out;
>>>>>  
>>>>> -	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
>>>>> +	/*
>>>>> +	 * The fused command reads the io buffer data structure only, so it
>>>>> +	 * is fine to be issued from other context.
>>>>> +	 */
>>>>> +	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
>>>>> +			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
>>>>>  		goto out;
>>>>>  
>>>>
>>>> Hi Ming,
>>>>
>>>> What is your use case that fused io_uring cmd is issued from another thread?
>>>> I think it is good practice to operate one io_uring instance in one thread
>>>> only.
>>>
>>> So far we require that io commands be issued from the queue context,
>>> which is still not friendly from the userspace viewpoint. The reason is
>>> that we can't get an io_uring exit notification, and ublk's use case is
>>> very special since a queued io command may never be completed,
>>
>> OK, so UBLK_IO_FUSED_SUBMIT_IO is guaranteed to be completed because it is
>> not queued. FETCH_REQ and COMMIT_AND_FETCH are queued io commands and could
>> not be completed forever so they have to be issued from ubq_daemon. Right?
> 
> Yeah, any io command should be issued from ubq daemon context.
> 
>>
>> BTW, maybe NEED_GET_DATA can be issued from other context...
> 
> So far it won't be supported.
> 
> As I mentioned in the link, if io_uring can provide io_uring exit
> callback, we may relax this limit.
> 

Hi, Ming

Sorry, I do not understand... I think UBLK_IO_NEED_GET_DATA is a normal IO just
like UBLK_IO_FUSED_SUBMIT_IO. It is issued from one pthread (ubq_daemon for now)
and is completed right away (not queued). So I think we can allow
UBLK_IO_NEED_GET_DATA to be issued from another context.
Ming Lei April 3, 2023, 9:22 a.m. UTC | #6
On Mon, Apr 03, 2023 at 04:38:30PM +0800, Ziyang Zhang wrote:
> On 2023/3/29 18:52, Ming Lei wrote:
> > On Wed, Mar 29, 2023 at 06:01:16PM +0800, Ziyang Zhang wrote:
> >> On 2023/3/29 17:00, Ming Lei wrote:
> >>> On Wed, Mar 29, 2023 at 10:57:53AM +0800, Ziyang Zhang wrote:
> >>>> On 2023/3/28 23:09, Ming Lei wrote:
> >>>>> Apply io_uring fused command for supporting zero copy:
> >>>>>
> >>>>
> >>>> [...]
> >>>>
> >>>>>  
> >>>>> @@ -1374,7 +1533,12 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
> >>>>>  	if (!ubq || ub_cmd->q_id != ubq->q_id)
> >>>>>  		goto out;
> >>>>>  
> >>>>> -	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
> >>>>> +	/*
> >>>>> +	 * The fused command reads the io buffer data structure only, so it
> >>>>> +	 * is fine to be issued from other context.
> >>>>> +	 */
> >>>>> +	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
> >>>>> +			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
> >>>>>  		goto out;
> >>>>>  
> >>>>
> >>>> Hi Ming,
> >>>>
> >>>> What is your use case that fused io_uring cmd is issued from another thread?
> >>>> I think it is good practice to operate one io_uring instance in one thread
> >>>> only.
> >>>
> >>> So far we require that io commands be issued from the queue context,
> >>> which is still not friendly from the userspace viewpoint. The reason is
> >>> that we can't get an io_uring exit notification, and ublk's use case is
> >>> very special since a queued io command may never be completed,
> >>
> >> OK, so UBLK_IO_FUSED_SUBMIT_IO is guaranteed to be completed because it is
> >> not queued. FETCH_REQ and COMMIT_AND_FETCH are queued io commands and could
> >> not be completed forever so they have to be issued from ubq_daemon. Right?
> > 
> > Yeah, any io command should be issued from ubq daemon context.
> > 
> >>
> >> BTW, maybe NEED_GET_DATA can be issued from other context...
> > 
> > So far it won't be supported.
> > 
> > As I mentioned in the link, if io_uring can provide io_uring exit
> > callback, we may relax this limit.
> > 
> 
> Hi, Ming
> 
> Sorry, I do not understand... I think UBLK_IO_NEED_GET_DATA is a normal IO just
> like UBLK_IO_FUSED_SUBMIT_IO. It is issued from one pthread (ubq_daemon for now)
> and is completed right away (not queued). So I think we can allow
> UBLK_IO_NEED_GET_DATA to be issued from another context.
 
No, it isn't.

UBLK_IO_FUSED_SUBMIT_IO is actually for handling target IO: this command
just reads/provides the IO buffer metadata to io_uring in a read-only
way. The io buffer metadata won't be changed, and no io state is
changed, so it is fine to call it concurrently.

UBLK_IO_NEED_GET_DATA is still one of the io commands: io->addr needs
to be set and io->flags is touched, and that can't be done safely
concurrently.

Also, after zero copy is supported, UBLK_IO_NEED_GET_DATA may become a
legacy code path, because the ublk server can read/write io data
directly in userspace via read()/write(), and there is no buffer
allocation issue any more.


Thanks,
Ming
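
A minimal sketch of the read()/write() path referred to above, assuming the
per-io buffer position comes from the ublk_pos() helper in this series'
ublk_cmd.h (cdev_fd is the ublk char device; the function name is purely
illustrative):

	#include <unistd.h>
	#include <linux/ublk_cmd.h>	/* ublk_pos(), per this series */

	/* read a WRITE request's data directly instead of NEED_GET_DATA */
	static ssize_t ublk_read_io_data(int cdev_fd, __u16 q_id, __u16 tag,
					 void *buf, __u32 len)
	{
		return pread(cdev_fd, buf, len, ublk_pos(q_id, tag, 0));
	}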
Patch

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 1713b2890abb..7b7aa24e9729 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -297,18 +297,126 @@  with specified IO tag in the command data:
   ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
   the server buffer (pages) read to the IO request pages.
 
-Future development
-==================
+- ``UBLK_IO_FUSED_SUBMIT_IO``
+
+  Used for implementing zero copy feature.
+
+  It has to be the primary command of an io_uring fused command. This command
+  submits the generic secondary IO request with the io buffer provided by the
+  primary command, and won't be completed until the secondary request is done.
+
+  The provided buffer is represented as ``io_uring_bvec_buf``, which is
+  actually a reference to the ublk request buffer; the reference is shared &
+  read-only, so the generic secondary request can retrieve any part of the
+  buffer by passing a buffer offset & length.
 
 Zero copy
----------
+=========
+
+What is zero copy?
+------------------
+
+When an application submits IO to ``/dev/ublkb*``, a userspace buffer (direct
+io), page cache buffer (buffered io) or kernel buffer (often meta io) is used
+for submitting data to the ublk driver, and all of these buffers are finally
+represented by bio/bvecs (the ublk request buffer). Before supporting zero
+copy, data in these buffers has to be copied to the ublk server's userspace
+buffer before handling WRITE IO, or after handling READ IO, so that the ublk
+server can handle IO for ``/dev/ublkb*`` with the copied data.
+
+The extra copy between the ublk request buffer and the ublk server's userspace
+buffer not only increases CPU utilization (pinning pages, copying data), but
+also consumes memory bandwidth, and the cost can be very big when the IO size
+is big. It has been observed that ublk-null IOPS may be increased to ~5X if
+the extra copy is avoided.
+
+So zero copy is very important for supporting high-performance block devices
+in userspace.
+
+Technical requirements
+----------------------
+
+- ublk request buffer use
+
+the ublk request buffer is represented by bio/bvec, which is immutable, so do
+not try to change the bvec via the buffer reference; data can be read from or
+written to the buffer according to the buffer direction, but the bvec itself
+can't be changed
+
+- buffer lifetime
+
+The ublk server borrows the ublk request buffer for handling ublk IO via a
+buffer reference. A reference can't outlive the referenced buffer: all request
+buffer references have to be released by the ublk server before the ublk
+driver completes the request, at which point request buffer ownership is
+transferred to the upper layer (FS, application, ...).
+
+Also, after the ublk request is completed, no page belonging to this ublk
+request may be written or read any more by the ublk server, since it is a
+block device from the kernel viewpoint.
+
+- buffer direction
+
+For a ublk WRITE request, the ublk request buffer should only be accessed as
+the data source, and the buffer can't be written by the ublk server
+
+For a ublk READ request, the ublk request buffer should only be accessed as the
+data destination, and the buffer can't be read by the ublk server; otherwise
+kernel data is leaked to the ublk server, which can be an unprivileged application.
+
+- arbitrary-size sub-buffers need to be retrieved by the ublk server
+
+ublk is a generic framework for implementing block devices in userspace,
+and typical requirements include logical volume managers (mirror, striped, ...),
+distributed network storage, compressed targets, ...
+
+The ublk server needs to retrieve arbitrary-size sub-buffers of a ublk request
+and submit IOs with these sub-buffer(s). That also means arbitrary-size
+sub-buffer(s) can be used to submit IO multiple times.
+
+Any sub-buffer is actually a reference to the ublk request buffer, whose
+ownership can't be transferred to the upper layer while any reference is
+held by the ublk server.
+
+Why splice isn't good for ublk zero copy
+----------------------------------------
+
+- spliced page from ->splice_read() can't be written
+
+a ublk READ request can't be handled because a spliced page can't be written
+to, and extending splice for ublk zero copy isn't a good solution [#splice_extend]_
+
+- it is very hard to meet the above requirements wrt. request buffer lifetime
+
+splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
+attention to ublk request buffer lifetime. It is very inefficient to respect
+the request buffer lifetime by relying on every pipe buffer's ->release(),
+which requires all pipe buffers and the pipe to be kept around while the ublk
+server handles IO. That means a single dedicated ``pipe_inode_info`` has to
+be allocated at runtime for each provided buffer, and the pipe needs to be
+populated with the pages in the ublk request buffer.
+
+
+io_uring fused command based zero copy
+--------------------------------------
+
+An io_uring fused command includes one primary command (uring command) and one
+generic secondary request. The primary command is responsible for submitting
+the secondary request with the buffer provided from the ublk request, and the
+primary command won't be completed until the secondary request is completed.
+
+Typical ublk IO handling includes network and FS IO, so it is enough for
+io_uring net & fs to support IO with a buffer provided by the primary command.
 
-Zero copy is a generic requirement for nbd, fuse or similar drivers. A
-problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
-can't be remapped any more in kernel with existing mm interfaces. This can
-occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
-big requests (IO size >= 256 KB) may benefit a lot from zero copy.
+Once the primary command is submitted successfully, the ublk driver guarantees
+that the ublk request buffer won't go away, since the secondary request
+actually grabs a reference to the buffer. This also guarantees that multiple
+concurrent fused commands associated with the same request buffer work fine,
+as the provided buffer reference is shared & read-only.
 
+Also, the buffer usage direction flag is passed to the primary command from
+userspace, so the ublk driver can validate whether it is legal to use the
+buffer with the requested direction.
 
 References
 ==========
@@ -323,4 +431,4 @@  References
 
 .. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
 
-.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
+.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@mail.gmail.com/
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index a49b4de5ae1e..52b0a6e2be6e 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -74,10 +74,15 @@  struct ublk_rq_data {
 	 *   successfully
 	 */
 	struct kref ref;
+	bool allocated_bvec;
+	struct io_uring_bvec_buf buf[0];
 };
 
 struct ublk_uring_cmd_pdu {
-	struct ublk_queue *ubq;
+	union {
+		struct ublk_queue *ubq;
+		struct request *req;
+	};
 };
 
 /*
@@ -565,6 +570,69 @@  static size_t ublk_copy_user_pages(const struct request *req,
 	return done;
 }
 
+/*
+ * The built command buffer is immutable, so it is fine to feed it to
+ * concurrent io_uring fused commands
+ */
+static int ublk_init_zero_copy_buffer(struct request *rq)
+{
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+	struct io_uring_bvec_buf *imu = data->buf;
+	struct req_iterator rq_iter;
+	unsigned int nr_bvecs = 0;
+	struct bio_vec *bvec;
+	unsigned int offset;
+	struct bio_vec bv;
+
+	if (!ublk_rq_has_data(rq))
+		goto exit;
+
+	rq_for_each_bvec(bv, rq, rq_iter)
+		nr_bvecs++;
+
+	if (!nr_bvecs)
+		goto exit;
+
+	if (rq->bio != rq->biotail) {
+		int idx = 0;
+
+		bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
+				GFP_NOIO);
+		if (!bvec)
+			return -ENOMEM;
+
+		offset = 0;
+		rq_for_each_bvec(bv, rq, rq_iter)
+			bvec[idx++] = bv;
+		data->allocated_bvec = true;
+	} else {
+		struct bio *bio = rq->bio;
+
+		offset = bio->bi_iter.bi_bvec_done;
+		bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
+	}
+	imu->bvec = bvec;
+	imu->nr_bvecs = nr_bvecs;
+	imu->offset = offset;
+	imu->len = blk_rq_bytes(rq);
+
+	return 0;
+exit:
+	imu->bvec = NULL;
+	return 0;
+}
+
+static void ublk_deinit_zero_copy_buffer(struct request *rq)
+{
+	struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
+	struct io_uring_bvec_buf *imu = data->buf;
+
+	if (data->allocated_bvec) {
+		kvfree(imu->bvec);
+		data->allocated_bvec = false;
+	}
+}
+
 static inline bool ublk_need_map_req(const struct request *req)
 {
 	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
@@ -575,11 +643,23 @@  static inline bool ublk_need_unmap_req(const struct request *req)
 	return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
 }
 
-static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
+static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
 		struct ublk_io *io)
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
+	if (ublk_support_zc(ubq)) {
+		int ret = ublk_init_zero_copy_buffer(req);
+
+		/*
+		 * The only failure is -ENOMEM for allocating fused cmd
+		 * buffer, return zero so that we can requeue this req.
+		 */
+		if (unlikely(ret))
+			return 0;
+		return rq_bytes;
+	}
+
 	/*
 	 * no zero copy, we delay copy WRITE request data into ublksrv
 	 * context and the big benefit is that pinning pages in current
@@ -599,11 +679,17 @@  static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 }
 
 static int ublk_unmap_io(const struct ublk_queue *ubq,
-		const struct request *req,
+		struct request *req,
 		struct ublk_io *io)
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
+	if (ublk_support_zc(ubq)) {
+		ublk_deinit_zero_copy_buffer(req);
+
+		return rq_bytes;
+	}
+
 	if (ublk_need_unmap_req(req)) {
 		struct iov_iter iter;
 		struct iovec iov;
@@ -687,6 +773,12 @@  static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
 	return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
 }
 
+static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
+		struct io_uring_cmd *ioucmd)
+{
+	return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
+}
+
 static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
 {
 	return ubq->ubq_daemon->flags & PF_EXITING;
@@ -742,6 +834,7 @@  static inline void __ublk_complete_rq(struct request *req)
 
 	return;
 exit:
+	ublk_deinit_zero_copy_buffer(req);
 	blk_mq_end_request(req, res);
 }
 
@@ -1352,6 +1445,68 @@  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
 	return NULL;
 }
 
+static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd,
+		unsigned issue_flags)
+{
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
+	struct request *req = pdu->req;
+	struct ublk_queue *ubq = req->mq_hctx->driver_data;
+
+	ublk_put_req_ref(ubq, req);
+	io_uring_cmd_done(cmd, cmd->fused.data.secondary_res, 0, issue_flags);
+}
+
+static inline bool ublk_check_fused_buf_dir(const struct request *req,
+		unsigned int flags)
+{
+	flags &= IO_URING_F_FUSED;
+
+	if (req_op(req) == REQ_OP_READ && flags == IO_URING_F_FUSED_BUF_DEST)
+		return true;
+
+	if (req_op(req) == REQ_OP_WRITE && flags == IO_URING_F_FUSED_BUF_SRC)
+		return true;
+
+	return false;
+}
+
+static int ublk_handle_fused_cmd(struct io_uring_cmd *cmd,
+		struct ublk_queue *ubq, int tag, unsigned int issue_flags)
+{
+	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
+	struct ublk_device *ub = cmd->file->private_data;
+	struct ublk_rq_data *data;
+	struct request *req;
+
+	if (!ub)
+		return -EPERM;
+
+	if (!(issue_flags & IO_URING_F_FUSED))
+		goto exit;
+
+	req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+	if (!req)
+		goto exit;
+
+	pr_devel("%s: qid %d tag %u request bytes %u, issue flags %x\n",
+			__func__, ubq->q_id, tag, blk_rq_bytes(req),
+			issue_flags);
+
+	if (!ublk_check_fused_buf_dir(req, issue_flags))
+		goto exit_put_ref;
+
+	pdu->req = req;
+	data = blk_mq_rq_to_pdu(req);
+	io_fused_cmd_start_secondary_req(cmd, !(issue_flags & IO_URING_F_UNLOCKED),
+			data->buf, ublk_fused_cmd_done_cb);
+	return -EIOCBQUEUED;
+
+exit_put_ref:
+	ublk_put_req_ref(ubq, req);
+exit:
+	return -EINVAL;
+}
+
 static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 {
 	struct ublksrv_io_cmd *ub_cmd = (struct ublksrv_io_cmd *)cmd->cmd;
@@ -1367,6 +1522,10 @@  static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 			__func__, cmd->cmd_op, ub_cmd->q_id, tag,
 			ub_cmd->result);
 
+	if ((issue_flags & IO_URING_F_FUSED) &&
+			cmd_op != UBLK_IO_FUSED_SUBMIT_IO)
+		return -EOPNOTSUPP;
+
 	if (ub_cmd->q_id >= ub->dev_info.nr_hw_queues)
 		goto out;
 
@@ -1374,7 +1533,12 @@  static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	if (!ubq || ub_cmd->q_id != ubq->q_id)
 		goto out;
 
-	if (ubq->ubq_daemon && ubq->ubq_daemon != current)
+	/*
+	 * The fused command reads the io buffer data structure only, so it
+	 * is fine to be issued from other context.
+	 */
+	if ((ubq->ubq_daemon && ubq->ubq_daemon != current) &&
+			(cmd_op != UBLK_IO_FUSED_SUBMIT_IO))
 		goto out;
 
 	if (tag >= ubq->q_depth)
@@ -1397,6 +1561,9 @@  static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		goto out;
 
 	switch (cmd_op) {
+	case UBLK_IO_FUSED_SUBMIT_IO:
+		return ublk_handle_fused_cmd(cmd, ubq, tag, issue_flags);
+
 	case UBLK_IO_FETCH_REQ:
 		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
 		if (ublk_queue_ready(ubq)) {
@@ -1726,11 +1893,14 @@  static void ublk_align_max_io_size(struct ublk_device *ub)
 
 static int ublk_add_tag_set(struct ublk_device *ub)
 {
+	int zc = !!(ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY);
+	struct ublk_rq_data *data;
+
 	ub->tag_set.ops = &ublk_mq_ops;
 	ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
 	ub->tag_set.queue_depth = ub->dev_info.queue_depth;
 	ub->tag_set.numa_node = NUMA_NO_NODE;
-	ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
+	ub->tag_set.cmd_size = struct_size(data, buf, zc);
 	ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	ub->tag_set.driver_data = ub;
 	return blk_mq_alloc_tag_set(&ub->tag_set);
@@ -1946,12 +2116,18 @@  static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	 */
 	ub->dev_info.flags &= UBLK_F_ALL;
 
+	/*
+	 * NEED_GET_DATA doesn't make sense any more when ZERO_COPY
+	 * is requested. Another reason is that userspace can
+	 * read/write the io request buffer via pread()/pwrite() at
+	 * each io buffer's position.
+	 */
+	if (ub->dev_info.flags & UBLK_F_SUPPORT_ZERO_COPY)
+		ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
+
 	if (!IS_BUILTIN(CONFIG_BLK_DEV_UBLK))
 		ub->dev_info.flags |= UBLK_F_URING_CMD_COMP_IN_TASK;
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index d1a6b3dc0327..c4f3465399cf 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -44,6 +44,7 @@ 
 #define	UBLK_IO_FETCH_REQ		0x20
 #define	UBLK_IO_COMMIT_AND_FETCH_REQ	0x21
 #define	UBLK_IO_NEED_GET_DATA	0x22
+#define	UBLK_IO_FUSED_SUBMIT_IO	0x23
 
 /* only ABORT means that no re-fetch */
 #define UBLK_IO_RES_OK			0
@@ -85,10 +86,7 @@  static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
 		((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
 }
 
-/*
- * zero copy requires 4k block size, and can remap ublk driver's io
- * request into ublksrv's vm space
- */
+/* io_uring fused command based zero copy */
 #define UBLK_F_SUPPORT_ZERO_COPY	(1ULL << 0)
 
 /*
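
Usage note: with this series, a ublk server opts into fused-command zero copy
by setting UBLK_F_SUPPORT_ZERO_COPY in the device flags at creation time; the
driver then clears UBLK_F_NEED_GET_DATA as shown above. A sketch, with field
names following ublksrv_ctrl_dev_info in ublk_cmd.h, the control-command
plumbing elided, and the values purely illustrative:

	struct ublksrv_ctrl_dev_info info = {
		.nr_hw_queues	= 1,
		.queue_depth	= 128,
		.max_io_buf_bytes = 512 << 10,
		.dev_id		= (__u32)-1,	/* let the driver pick an id */
		.flags		= UBLK_F_SUPPORT_ZERO_COPY,
	};
	/* ... pass info to UBLK_CMD_ADD_DEV via the control device ... */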