[0/3] Consistently look up fixed buffers before going async

Message ID	20250321184819.3847386-1-csander@purestorage.com (mailing list archive)
Headers	show Received: from mail-il1-f225.google.com (mail-il1-f225.google.com [209.85.166.225]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 990AE13C914 for <io-uring@vger.kernel.org>; Fri, 21 Mar 2025 18:48:34 +0000 (UTC) From: Caleb Sander Mateos <csander@purestorage.com> To: Jens Axboe <axboe@kernel.dk>, Pavel Begunkov <asml.silence@gmail.com>, Ming Lei <ming.lei@redhat.com>, Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me> Cc: Xinyu Zhang <xizhang@purestorage.com>, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, Caleb Sander Mateos <csander@purestorage.com> Subject: [PATCH 0/3] Consistently look up fixed buffers before going async Date: Fri, 21 Mar 2025 12:48:16 -0600 Message-ID: <20250321184819.3847386-1-csander@purestorage.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	Consistently look up fixed buffers before going async \| expand [0/3] Consistently look up fixed buffers before going async [1/3] io_uring/net: only import send_zc buffer once [2/3] io_uring/net: import send_zc fixed buffer before going async [3/3] io_uring/uring_cmd: import fixed buffer before going async

Caleb Sander Mateos March 21, 2025, 6:48 p.m. UTC

To use ublk zero copy, an application submits a sequence of io_uring
operations:
(1) Register a ublk request's buffer into the fixed buffer table
(2) Use the fixed buffer in some I/O operation
(3) Unregister the buffer from the fixed buffer table

The ordering of these operations is critical; if the fixed buffer lookup
occurs before the register or after the unregister operation, the I/O
will fail with EFAULT or even corrupt a different ublk request's buffer.
It is possible to guarantee the correct order by linking the operations,
but that adds overhead and doesn't allow multiple I/O operations to
execute in parallel using the same ublk request's buffer. Ideally, the
application could just submit the register, I/O, and unregister SQEs in
the desired order without links and io_uring would ensure the ordering.
This mostly works, leveraging the fact that each io_uring SQE is prepped
and issued non-blocking in order (barring link, drain, and force-async
flags). But it requires the fixed buffer lookup to occur during the
initial non-blocking issue.

This patch series fixes the 2 gaps where the initial issue can return
EAGAIN before looking up the fixed buffer:
- IORING_OP_SEND_ZC using IORING_RECVSEND_POLL_FIRST
- IORING_OP_URING_CMD, of which NVMe passthru is currently the only
  fixed buffer user. blk_mq_alloc_request() can return EAGAIN before
  io_uring_cmd_import_fixed() is called to look up the fixed buffer.

Caleb Sander Mateos (3):
  io_uring/net: only import send_zc buffer once
  io_uring/net: import send_zc fixed buffer before going async
  io_uring/uring_cmd: import fixed buffer before going async

 drivers/nvme/host/ioctl.c    | 10 ++++------
 include/linux/io_uring/cmd.h |  6 ++----
 io_uring/net.c               | 13 ++++++++-----
 io_uring/rsrc.c              |  6 ++++++
 io_uring/rsrc.h              |  2 ++
 io_uring/uring_cmd.c         | 10 +++++++---
 6 files changed, 29 insertions(+), 18 deletions(-)

Jens Axboe March 21, 2025, 7:53 p.m. UTC | #1

On Fri, 21 Mar 2025 12:48:16 -0600, Caleb Sander Mateos wrote:
> To use ublk zero copy, an application submits a sequence of io_uring
> operations:
> (1) Register a ublk request's buffer into the fixed buffer table
> (2) Use the fixed buffer in some I/O operation
> (3) Unregister the buffer from the fixed buffer table
> 
> The ordering of these operations is critical; if the fixed buffer lookup
> occurs before the register or after the unregister operation, the I/O
> will fail with EFAULT or even corrupt a different ublk request's buffer.
> It is possible to guarantee the correct order by linking the operations,
> but that adds overhead and doesn't allow multiple I/O operations to
> execute in parallel using the same ublk request's buffer. Ideally, the
> application could just submit the register, I/O, and unregister SQEs in
> the desired order without links and io_uring would ensure the ordering.
> This mostly works, leveraging the fact that each io_uring SQE is prepped
> and issued non-blocking in order (barring link, drain, and force-async
> flags). But it requires the fixed buffer lookup to occur during the
> initial non-blocking issue.
> 
> [...]

Applied, thanks!

[1/3] io_uring/net: only import send_zc buffer once
      commit: 8e3100fcc5cbba03518b8b5c059624aba5c29d50
[2/3] io_uring/net: import send_zc fixed buffer before going async
      commit: 15f4c96bec5d0791904ee68c0f83ba18cab7466d
[3/3] io_uring/uring_cmd: import fixed buffer before going async
      commit: 70085217bec1eb8bbd19e661da9f1734ed8d35ca

Best regards,

Pavel Begunkov March 21, 2025, 8:24 p.m. UTC | #2

On 3/21/25 18:48, Caleb Sander Mateos wrote:
> To use ublk zero copy, an application submits a sequence of io_uring
> operations:
> (1) Register a ublk request's buffer into the fixed buffer table
> (2) Use the fixed buffer in some I/O operation
> (3) Unregister the buffer from the fixed buffer table
> 
> The ordering of these operations is critical; if the fixed buffer lookup
> occurs before the register or after the unregister operation, the I/O
> will fail with EFAULT or even corrupt a different ublk request's buffer.
> It is possible to guarantee the correct order by linking the operations,
> but that adds overhead and doesn't allow multiple I/O operations to
> execute in parallel using the same ublk request's buffer. Ideally, the
> application could just submit the register, I/O, and unregister SQEs in
> the desired order without links and io_uring would ensure the ordering.
> This mostly works, leveraging the fact that each io_uring SQE is prepped
> and issued non-blocking in order (barring link, drain, and force-async
> flags). But it requires the fixed buffer lookup to occur during the
> initial non-blocking issue.

In other words, leveraging internal details that is not a part
of the uapi, should never be relied upon by the user and is fragile.
Any drain request or IOSQE_ASYNC and it'll break, or for any reason
why it might be desirable to change the behaviour in the future.

Sorry, but no, we absolutely can't have that, it'll be an absolute
nightmare to maintain as basically every request scheduling decision
now becomes a part of the uapi.

There is an api to order requests, if you want to order them you
either have to use that or do it in user space. In your particular
case you can try to opportunistically issue them without ordering
by making sure the reg buffer slot is not reused in the meantime
and handling request failures.

Caleb Sander Mateos March 21, 2025, 9:24 p.m. UTC | #3

On Fri, Mar 21, 2025 at 1:23 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 3/21/25 18:48, Caleb Sander Mateos wrote:
> > To use ublk zero copy, an application submits a sequence of io_uring
> > operations:
> > (1) Register a ublk request's buffer into the fixed buffer table
> > (2) Use the fixed buffer in some I/O operation
> > (3) Unregister the buffer from the fixed buffer table
> >
> > The ordering of these operations is critical; if the fixed buffer lookup
> > occurs before the register or after the unregister operation, the I/O
> > will fail with EFAULT or even corrupt a different ublk request's buffer.
> > It is possible to guarantee the correct order by linking the operations,
> > but that adds overhead and doesn't allow multiple I/O operations to
> > execute in parallel using the same ublk request's buffer. Ideally, the
> > application could just submit the register, I/O, and unregister SQEs in
> > the desired order without links and io_uring would ensure the ordering.
> > This mostly works, leveraging the fact that each io_uring SQE is prepped
> > and issued non-blocking in order (barring link, drain, and force-async
> > flags). But it requires the fixed buffer lookup to occur during the
> > initial non-blocking issue.
>
> In other words, leveraging internal details that is not a part
> of the uapi, should never be relied upon by the user and is fragile.
> Any drain request or IOSQE_ASYNC and it'll break, or for any reason
> why it might be desirable to change the behaviour in the future.
>
> Sorry, but no, we absolutely can't have that, it'll be an absolute
> nightmare to maintain as basically every request scheduling decision
> now becomes a part of the uapi.

I thought we discussed this on the ublk zero copy patchset, but I
can't seem to find the email. My recollection is that Jens thought it
was reasonable for userspace to rely on the sequential prep + issue of
each SQE as long as it's not setting any of these flags that affect
their order. (Please correct me if that's not what you remember.)
I don't have a strong opinion about whether or not io_uring should
provide this guarantee, but I was under the impression this had
already been decided. I was just trying to fix the few gaps in this
guarantee, but I'm fine dropping the patches if Jens also feels
userspace shouldn't rely on this io_uring behavior.

>
> There is an api to order requests, if you want to order them you
> either have to use that or do it in user space. In your particular
> case you can try to opportunistically issue them without ordering
> by making sure the reg buffer slot is not reused in the meantime
> and handling request failures.

Yes, I am aware of the other options. Unfortunately, io_uring's linked
operation interface isn't rich enough to express an arbitrary
dependency graph. We have multiple I/O operations operating on the
same ublk request's buffer, so we would either need to link the I/O
operations (which would prevent them from executing in parallel), or
use a separate register/unregister operation for every I/O operation
(which has considerable overhead). We can also wait for the completion
of the I/O operations before submitting the unregister operation, but
that adds latency to the ublk request and requires another
io_uring_enter syscall.

We are using separate registered buffer indices for each ublk request
so at least this scenario doesn't lead to data corruption. And we can
certainly handle the EFAULT when the operation goes asynchronous, but
it would be preferable not to need to do that.

Best,
Caleb

Ming Lei March 22, 2025, 7:33 a.m. UTC | #4

On Fri, Mar 21, 2025 at 12:48:16PM -0600, Caleb Sander Mateos wrote:
> To use ublk zero copy, an application submits a sequence of io_uring
> operations:
> (1) Register a ublk request's buffer into the fixed buffer table
> (2) Use the fixed buffer in some I/O operation
> (3) Unregister the buffer from the fixed buffer table
> 
> The ordering of these operations is critical; if the fixed buffer lookup
> occurs before the register or after the unregister operation, the I/O
> will fail with EFAULT or even corrupt a different ublk request's buffer.
> It is possible to guarantee the correct order by linking the operations,
> but that adds overhead and doesn't allow multiple I/O operations to
> execute in parallel using the same ublk request's buffer. Ideally, the
> application could just submit the register, I/O, and unregister SQEs in
> the desired order without links and io_uring would ensure the ordering.

So far there are only two ways to provide the order guarantee in io_uring
syscall viewpoint:

1) IOSQE_IO_LINK

2) submit register_buffer operation and wait its completion, then submit IO
operations

Otherwise, you may just depend on the implementation, and there isn't such
order guarantee, and it is hard to write generic io_uring application.

I posted sqe group patchset for addressing this particular requirement in
API level.

https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/

Now I'd suggest to re-consider this approach for respecting the order
in API level, so both application and io_uring needn't play trick for
addressing this real problem.

With sqe group, just two OPs are needed:

- provide_buffer OP(group leader)

- other generic OPs(group members)

group leader won't be completed until all group member OPs are done.

The whole group share same IO_LINK/IO_HARDLINK flag.

That is all the concept, and this approach takes less SQEs, and application
will become simpler too.

Thanks,
Ming

Ming Lei March 22, 2025, 7:42 a.m. UTC | #5

On Fri, Mar 21, 2025 at 08:24:43PM +0000, Pavel Begunkov wrote:
> On 3/21/25 18:48, Caleb Sander Mateos wrote:
> > To use ublk zero copy, an application submits a sequence of io_uring
> > operations:
> > (1) Register a ublk request's buffer into the fixed buffer table
> > (2) Use the fixed buffer in some I/O operation
> > (3) Unregister the buffer from the fixed buffer table
> > 
> > The ordering of these operations is critical; if the fixed buffer lookup
> > occurs before the register or after the unregister operation, the I/O
> > will fail with EFAULT or even corrupt a different ublk request's buffer.
> > It is possible to guarantee the correct order by linking the operations,
> > but that adds overhead and doesn't allow multiple I/O operations to
> > execute in parallel using the same ublk request's buffer. Ideally, the
> > application could just submit the register, I/O, and unregister SQEs in
> > the desired order without links and io_uring would ensure the ordering.
> > This mostly works, leveraging the fact that each io_uring SQE is prepped
> > and issued non-blocking in order (barring link, drain, and force-async
> > flags). But it requires the fixed buffer lookup to occur during the
> > initial non-blocking issue.
> 
> In other words, leveraging internal details that is not a part
> of the uapi, should never be relied upon by the user and is fragile.
> Any drain request or IOSQE_ASYNC and it'll break, or for any reason
> why it might be desirable to change the behaviour in the future.
> 
> Sorry, but no, we absolutely can't have that, it'll be an absolute
> nightmare to maintain as basically every request scheduling decision
> now becomes a part of the uapi.
> 
> There is an api to order requests, if you want to order them you
> either have to use that or do it in user space. In your particular
> case you can try to opportunistically issue them without ordering
> by making sure the reg buffer slot is not reused in the meantime
> and handling request failures.

I agree, the order should be provided from UAPI/syscall level.

SQE group does address this order issue, and now it can work with
fixed buffer registering OP together.

If no one objects, I will post out the patch for review.


Thanks, 
Ming

Pavel Begunkov March 22, 2025, 12:33 p.m. UTC | #6

On 3/21/25 21:24, Caleb Sander Mateos wrote:
> On Fri, Mar 21, 2025 at 1:23 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 3/21/25 18:48, Caleb Sander Mateos wrote:
>>> To use ublk zero copy, an application submits a sequence of io_uring
>>> operations:
>>> (1) Register a ublk request's buffer into the fixed buffer table
>>> (2) Use the fixed buffer in some I/O operation
>>> (3) Unregister the buffer from the fixed buffer table
>>>
>>> The ordering of these operations is critical; if the fixed buffer lookup
>>> occurs before the register or after the unregister operation, the I/O
>>> will fail with EFAULT or even corrupt a different ublk request's buffer.
>>> It is possible to guarantee the correct order by linking the operations,
>>> but that adds overhead and doesn't allow multiple I/O operations to
>>> execute in parallel using the same ublk request's buffer. Ideally, the
>>> application could just submit the register, I/O, and unregister SQEs in
>>> the desired order without links and io_uring would ensure the ordering.
>>> This mostly works, leveraging the fact that each io_uring SQE is prepped
>>> and issued non-blocking in order (barring link, drain, and force-async
>>> flags). But it requires the fixed buffer lookup to occur during the
>>> initial non-blocking issue.
>>
>> In other words, leveraging internal details that is not a part
>> of the uapi, should never be relied upon by the user and is fragile.
>> Any drain request or IOSQE_ASYNC and it'll break, or for any reason
>> why it might be desirable to change the behaviour in the future.
>>
>> Sorry, but no, we absolutely can't have that, it'll be an absolute
>> nightmare to maintain as basically every request scheduling decision
>> now becomes a part of the uapi.
> 
> I thought we discussed this on the ublk zero copy patchset, but I
> can't seem to find the email. My recollection is that Jens thought it
> was reasonable for userspace to rely on the sequential prep + issue of
> each SQE as long as it's not setting any of these flags that affect
> their order. (Please correct me if that's not what you remember.)

Well, my opinions are my own. I think it's reasonable to assume that
for optimisation purposes IFF the user space can sanely handle
errors when the assumption fails.

In your case, the user space should expect that an unregistration
op can happen before a read/write had resolved the buffer (node), in
which case the rw request will find that the buffer slot is empty
and fail. And that should be handled in the user space, e.g.
by reissuing the rw request of failing.

> I don't have a strong opinion about whether or not io_uring should
> provide this guarantee, but I was under the impression this had
> already been decided. I was just trying to fix the few gaps in this

I don't think so, it's a major uapi change, and a huge burden
for many future core io_uring changes.

> guarantee, but I'm fine dropping the patches if Jens also feels
> userspace shouldn't rely on this io_uring behavior.
> 
>>
>> There is an api to order requests, if you want to order them you
>> either have to use that or do it in user space. In your particular
>> case you can try to opportunistically issue them without ordering
>> by making sure the reg buffer slot is not reused in the meantime
>> and handling request failures.
> 
> Yes, I am aware of the other options. Unfortunately, io_uring's linked
> operation interface isn't rich enough to express an arbitrary
> dependency graph. We have multiple I/O operations operating on the
> same ublk request's buffer, so we would either need to link the I/O
> operations (which would prevent them from executing in parallel), or
> use a separate register/unregister operation for every I/O operation
> (which has considerable overhead). We can also wait for the completion
> of the I/O operations before submitting the unregister operation, but
> that adds latency to the ublk request and requires another
> io_uring_enter syscall.
> 
> We are using separate registered buffer indices for each ublk request
> so at least this scenario doesn't lead to data corruption. And we can
> certainly handle the EFAULT when the operation goes asynchronous, but
> it would be preferable not to need to do that.

Pavel Begunkov March 24, 2025, 4:41 p.m. UTC | #7

On 3/22/25 07:33, Ming Lei wrote:
> On Fri, Mar 21, 2025 at 12:48:16PM -0600, Caleb Sander Mateos wrote:
>> To use ublk zero copy, an application submits a sequence of io_uring
>> operations:
>> (1) Register a ublk request's buffer into the fixed buffer table
>> (2) Use the fixed buffer in some I/O operation
>> (3) Unregister the buffer from the fixed buffer table
>>
>> The ordering of these operations is critical; if the fixed buffer lookup
>> occurs before the register or after the unregister operation, the I/O
>> will fail with EFAULT or even corrupt a different ublk request's buffer.
>> It is possible to guarantee the correct order by linking the operations,
>> but that adds overhead and doesn't allow multiple I/O operations to
>> execute in parallel using the same ublk request's buffer. Ideally, the
>> application could just submit the register, I/O, and unregister SQEs in
>> the desired order without links and io_uring would ensure the ordering.
> 
> So far there are only two ways to provide the order guarantee in io_uring
> syscall viewpoint:
> 
> 1) IOSQE_IO_LINK
> 
> 2) submit register_buffer operation and wait its completion, then submit IO
> operations
> 
> Otherwise, you may just depend on the implementation, and there isn't such
> order guarantee, and it is hard to write generic io_uring application.
> 
> I posted sqe group patchset for addressing this particular requirement in
> API level.
> 
> https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/
> 
> Now I'd suggest to re-consider this approach for respecting the order
> in API level, so both application and io_uring needn't play trick for
> addressing this real problem.

The group api was one of the major sources of uneasiness for previous
iterations of ublk zc. The kernel side was messy, even though I
understand that the messiness was necessitated from the choice of
the API and the mismatch with existing io_uring bits.

The question is whether it can be made simpler and more streamlined
now, internally and from the point of uapi as well. E.g. can it
extend traditional links paths without leaking into other core io_uring
parts where it shouldn't be? And to be honest, I can't say I like the
idea, just as I'm not excited by links we already have. They're a
pain to keep around, the abstraction is leaking in all unexpected
places, and it's not flexible enough and needs kernel changes for
every new simple case, not to mention something more complicated
like reading a memory and deciding about the next request from that.

I'd rather argue for letting the user to do that in bpf and make
it responsible for all error parsing and argument inference, as
in patches I sent around December, though they need to be extended
to go beyond cqe-sqe manipulation interface.

> With sqe group, just two OPs are needed:
> 
> - provide_buffer OP(group leader)
> 
> - other generic OPs(group members)
> 
> group leader won't be completed until all group member OPs are done.
> 
> The whole group share same IO_LINK/IO_HARDLINK flag.
> 
> That is all the concept, and this approach takes less SQEs, and application
> will become simpler too.

[0/3] Consistently look up fixed buffers before going async

Message

Comments