[0/2] uring_cmd SQE corruptions

Message ID	20250212204546.3751645-1-csander@purestorage.com (mailing list archive)
Headers	show Received: from mail-io1-f100.google.com (mail-io1-f100.google.com [209.85.166.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C65041EEA4A for <io-uring@vger.kernel.org>; Wed, 12 Feb 2025 20:46:33 +0000 (UTC) From: Caleb Sander Mateos <csander@purestorage.com> To: Jens Axboe <axboe@kernel.dk>, Pavel Begunkov <asml.silence@gmail.com> Cc: Riley Thomasson <riley@purestorage.com>, io-uring@vger.kernel.org, linux-kernel@vger.kernel.org, Caleb Sander Mateos <csander@purestorage.com> Subject: [PATCH 0/2] uring_cmd SQE corruptions Date: Wed, 12 Feb 2025 13:45:44 -0700 Message-ID: <20250212204546.3751645-1-csander@purestorage.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	uring_cmd SQE corruptions \| expand [0/2] uring_cmd SQE corruptions [1/2] io_uring/uring_cmd: don't assume io_uring_cmd_data layout [2/2] io_uring/uring_cmd: switch sqe to async_data on EAGAIN

Caleb Sander Mateos Feb. 12, 2025, 8:45 p.m. UTC

In our application issuing NVMe passthru commands, we have observed
nvme_uring_cmd fields being corrupted between when userspace initializes
the io_uring SQE and when nvme_uring_cmd_io() processes it.

We hypothesized that the uring_cmd's were executing asynchronously after
the io_uring_enter() syscall returned, yet were still reading the SQE in
the userspace-mapped SQ. Since io_uring_enter() had already incremented
the SQ head index, userspace reused the SQ slot for a new SQE once the
SQ wrapped around to it.

We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
index in userspace upon return from io_uring_enter(). By overwriting the
nvme_uring_cmd nsid field with a known garbage value, we were able to
trigger the err message in nvme_validate_passthru_nsid(), which logged
the garbage nsid value.

The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
SQE copying until it's needed"). With this commit reverted, the poisoned
values in the SQEs are no longer seen by nvme_uring_cmd_io().

Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
to async_data at prep time. The commit moved this memcpy() to 2 cases
when the request goes async:
- If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
- If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue

This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
pointer is not updated to point to async_data after the memcpy(),
as it correctly is in the REQ_F_FORCE_ASYNC case.

However, uring_cmd's can be issued async in other cases not enumerated
by 5eff57fa9f3a, also leading to SQE corruption. These include requests
besides the first in a linked chain, which are only issued once prior
requests complete. Requests waiting for a drain to complete would also
be initially issued async.

While it's probably possible for io_uring_cmd_prep_setup() to check for
each of these cases and avoid deferring the SQE memcpy(), we feel it
might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
As discussed recently in regard to the ublk zero-copy patches[1], new
async paths added in the future could break these delicate assumptions.

Thoughts?

[1]: https://lore.kernel.org/io-uring/7c2c2668-4f23-41d9-9cdf-c8ddd1f13f7c@gmail.com/

Caleb Sander Mateos (2):
  io_uring/uring_cmd: don't assume io_uring_cmd_data layout
  io_uring/uring_cmd: switch sqe to async_data on EAGAIN

 io_uring/uring_cmd.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

Jens Axboe Feb. 12, 2025, 8:55 p.m. UTC | #1

On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
> In our application issuing NVMe passthru commands, we have observed
> nvme_uring_cmd fields being corrupted between when userspace initializes
> the io_uring SQE and when nvme_uring_cmd_io() processes it.
> 
> We hypothesized that the uring_cmd's were executing asynchronously after
> the io_uring_enter() syscall returned, yet were still reading the SQE in
> the userspace-mapped SQ. Since io_uring_enter() had already incremented
> the SQ head index, userspace reused the SQ slot for a new SQE once the
> SQ wrapped around to it.
> 
> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
> index in userspace upon return from io_uring_enter(). By overwriting the
> nvme_uring_cmd nsid field with a known garbage value, we were able to
> trigger the err message in nvme_validate_passthru_nsid(), which logged
> the garbage nsid value.
> 
> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
> SQE copying until it's needed"). With this commit reverted, the poisoned
> values in the SQEs are no longer seen by nvme_uring_cmd_io().
> 
> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
> to async_data at prep time. The commit moved this memcpy() to 2 cases
> when the request goes async:
> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
> 
> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
> pointer is not updated to point to async_data after the memcpy(),
> as it correctly is in the REQ_F_FORCE_ASYNC case.
> 
> However, uring_cmd's can be issued async in other cases not enumerated
> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
> besides the first in a linked chain, which are only issued once prior
> requests complete. Requests waiting for a drain to complete would also
> be initially issued async.
> 
> While it's probably possible for io_uring_cmd_prep_setup() to check for
> each of these cases and avoid deferring the SQE memcpy(), we feel it
> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
> As discussed recently in regard to the ublk zero-copy patches[1], new
> async paths added in the future could break these delicate assumptions.

I don't think it's particularly delicate - did you manage to catch the
case queueing a request for async execution where the sqe wasn't already
copied? I did take a quick look after our out-of-band conversation, and
the only missing bit I immediately spotted is using SQPOLL. But I don't
think you're using that, right? And in any case, lifetime of SQEs with
SQPOLL is the duration of the request anyway, so should not pose any
risk of overwriting SQEs. But I do think the code should copy for that
case too, just to avoid it being a harder-to-use thing than it should
be.

The two patches here look good, I'll go ahead with those. That'll give
us a bit of time to figure out where this missing copy is.

Jens Axboe Feb. 12, 2025, 9:02 p.m. UTC | #2

On 2/12/25 1:55 PM, Jens Axboe wrote:
> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
>> In our application issuing NVMe passthru commands, we have observed
>> nvme_uring_cmd fields being corrupted between when userspace initializes
>> the io_uring SQE and when nvme_uring_cmd_io() processes it.
>>
>> We hypothesized that the uring_cmd's were executing asynchronously after
>> the io_uring_enter() syscall returned, yet were still reading the SQE in
>> the userspace-mapped SQ. Since io_uring_enter() had already incremented
>> the SQ head index, userspace reused the SQ slot for a new SQE once the
>> SQ wrapped around to it.
>>
>> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
>> index in userspace upon return from io_uring_enter(). By overwriting the
>> nvme_uring_cmd nsid field with a known garbage value, we were able to
>> trigger the err message in nvme_validate_passthru_nsid(), which logged
>> the garbage nsid value.
>>
>> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
>> SQE copying until it's needed"). With this commit reverted, the poisoned
>> values in the SQEs are no longer seen by nvme_uring_cmd_io().
>>
>> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
>> to async_data at prep time. The commit moved this memcpy() to 2 cases
>> when the request goes async:
>> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
>> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
>>
>> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
>> pointer is not updated to point to async_data after the memcpy(),
>> as it correctly is in the REQ_F_FORCE_ASYNC case.
>>
>> However, uring_cmd's can be issued async in other cases not enumerated
>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
>> besides the first in a linked chain, which are only issued once prior
>> requests complete. Requests waiting for a drain to complete would also
>> be initially issued async.
>>
>> While it's probably possible for io_uring_cmd_prep_setup() to check for
>> each of these cases and avoid deferring the SQE memcpy(), we feel it
>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
>> As discussed recently in regard to the ublk zero-copy patches[1], new
>> async paths added in the future could break these delicate assumptions.
> 
> I don't think it's particularly delicate - did you manage to catch the
> case queueing a request for async execution where the sqe wasn't already
> copied? I did take a quick look after our out-of-band conversation, and
> the only missing bit I immediately spotted is using SQPOLL. But I don't
> think you're using that, right? And in any case, lifetime of SQEs with
> SQPOLL is the duration of the request anyway, so should not pose any
> risk of overwriting SQEs. But I do think the code should copy for that
> case too, just to avoid it being a harder-to-use thing than it should
> be.
> 
> The two patches here look good, I'll go ahead with those. That'll give
> us a bit of time to figure out where this missing copy is.

Can you try this on top of your 2 and see if you still hit anything odd?

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index bcfca18395c4..15a8a67f556e 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -177,10 +177,13 @@ static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
 	ioucmd->sqe = cache->sqes;
 }
 
+#define SQE_COPY_FLAGS	(REQ_F_FORCE_ASYNC|REQ_F_LINK|REQ_F_HARDLINK|REQ_F_IO_DRAIN)
+
 static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 				   const struct io_uring_sqe *sqe)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	struct io_ring_ctx *ctx = req->ctx;
 	struct io_uring_cmd_data *cache;
 
 	cache = io_uring_alloc_async_data(&req->ctx->uring_cache, req);
@@ -190,7 +193,7 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 
 	ioucmd->sqe = sqe;
 	/* defer memcpy until we need it */
-	if (unlikely(req->flags & REQ_F_FORCE_ASYNC))
+	if (unlikely(ctx->flags & IORING_SETUP_SQPOLL || req->flags & SQE_COPY_FLAGS))
 		io_uring_cmd_cache_sqes(req);
 	return 0;
 }

Caleb Sander Mateos Feb. 12, 2025, 9:54 p.m. UTC | #3

On Wed, Feb 12, 2025 at 12:55 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
> > In our application issuing NVMe passthru commands, we have observed
> > nvme_uring_cmd fields being corrupted between when userspace initializes
> > the io_uring SQE and when nvme_uring_cmd_io() processes it.
> >
> > We hypothesized that the uring_cmd's were executing asynchronously after
> > the io_uring_enter() syscall returned, yet were still reading the SQE in
> > the userspace-mapped SQ. Since io_uring_enter() had already incremented
> > the SQ head index, userspace reused the SQ slot for a new SQE once the
> > SQ wrapped around to it.
> >
> > We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
> > index in userspace upon return from io_uring_enter(). By overwriting the
> > nvme_uring_cmd nsid field with a known garbage value, we were able to
> > trigger the err message in nvme_validate_passthru_nsid(), which logged
> > the garbage nsid value.
> >
> > The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
> > SQE copying until it's needed"). With this commit reverted, the poisoned
> > values in the SQEs are no longer seen by nvme_uring_cmd_io().
> >
> > Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
> > to async_data at prep time. The commit moved this memcpy() to 2 cases
> > when the request goes async:
> > - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
> > - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
> >
> > This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
> > pointer is not updated to point to async_data after the memcpy(),
> > as it correctly is in the REQ_F_FORCE_ASYNC case.
> >
> > However, uring_cmd's can be issued async in other cases not enumerated
> > by 5eff57fa9f3a, also leading to SQE corruption. These include requests
> > besides the first in a linked chain, which are only issued once prior
> > requests complete. Requests waiting for a drain to complete would also
> > be initially issued async.
> >
> > While it's probably possible for io_uring_cmd_prep_setup() to check for
> > each of these cases and avoid deferring the SQE memcpy(), we feel it
> > might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
> > As discussed recently in regard to the ublk zero-copy patches[1], new
> > async paths added in the future could break these delicate assumptions.
>
> I don't think it's particularly delicate - did you manage to catch the
> case queueing a request for async execution where the sqe wasn't already
> copied? I did take a quick look after our out-of-band conversation, and
> the only missing bit I immediately spotted is using SQPOLL. But I don't
> think you're using that, right? And in any case, lifetime of SQEs with
> SQPOLL is the duration of the request anyway, so should not pose any
> risk of overwriting SQEs. But I do think the code should copy for that
> case too, just to avoid it being a harder-to-use thing than it should
> be.

Yes, even with the EAGAIN case fixed, nvme_validate_passthru_nsid() is
still catching the poisoned nsids. However, the log lines now come
from the application thread rather than the iou-wrk thread.
Indeed, we are not using SQPOLL. But we are using IOSQE_SQE_GROUP from
Ming's SQE group patch set[1]. Like IOSQE_IO_LINK, subsequent
operations in a group are issued only once the first completes. The
first operation in the group is a UBLK_IO_PROVIDE_IO_BUF from Ming's
other patch set[2], which should complete synchronously. The
completion should be processed in __io_submit_flush_completions() ->
io_free_batch_list() and queue the remaining grouped operations to be
issued in task work. And all the pending task work should be processed
during io_uring_enter()'s return to userspace.
But some NVMe passthru operations must be going async because they are
observing the poisoned values the application thread writes into the
io_uring SQEs after io_uring_enter() returns. We don't yet have an
explanation for how they end up being issued async; ftrace shows that
in the typical case, all the NVMe passthru operations in the group are
issued during task work before io_uring_enter() returns to userspace.
Perhaps a pending signal could short-circuit the task work processing?

Best,
Caleb

[1]: https://lore.kernel.org/io-uring/20240511001214.173711-2-ming.lei@redhat.com/T/
[2]: https://lore.kernel.org/linux-block/20241107110149.890530-1-ming.lei@redhat.com/T/

>
> The two patches here look good, I'll go ahead with those. That'll give
> us a bit of time to figure out where this missing copy is.
>
> --
> Jens Axboe

Caleb Sander Mateos Feb. 12, 2025, 9:58 p.m. UTC | #4

On Wed, Feb 12, 2025 at 1:02 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/12/25 1:55 PM, Jens Axboe wrote:
> > On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
> >> In our application issuing NVMe passthru commands, we have observed
> >> nvme_uring_cmd fields being corrupted between when userspace initializes
> >> the io_uring SQE and when nvme_uring_cmd_io() processes it.
> >>
> >> We hypothesized that the uring_cmd's were executing asynchronously after
> >> the io_uring_enter() syscall returned, yet were still reading the SQE in
> >> the userspace-mapped SQ. Since io_uring_enter() had already incremented
> >> the SQ head index, userspace reused the SQ slot for a new SQE once the
> >> SQ wrapped around to it.
> >>
> >> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
> >> index in userspace upon return from io_uring_enter(). By overwriting the
> >> nvme_uring_cmd nsid field with a known garbage value, we were able to
> >> trigger the err message in nvme_validate_passthru_nsid(), which logged
> >> the garbage nsid value.
> >>
> >> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
> >> SQE copying until it's needed"). With this commit reverted, the poisoned
> >> values in the SQEs are no longer seen by nvme_uring_cmd_io().
> >>
> >> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
> >> to async_data at prep time. The commit moved this memcpy() to 2 cases
> >> when the request goes async:
> >> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
> >> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
> >>
> >> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
> >> pointer is not updated to point to async_data after the memcpy(),
> >> as it correctly is in the REQ_F_FORCE_ASYNC case.
> >>
> >> However, uring_cmd's can be issued async in other cases not enumerated
> >> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
> >> besides the first in a linked chain, which are only issued once prior
> >> requests complete. Requests waiting for a drain to complete would also
> >> be initially issued async.
> >>
> >> While it's probably possible for io_uring_cmd_prep_setup() to check for
> >> each of these cases and avoid deferring the SQE memcpy(), we feel it
> >> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
> >> As discussed recently in regard to the ublk zero-copy patches[1], new
> >> async paths added in the future could break these delicate assumptions.
> >
> > I don't think it's particularly delicate - did you manage to catch the
> > case queueing a request for async execution where the sqe wasn't already
> > copied? I did take a quick look after our out-of-band conversation, and
> > the only missing bit I immediately spotted is using SQPOLL. But I don't
> > think you're using that, right? And in any case, lifetime of SQEs with
> > SQPOLL is the duration of the request anyway, so should not pose any
> > risk of overwriting SQEs. But I do think the code should copy for that
> > case too, just to avoid it being a harder-to-use thing than it should
> > be.
> >
> > The two patches here look good, I'll go ahead with those. That'll give
> > us a bit of time to figure out where this missing copy is.
>
> Can you try this on top of your 2 and see if you still hit anything odd?
>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index bcfca18395c4..15a8a67f556e 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -177,10 +177,13 @@ static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
>         ioucmd->sqe = cache->sqes;
>  }
>
> +#define SQE_COPY_FLAGS (REQ_F_FORCE_ASYNC|REQ_F_LINK|REQ_F_HARDLINK|REQ_F_IO_DRAIN)

I believe this still misses the last request in a linked chain, which
won't have REQ_F_LINK/REQ_F_HARDLINK set?
IOSQE_IO_DRAIN also causes subsequent operations to be issued async;
is REQ_F_IO_DRAIN set on those operations too?

Thanks,
Caleb

> +
>  static int io_uring_cmd_prep_setup(struct io_kiocb *req,
>                                    const struct io_uring_sqe *sqe)
>  {
>         struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> +       struct io_ring_ctx *ctx = req->ctx;
>         struct io_uring_cmd_data *cache;
>
>         cache = io_uring_alloc_async_data(&req->ctx->uring_cache, req);
> @@ -190,7 +193,7 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
>
>         ioucmd->sqe = sqe;
>         /* defer memcpy until we need it */
> -       if (unlikely(req->flags & REQ_F_FORCE_ASYNC))
> +       if (unlikely(ctx->flags & IORING_SETUP_SQPOLL || req->flags & SQE_COPY_FLAGS))
>                 io_uring_cmd_cache_sqes(req);
>         return 0;
>  }
>
> --
> Jens Axboe

Jens Axboe Feb. 12, 2025, 10:34 p.m. UTC | #5

On 2/12/25 2:58 PM, Caleb Sander wrote:
> On Wed, Feb 12, 2025 at 1:02?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/12/25 1:55 PM, Jens Axboe wrote:
>>> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
>>>> In our application issuing NVMe passthru commands, we have observed
>>>> nvme_uring_cmd fields being corrupted between when userspace initializes
>>>> the io_uring SQE and when nvme_uring_cmd_io() processes it.
>>>>
>>>> We hypothesized that the uring_cmd's were executing asynchronously after
>>>> the io_uring_enter() syscall returned, yet were still reading the SQE in
>>>> the userspace-mapped SQ. Since io_uring_enter() had already incremented
>>>> the SQ head index, userspace reused the SQ slot for a new SQE once the
>>>> SQ wrapped around to it.
>>>>
>>>> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
>>>> index in userspace upon return from io_uring_enter(). By overwriting the
>>>> nvme_uring_cmd nsid field with a known garbage value, we were able to
>>>> trigger the err message in nvme_validate_passthru_nsid(), which logged
>>>> the garbage nsid value.
>>>>
>>>> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
>>>> SQE copying until it's needed"). With this commit reverted, the poisoned
>>>> values in the SQEs are no longer seen by nvme_uring_cmd_io().
>>>>
>>>> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
>>>> to async_data at prep time. The commit moved this memcpy() to 2 cases
>>>> when the request goes async:
>>>> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
>>>> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
>>>>
>>>> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
>>>> pointer is not updated to point to async_data after the memcpy(),
>>>> as it correctly is in the REQ_F_FORCE_ASYNC case.
>>>>
>>>> However, uring_cmd's can be issued async in other cases not enumerated
>>>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
>>>> besides the first in a linked chain, which are only issued once prior
>>>> requests complete. Requests waiting for a drain to complete would also
>>>> be initially issued async.
>>>>
>>>> While it's probably possible for io_uring_cmd_prep_setup() to check for
>>>> each of these cases and avoid deferring the SQE memcpy(), we feel it
>>>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
>>>> As discussed recently in regard to the ublk zero-copy patches[1], new
>>>> async paths added in the future could break these delicate assumptions.
>>>
>>> I don't think it's particularly delicate - did you manage to catch the
>>> case queueing a request for async execution where the sqe wasn't already
>>> copied? I did take a quick look after our out-of-band conversation, and
>>> the only missing bit I immediately spotted is using SQPOLL. But I don't
>>> think you're using that, right? And in any case, lifetime of SQEs with
>>> SQPOLL is the duration of the request anyway, so should not pose any
>>> risk of overwriting SQEs. But I do think the code should copy for that
>>> case too, just to avoid it being a harder-to-use thing than it should
>>> be.
>>>
>>> The two patches here look good, I'll go ahead with those. That'll give
>>> us a bit of time to figure out where this missing copy is.
>>
>> Can you try this on top of your 2 and see if you still hit anything odd?
>>
>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>> index bcfca18395c4..15a8a67f556e 100644
>> --- a/io_uring/uring_cmd.c
>> +++ b/io_uring/uring_cmd.c
>> @@ -177,10 +177,13 @@ static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
>>         ioucmd->sqe = cache->sqes;
>>  }
>>
>> +#define SQE_COPY_FLAGS (REQ_F_FORCE_ASYNC|REQ_F_LINK|REQ_F_HARDLINK|REQ_F_IO_DRAIN)
> 
> I believe this still misses the last request in a linked chain, which
> won't have REQ_F_LINK/REQ_F_HARDLINK set?

Yeah good point, I think we should just be looking at link->head instead
to see if the request is a link, or part of a linked submission. That
may overshoot a bit, but that should be fine - it'll be a false
positive. Alternatively, we can still check link flags and compare with
link->last instead...

But the whole thing still feels a bit iffy. The whole uring_cmd setup
with an SQE that's sometimes the actual SQE, and sometimes a copy when
needed, does not fill me with joy.

> IOSQE_IO_DRAIN also causes subsequent operations to be issued async;
> is REQ_F_IO_DRAIN set on those operations too?

The first 8 flags are directly set in the io_kiocb at init time. So if
IOSQE_IO_DRAIN is set, then REQ_F_IO_DRAIN will be set as they are one
and the same.

diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index bcfca18395c4..9e60b5bb5a60 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -177,10 +177,14 @@ static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
 	ioucmd->sqe = cache->sqes;
 }
 
+#define SQE_COPY_FLAGS	(REQ_F_FORCE_ASYNC|REQ_F_IO_DRAIN)
+
 static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 				   const struct io_uring_sqe *sqe)
 {
 	struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_submit_link *link = &ctx->submit_state.link;
 	struct io_uring_cmd_data *cache;
 
 	cache = io_uring_alloc_async_data(&req->ctx->uring_cache, req);
@@ -190,7 +194,8 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
 
 	ioucmd->sqe = sqe;
 	/* defer memcpy until we need it */
-	if (unlikely(req->flags & REQ_F_FORCE_ASYNC))
+	if (unlikely(ctx->flags & IORING_SETUP_SQPOLL ||
+		     req->flags & SQE_COPY_FLAGS || link->head))
 		io_uring_cmd_cache_sqes(req);
 	return 0;
 }

Jens Axboe Feb. 12, 2025, 10:39 p.m. UTC | #6

On 2/12/25 2:54 PM, Caleb Sander wrote:
> On Wed, Feb 12, 2025 at 12:55?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
>>> In our application issuing NVMe passthru commands, we have observed
>>> nvme_uring_cmd fields being corrupted between when userspace initializes
>>> the io_uring SQE and when nvme_uring_cmd_io() processes it.
>>>
>>> We hypothesized that the uring_cmd's were executing asynchronously after
>>> the io_uring_enter() syscall returned, yet were still reading the SQE in
>>> the userspace-mapped SQ. Since io_uring_enter() had already incremented
>>> the SQ head index, userspace reused the SQ slot for a new SQE once the
>>> SQ wrapped around to it.
>>>
>>> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
>>> index in userspace upon return from io_uring_enter(). By overwriting the
>>> nvme_uring_cmd nsid field with a known garbage value, we were able to
>>> trigger the err message in nvme_validate_passthru_nsid(), which logged
>>> the garbage nsid value.
>>>
>>> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
>>> SQE copying until it's needed"). With this commit reverted, the poisoned
>>> values in the SQEs are no longer seen by nvme_uring_cmd_io().
>>>
>>> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
>>> to async_data at prep time. The commit moved this memcpy() to 2 cases
>>> when the request goes async:
>>> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
>>> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
>>>
>>> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
>>> pointer is not updated to point to async_data after the memcpy(),
>>> as it correctly is in the REQ_F_FORCE_ASYNC case.
>>>
>>> However, uring_cmd's can be issued async in other cases not enumerated
>>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
>>> besides the first in a linked chain, which are only issued once prior
>>> requests complete. Requests waiting for a drain to complete would also
>>> be initially issued async.
>>>
>>> While it's probably possible for io_uring_cmd_prep_setup() to check for
>>> each of these cases and avoid deferring the SQE memcpy(), we feel it
>>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
>>> As discussed recently in regard to the ublk zero-copy patches[1], new
>>> async paths added in the future could break these delicate assumptions.
>>
>> I don't think it's particularly delicate - did you manage to catch the
>> case queueing a request for async execution where the sqe wasn't already
>> copied? I did take a quick look after our out-of-band conversation, and
>> the only missing bit I immediately spotted is using SQPOLL. But I don't
>> think you're using that, right? And in any case, lifetime of SQEs with
>> SQPOLL is the duration of the request anyway, so should not pose any
>> risk of overwriting SQEs. But I do think the code should copy for that
>> case too, just to avoid it being a harder-to-use thing than it should
>> be.
> 
> Yes, even with the EAGAIN case fixed, nvme_validate_passthru_nsid() is
> still catching the poisoned nsids. However, the log lines now come
> from the application thread rather than the iou-wrk thread.
> Indeed, we are not using SQPOLL. But we are using IOSQE_SQE_GROUP from
> Ming's SQE group patch set[1]. Like IOSQE_IO_LINK, subsequent
> operations in a group are issued only once the first completes. The
> first operation in the group is a UBLK_IO_PROVIDE_IO_BUF from Ming's
> other patch set[2], which should complete synchronously. The
> completion should be processed in __io_submit_flush_completions() ->
> io_free_batch_list() and queue the remaining grouped operations to be
> issued in task work. And all the pending task work should be processed
> during io_uring_enter()'s return to userspace.
> But some NVMe passthru operations must be going async because they are
> observing the poisoned values the application thread writes into the
> io_uring SQEs after io_uring_enter() returns. We don't yet have an
> explanation for how they end up being issued async; ftrace shows that
> in the typical case, all the NVMe passthru operations in the group are
> issued during task work before io_uring_enter() returns to userspace.
> Perhaps a pending signal could short-circuit the task work processing?

I think it'd be a good idea to move testing to the newer patchset, as
the sqe grouping was never really fully vetted or included. It's not
impossible that it's a side effect of that, even though it very well
cold be a generic issue that exists already without those patches. In
any case, that'd move us all to a closer-to-upstream test base, without
having older patches that aren't upstream in the mix.

How are the rings setup for what you're testing? I'm specifically
thinking of the task_work related flags. Is it using DEFER_TASKRUN, or
is it using the older signal style task_work?

Caleb Sander Mateos Feb. 12, 2025, 10:52 p.m. UTC | #7

On Wed, Feb 12, 2025 at 2:34 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/12/25 2:58 PM, Caleb Sander wrote:
> > On Wed, Feb 12, 2025 at 1:02?PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 2/12/25 1:55 PM, Jens Axboe wrote:
> >>> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
> >>>> In our application issuing NVMe passthru commands, we have observed
> >>>> nvme_uring_cmd fields being corrupted between when userspace initializes
> >>>> the io_uring SQE and when nvme_uring_cmd_io() processes it.
> >>>>
> >>>> We hypothesized that the uring_cmd's were executing asynchronously after
> >>>> the io_uring_enter() syscall returned, yet were still reading the SQE in
> >>>> the userspace-mapped SQ. Since io_uring_enter() had already incremented
> >>>> the SQ head index, userspace reused the SQ slot for a new SQE once the
> >>>> SQ wrapped around to it.
> >>>>
> >>>> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
> >>>> index in userspace upon return from io_uring_enter(). By overwriting the
> >>>> nvme_uring_cmd nsid field with a known garbage value, we were able to
> >>>> trigger the err message in nvme_validate_passthru_nsid(), which logged
> >>>> the garbage nsid value.
> >>>>
> >>>> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
> >>>> SQE copying until it's needed"). With this commit reverted, the poisoned
> >>>> values in the SQEs are no longer seen by nvme_uring_cmd_io().
> >>>>
> >>>> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
> >>>> to async_data at prep time. The commit moved this memcpy() to 2 cases
> >>>> when the request goes async:
> >>>> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
> >>>> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
> >>>>
> >>>> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
> >>>> pointer is not updated to point to async_data after the memcpy(),
> >>>> as it correctly is in the REQ_F_FORCE_ASYNC case.
> >>>>
> >>>> However, uring_cmd's can be issued async in other cases not enumerated
> >>>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
> >>>> besides the first in a linked chain, which are only issued once prior
> >>>> requests complete. Requests waiting for a drain to complete would also
> >>>> be initially issued async.
> >>>>
> >>>> While it's probably possible for io_uring_cmd_prep_setup() to check for
> >>>> each of these cases and avoid deferring the SQE memcpy(), we feel it
> >>>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
> >>>> As discussed recently in regard to the ublk zero-copy patches[1], new
> >>>> async paths added in the future could break these delicate assumptions.
> >>>
> >>> I don't think it's particularly delicate - did you manage to catch the
> >>> case queueing a request for async execution where the sqe wasn't already
> >>> copied? I did take a quick look after our out-of-band conversation, and
> >>> the only missing bit I immediately spotted is using SQPOLL. But I don't
> >>> think you're using that, right? And in any case, lifetime of SQEs with
> >>> SQPOLL is the duration of the request anyway, so should not pose any
> >>> risk of overwriting SQEs. But I do think the code should copy for that
> >>> case too, just to avoid it being a harder-to-use thing than it should
> >>> be.
> >>>
> >>> The two patches here look good, I'll go ahead with those. That'll give
> >>> us a bit of time to figure out where this missing copy is.
> >>
> >> Can you try this on top of your 2 and see if you still hit anything odd?
> >>
> >> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> >> index bcfca18395c4..15a8a67f556e 100644
> >> --- a/io_uring/uring_cmd.c
> >> +++ b/io_uring/uring_cmd.c
> >> @@ -177,10 +177,13 @@ static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
> >>         ioucmd->sqe = cache->sqes;
> >>  }
> >>
> >> +#define SQE_COPY_FLAGS (REQ_F_FORCE_ASYNC|REQ_F_LINK|REQ_F_HARDLINK|REQ_F_IO_DRAIN)
> >
> > I believe this still misses the last request in a linked chain, which
> > won't have REQ_F_LINK/REQ_F_HARDLINK set?
>
> Yeah good point, I think we should just be looking at link->head instead
> to see if the request is a link, or part of a linked submission. That
> may overshoot a bit, but that should be fine - it'll be a false
> positive. Alternatively, we can still check link flags and compare with
> link->last instead...

Yeah, checking link.head sounds good to me. I don't think it should
catch any extra requests. link.head will still be NULL when ->prep()
is called on the first request of the chain, since it is set in
io_submit_sqe() after io_init_req() (which calls ->prep()).

>
> But the whole thing still feels a bit iffy. The whole uring_cmd setup
> with an SQE that's sometimes the actual SQE, and sometimes a copy when
> needed, does not fill me with joy.
>
> > IOSQE_IO_DRAIN also causes subsequent operations to be issued async;
> > is REQ_F_IO_DRAIN set on those operations too?
>
> The first 8 flags are directly set in the io_kiocb at init time. So if
> IOSQE_IO_DRAIN is set, then REQ_F_IO_DRAIN will be set as they are one
> and the same.

Sorry, I meant that a request marked IOSQE_IO_DRAIN/REQ_F_IO_DRAIN
potentially causes both that request and any following requests to be
submitted async. The first request waits for any outstanding requests
to complete, and the following requests wait for the request marked
IOSQE_IO_DRAIN to complete. I know REQ_F_IO_DRAIN = IOSQE_IO_DRAIN is
already set on the first request, but will it also be set on any
following requests that have to wait on that one?

>
> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
> index bcfca18395c4..9e60b5bb5a60 100644
> --- a/io_uring/uring_cmd.c
> +++ b/io_uring/uring_cmd.c
> @@ -177,10 +177,14 @@ static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
>         ioucmd->sqe = cache->sqes;
>  }
>
> +#define SQE_COPY_FLAGS (REQ_F_FORCE_ASYNC|REQ_F_IO_DRAIN)
> +
>  static int io_uring_cmd_prep_setup(struct io_kiocb *req,
>                                    const struct io_uring_sqe *sqe)
>  {
>         struct io_uring_cmd *ioucmd = io_kiocb_to_cmd(req, struct io_uring_cmd);
> +       struct io_ring_ctx *ctx = req->ctx;
> +       struct io_submit_link *link = &ctx->submit_state.link;
>         struct io_uring_cmd_data *cache;
>
>         cache = io_uring_alloc_async_data(&req->ctx->uring_cache, req);
> @@ -190,7 +194,8 @@ static int io_uring_cmd_prep_setup(struct io_kiocb *req,
>
>         ioucmd->sqe = sqe;
>         /* defer memcpy until we need it */
> -       if (unlikely(req->flags & REQ_F_FORCE_ASYNC))
> +       if (unlikely(ctx->flags & IORING_SETUP_SQPOLL ||
> +                    req->flags & SQE_COPY_FLAGS || link->head))
>                 io_uring_cmd_cache_sqes(req);
>         return 0;
>  }
>
> --
> Jens Axboe

Jens Axboe Feb. 12, 2025, 10:56 p.m. UTC | #8

On 2/12/25 3:52 PM, Caleb Sander wrote:
> On Wed, Feb 12, 2025 at 2:34?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 2/12/25 2:58 PM, Caleb Sander wrote:
>>> On Wed, Feb 12, 2025 at 1:02?PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 2/12/25 1:55 PM, Jens Axboe wrote:
>>>>> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
>>>>>> In our application issuing NVMe passthru commands, we have observed
>>>>>> nvme_uring_cmd fields being corrupted between when userspace initializes
>>>>>> the io_uring SQE and when nvme_uring_cmd_io() processes it.
>>>>>>
>>>>>> We hypothesized that the uring_cmd's were executing asynchronously after
>>>>>> the io_uring_enter() syscall returned, yet were still reading the SQE in
>>>>>> the userspace-mapped SQ. Since io_uring_enter() had already incremented
>>>>>> the SQ head index, userspace reused the SQ slot for a new SQE once the
>>>>>> SQ wrapped around to it.
>>>>>>
>>>>>> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
>>>>>> index in userspace upon return from io_uring_enter(). By overwriting the
>>>>>> nvme_uring_cmd nsid field with a known garbage value, we were able to
>>>>>> trigger the err message in nvme_validate_passthru_nsid(), which logged
>>>>>> the garbage nsid value.
>>>>>>
>>>>>> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
>>>>>> SQE copying until it's needed"). With this commit reverted, the poisoned
>>>>>> values in the SQEs are no longer seen by nvme_uring_cmd_io().
>>>>>>
>>>>>> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
>>>>>> to async_data at prep time. The commit moved this memcpy() to 2 cases
>>>>>> when the request goes async:
>>>>>> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
>>>>>> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
>>>>>>
>>>>>> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
>>>>>> pointer is not updated to point to async_data after the memcpy(),
>>>>>> as it correctly is in the REQ_F_FORCE_ASYNC case.
>>>>>>
>>>>>> However, uring_cmd's can be issued async in other cases not enumerated
>>>>>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
>>>>>> besides the first in a linked chain, which are only issued once prior
>>>>>> requests complete. Requests waiting for a drain to complete would also
>>>>>> be initially issued async.
>>>>>>
>>>>>> While it's probably possible for io_uring_cmd_prep_setup() to check for
>>>>>> each of these cases and avoid deferring the SQE memcpy(), we feel it
>>>>>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
>>>>>> As discussed recently in regard to the ublk zero-copy patches[1], new
>>>>>> async paths added in the future could break these delicate assumptions.
>>>>>
>>>>> I don't think it's particularly delicate - did you manage to catch the
>>>>> case queueing a request for async execution where the sqe wasn't already
>>>>> copied? I did take a quick look after our out-of-band conversation, and
>>>>> the only missing bit I immediately spotted is using SQPOLL. But I don't
>>>>> think you're using that, right? And in any case, lifetime of SQEs with
>>>>> SQPOLL is the duration of the request anyway, so should not pose any
>>>>> risk of overwriting SQEs. But I do think the code should copy for that
>>>>> case too, just to avoid it being a harder-to-use thing than it should
>>>>> be.
>>>>>
>>>>> The two patches here look good, I'll go ahead with those. That'll give
>>>>> us a bit of time to figure out where this missing copy is.
>>>>
>>>> Can you try this on top of your 2 and see if you still hit anything odd?
>>>>
>>>> diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
>>>> index bcfca18395c4..15a8a67f556e 100644
>>>> --- a/io_uring/uring_cmd.c
>>>> +++ b/io_uring/uring_cmd.c
>>>> @@ -177,10 +177,13 @@ static void io_uring_cmd_cache_sqes(struct io_kiocb *req)
>>>>         ioucmd->sqe = cache->sqes;
>>>>  }
>>>>
>>>> +#define SQE_COPY_FLAGS (REQ_F_FORCE_ASYNC|REQ_F_LINK|REQ_F_HARDLINK|REQ_F_IO_DRAIN)
>>>
>>> I believe this still misses the last request in a linked chain, which
>>> won't have REQ_F_LINK/REQ_F_HARDLINK set?
>>
>> Yeah good point, I think we should just be looking at link->head instead
>> to see if the request is a link, or part of a linked submission. That
>> may overshoot a bit, but that should be fine - it'll be a false
>> positive. Alternatively, we can still check link flags and compare with
>> link->last instead...
> 
> Yeah, checking link.head sounds good to me. I don't think it should
> catch any extra requests. link.head will still be NULL when ->prep()
> is called on the first request of the chain, since it is set in
> io_submit_sqe() after io_init_req() (which calls ->prep()).

It'll catch potentially extra ones in the sense that you can submit a
link chain with other requests that aren't related.

>> But the whole thing still feels a bit iffy. The whole uring_cmd setup
>> with an SQE that's sometimes the actual SQE, and sometimes a copy when
>> needed, does not fill me with joy.
>>
>>> IOSQE_IO_DRAIN also causes subsequent operations to be issued async;
>>> is REQ_F_IO_DRAIN set on those operations too?
>>
>> The first 8 flags are directly set in the io_kiocb at init time. So if
>> IOSQE_IO_DRAIN is set, then REQ_F_IO_DRAIN will be set as they are one
>> and the same.
> 
> Sorry, I meant that a request marked IOSQE_IO_DRAIN/REQ_F_IO_DRAIN
> potentially causes both that request and any following requests to be
> submitted async. The first request waits for any outstanding requests
> to complete, and the following requests wait for the request marked
> IOSQE_IO_DRAIN to complete. I know REQ_F_IO_DRAIN = IOSQE_IO_DRAIN is
> already set on the first request, but will it also be set on any
> following requests that have to wait on that one?

Yeah good point. All of this works nicely in io_uring as-is, because
post ->prep() everything _must_ be stable. But uring_cmd violates that,
which is why we need to jump through hoops here.

Might in fact just be better to know exactly how much of the SQE that
needs copying here, and have it reside in stable storage post prep. A
bit tricky with the various sub-types of requests that can be in there
however.

Caleb Sander Mateos Feb. 12, 2025, 11:07 p.m. UTC | #9

On Wed, Feb 12, 2025 at 2:39 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 2/12/25 2:54 PM, Caleb Sander wrote:
> > On Wed, Feb 12, 2025 at 12:55?PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
> >>> In our application issuing NVMe passthru commands, we have observed
> >>> nvme_uring_cmd fields being corrupted between when userspace initializes
> >>> the io_uring SQE and when nvme_uring_cmd_io() processes it.
> >>>
> >>> We hypothesized that the uring_cmd's were executing asynchronously after
> >>> the io_uring_enter() syscall returned, yet were still reading the SQE in
> >>> the userspace-mapped SQ. Since io_uring_enter() had already incremented
> >>> the SQ head index, userspace reused the SQ slot for a new SQE once the
> >>> SQ wrapped around to it.
> >>>
> >>> We confirmed this hypothesis by "poisoning" all SQEs up to the SQ head
> >>> index in userspace upon return from io_uring_enter(). By overwriting the
> >>> nvme_uring_cmd nsid field with a known garbage value, we were able to
> >>> trigger the err message in nvme_validate_passthru_nsid(), which logged
> >>> the garbage nsid value.
> >>>
> >>> The issue is caused by commit 5eff57fa9f3a ("io_uring/uring_cmd: defer
> >>> SQE copying until it's needed"). With this commit reverted, the poisoned
> >>> values in the SQEs are no longer seen by nvme_uring_cmd_io().
> >>>
> >>> Prior to the commit, each uring_cmd SQE was unconditionally memcpy()ed
> >>> to async_data at prep time. The commit moved this memcpy() to 2 cases
> >>> when the request goes async:
> >>> - If REQ_F_FORCE_ASYNC is set to force the initial issue to go async
> >>> - If ->uring_cmd() returns -EAGAIN in the initial non-blocking issue
> >>>
> >>> This patch set fixes a bug in the EAGAIN case where the uring_cmd's sqe
> >>> pointer is not updated to point to async_data after the memcpy(),
> >>> as it correctly is in the REQ_F_FORCE_ASYNC case.
> >>>
> >>> However, uring_cmd's can be issued async in other cases not enumerated
> >>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
> >>> besides the first in a linked chain, which are only issued once prior
> >>> requests complete. Requests waiting for a drain to complete would also
> >>> be initially issued async.
> >>>
> >>> While it's probably possible for io_uring_cmd_prep_setup() to check for
> >>> each of these cases and avoid deferring the SQE memcpy(), we feel it
> >>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
> >>> As discussed recently in regard to the ublk zero-copy patches[1], new
> >>> async paths added in the future could break these delicate assumptions.
> >>
> >> I don't think it's particularly delicate - did you manage to catch the
> >> case queueing a request for async execution where the sqe wasn't already
> >> copied? I did take a quick look after our out-of-band conversation, and
> >> the only missing bit I immediately spotted is using SQPOLL. But I don't
> >> think you're using that, right? And in any case, lifetime of SQEs with
> >> SQPOLL is the duration of the request anyway, so should not pose any
> >> risk of overwriting SQEs. But I do think the code should copy for that
> >> case too, just to avoid it being a harder-to-use thing than it should
> >> be.
> >
> > Yes, even with the EAGAIN case fixed, nvme_validate_passthru_nsid() is
> > still catching the poisoned nsids. However, the log lines now come
> > from the application thread rather than the iou-wrk thread.
> > Indeed, we are not using SQPOLL. But we are using IOSQE_SQE_GROUP from
> > Ming's SQE group patch set[1]. Like IOSQE_IO_LINK, subsequent
> > operations in a group are issued only once the first completes. The
> > first operation in the group is a UBLK_IO_PROVIDE_IO_BUF from Ming's
> > other patch set[2], which should complete synchronously. The
> > completion should be processed in __io_submit_flush_completions() ->
> > io_free_batch_list() and queue the remaining grouped operations to be
> > issued in task work. And all the pending task work should be processed
> > during io_uring_enter()'s return to userspace.
> > But some NVMe passthru operations must be going async because they are
> > observing the poisoned values the application thread writes into the
> > io_uring SQEs after io_uring_enter() returns. We don't yet have an
> > explanation for how they end up being issued async; ftrace shows that
> > in the typical case, all the NVMe passthru operations in the group are
> > issued during task work before io_uring_enter() returns to userspace.
> > Perhaps a pending signal could short-circuit the task work processing?
>
> I think it'd be a good idea to move testing to the newer patchset, as
> the sqe grouping was never really fully vetted or included. It's not
> impossible that it's a side effect of that, even though it very well
> cold be a generic issue that exists already without those patches. In
> any case, that'd move us all to a closer-to-upstream test base, without
> having older patches that aren't upstream in the mix.

Yes, we completely agree. We are working on incorporating Keith's
patchset now. It looks like there is still an open question about
whether userspace will need to enforce ordering between the requests
(either using linked operations or waiting for completions before
submitting the subsequent operations).
Ming's "grouped" requests seem to be based on linked requests, which
is why I suspect we would hit the same SQE corruption upstream if we
were using linked requests.

>
> How are the rings setup for what you're testing? I'm specifically
> thinking of the task_work related flags. Is it using DEFER_TASKRUN, or
> is it using the older signal style task_work?

The io_uring is set up with only the IORING_SETUP_SQE128 and
IORING_SETUP_CQE32 flags, so they are probably using the older task
work implementation. There's no particular reason not to use
IORING_SETUP_DEFER_TASKRUN, we just haven't tried it out yet.

Thanks,
Caleb

>
> --
> Jens Axboe

Keith Busch Feb. 12, 2025, 11:21 p.m. UTC | #10

On Wed, Feb 12, 2025 at 03:07:30PM -0800, Caleb Sander Mateos wrote:
> 
> Yes, we completely agree. We are working on incorporating Keith's
> patchset now. It looks like there is still an open question about
> whether userspace will need to enforce ordering between the requests
> (either using linked operations or waiting for completions before
> submitting the subsequent operations).

In its current form, my series depends on you *not* using linked
requests. I didn't think it would be a problem as it follows an existing
pattern from the IORING_OP_FILES_UPDATE operation. That has to complete
in its entirety before prepping any subsequent commands that reference
the index, and using links would get the wrong results.

Caleb Sander Mateos Feb. 12, 2025, 11:46 p.m. UTC | #11

On Wed, Feb 12, 2025 at 3:21 PM Keith Busch <kbusch@kernel.org> wrote:
>
> On Wed, Feb 12, 2025 at 03:07:30PM -0800, Caleb Sander Mateos wrote:
> >
> > Yes, we completely agree. We are working on incorporating Keith's
> > patchset now. It looks like there is still an open question about
> > whether userspace will need to enforce ordering between the requests
> > (either using linked operations or waiting for completions before
> > submitting the subsequent operations).
>
> In its current form, my series depends on you *not* using linked
> requests. I didn't think it would be a problem as it follows an existing
> pattern from the IORING_OP_FILES_UPDATE operation. That has to complete
> in its entirety before prepping any subsequent commands that reference
> the index, and using links would get the wrong results.

As implementers of a ublk server, we would also prefer the current
interface in your patch series! Having to explicitly order the
requests would definitely make the interface more cumbersome and
probably less performant. I was just saying that Ming and Pavel had
raised some concerns about guaranteeing the order in which io_uring
issues SQEs. IORING_OP_FILES_UPDATE is a good analogy. Do we have any
examples of how applications use it? Are they waiting for a
completion, linking it, or relying on io_uring to issue it
synchronously?

Jens Axboe Feb. 12, 2025, 11:55 p.m. UTC | #12

On 2/12/25 4:46 PM, Caleb Sander Mateos wrote:
> On Wed, Feb 12, 2025 at 3:21?PM Keith Busch <kbusch@kernel.org> wrote:
>>
>> On Wed, Feb 12, 2025 at 03:07:30PM -0800, Caleb Sander Mateos wrote:
>>>
>>> Yes, we completely agree. We are working on incorporating Keith's
>>> patchset now. It looks like there is still an open question about
>>> whether userspace will need to enforce ordering between the requests
>>> (either using linked operations or waiting for completions before
>>> submitting the subsequent operations).
>>
>> In its current form, my series depends on you *not* using linked
>> requests. I didn't think it would be a problem as it follows an existing
>> pattern from the IORING_OP_FILES_UPDATE operation. That has to complete
>> in its entirety before prepping any subsequent commands that reference
>> the index, and using links would get the wrong results.
> 
> As implementers of a ublk server, we would also prefer the current
> interface in your patch series! Having to explicitly order the
> requests would definitely make the interface more cumbersome and
> probably less performant. I was just saying that Ming and Pavel had
> raised some concerns about guaranteeing the order in which io_uring
> issues SQEs. IORING_OP_FILES_UPDATE is a good analogy. Do we have any
> examples of how applications use it? Are they waiting for a
> completion, linking it, or relying on io_uring to issue it
> synchronously?

Yes it's a good similar example - and I don't think it matters much
how it's used. If you rely on its completion before making progress AND
you set flags like IOSQE_ASYNC or LINK/DRAIN that will make it go async
on purposes, then yes you'd need to similarly link dependents on it. If
you don't set anything that forces it to go async, then it WILL complete
inline - there's nothing in its implementation that would cause it
needing to retry. Any failure would be fatal.

This is very much the same thing with the buf update / insertion, it'll
behave in exactly the same way. You could argue "but you need to handle
failures" and that is true. But if the failure case is that your
consumer of the buffer fails with an import failure, then you can just
as well handle that as you can the request getting failed with
-ECANCELED because your dependent link failed.

Pavel Begunkov Feb. 13, 2025, 2:48 p.m. UTC | #13

On 2/12/25 20:55, Jens Axboe wrote:
> On 2/12/25 1:45 PM, Caleb Sander Mateos wrote:
...
>> However, uring_cmd's can be issued async in other cases not enumerated
>> by 5eff57fa9f3a, also leading to SQE corruption. These include requests
>> besides the first in a linked chain, which are only issued once prior
>> requests complete. Requests waiting for a drain to complete would also
>> be initially issued async.
>>
>> While it's probably possible for io_uring_cmd_prep_setup() to check for
>> each of these cases and avoid deferring the SQE memcpy(), we feel it
>> might be safer to revert 5eff57fa9f3a to avoid the corruption risk.
>> As discussed recently in regard to the ublk zero-copy patches[1], new
>> async paths added in the future could break these delicate assumptions.
> 
> I don't think it's particularly delicate - did you manage to catch the
> case queueing a request for async execution where the sqe wasn't already
> copied? I did take a quick look after our out-of-band conversation, and
> the only missing bit I immediately spotted is using SQPOLL. But I don't
> think you're using that, right? And in any case, lifetime of SQEs with
> SQPOLL is the duration of the request anyway, so should not pose any

Not really, it's not bound to the syscall, but the user could always
check or wait (i.e. IORING_ENTER_SQ_WAIT) for empty sqes and reuse
them before completion.

Pavel Begunkov Feb. 13, 2025, 4:11 p.m. UTC | #14

On 2/12/25 23:46, Caleb Sander Mateos wrote:
> On Wed, Feb 12, 2025 at 3:21 PM Keith Busch <kbusch@kernel.org> wrote:
>>
>> On Wed, Feb 12, 2025 at 03:07:30PM -0800, Caleb Sander Mateos wrote:
>>>
>>> Yes, we completely agree. We are working on incorporating Keith's
>>> patchset now. It looks like there is still an open question about
>>> whether userspace will need to enforce ordering between the requests
>>> (either using linked operations or waiting for completions before
>>> submitting the subsequent operations).
>>
>> In its current form, my series depends on you *not* using linked
>> requests. I didn't think it would be a problem as it follows an existing
>> pattern from the IORING_OP_FILES_UPDATE operation. That has to complete
>> in its entirety before prepping any subsequent commands that reference
>> the index, and using links would get the wrong results.
> 
> As implementers of a ublk server, we would also prefer the current
> interface in your patch series! Having to explicitly order the
> requests would definitely make the interface more cumbersome and
> probably less performant. I was just saying that Ming and Pavel had
> raised some concerns about guaranteeing the order in which io_uring
> issues SQEs. IORING_OP_FILES_UPDATE is a good analogy. Do we have any
> examples of how applications use it? Are they waiting for a
> completion, linking it, or relying on io_uring to issue it
> synchronously?

With files you either handle it in user space by waiting for
completion or link them. The difference with buffers is that
linking wouldn't currently work for buffers, but we can change
that.

Pavel Begunkov Feb. 13, 2025, 4:28 p.m. UTC | #15

On 2/12/25 23:55, Jens Axboe wrote:
> On 2/12/25 4:46 PM, Caleb Sander Mateos wrote:
>> On Wed, Feb 12, 2025 at 3:21?PM Keith Busch <kbusch@kernel.org> wrote:
>>>
>>> On Wed, Feb 12, 2025 at 03:07:30PM -0800, Caleb Sander Mateos wrote:
>>>>
>>>> Yes, we completely agree. We are working on incorporating Keith's
>>>> patchset now. It looks like there is still an open question about
>>>> whether userspace will need to enforce ordering between the requests
>>>> (either using linked operations or waiting for completions before
>>>> submitting the subsequent operations).
>>>
>>> In its current form, my series depends on you *not* using linked
>>> requests. I didn't think it would be a problem as it follows an existing
>>> pattern from the IORING_OP_FILES_UPDATE operation. That has to complete
>>> in its entirety before prepping any subsequent commands that reference
>>> the index, and using links would get the wrong results.
>>
>> As implementers of a ublk server, we would also prefer the current
>> interface in your patch series! Having to explicitly order the
>> requests would definitely make the interface more cumbersome and
>> probably less performant. I was just saying that Ming and Pavel had
>> raised some concerns about guaranteeing the order in which io_uring
>> issues SQEs. IORING_OP_FILES_UPDATE is a good analogy. Do we have any
>> examples of how applications use it? Are they waiting for a
>> completion, linking it, or relying on io_uring to issue it
>> synchronously?
> 
> Yes it's a good similar example - and I don't think it matters much
> how it's used. If you rely on its completion before making progress AND
> you set flags like IOSQE_ASYNC or LINK/DRAIN that will make it go async
> on purposes, then yes you'd need to similarly link dependents on it. If
> you don't set anything that forces it to go async, then it WILL complete
> inline - there's nothing in its implementation that would cause it
> needing to retry. Any failure would be fatal.
> 
> This is very much the same thing with the buf update / insertion, it'll
> behave in exactly the same way. You could argue "but you need to handle
> failures" and that is true. But if the failure case is that your
> consumer of the buffer fails with an import failure, then you can just
> as well handle that as you can the request getting failed with
> -ECANCELED because your dependent link failed.

That should be fine as long as the kernel reserves the ability
to change the order, i.e. in case of an error the implementation
will need to fall back to proper ordering like linking. It might
need a flag from io_uring whether you should ever attempt the
unordered version not to get a perf regression down the road
from submitting it twice.

For the unregistration though it'd likely need proper ordering
with the IO request in case it fails.

Jens Axboe Feb. 13, 2025, 6:13 p.m. UTC | #16

On Wed, 12 Feb 2025 13:45:44 -0700, Caleb Sander Mateos wrote:
> In our application issuing NVMe passthru commands, we have observed
> nvme_uring_cmd fields being corrupted between when userspace initializes
> the io_uring SQE and when nvme_uring_cmd_io() processes it.
> 
> We hypothesized that the uring_cmd's were executing asynchronously after
> the io_uring_enter() syscall returned, yet were still reading the SQE in
> the userspace-mapped SQ. Since io_uring_enter() had already incremented
> the SQ head index, userspace reused the SQ slot for a new SQE once the
> SQ wrapped around to it.
> 
> [...]

Applied, thanks!

[1/2] io_uring/uring_cmd: don't assume io_uring_cmd_data layout
      commit: 34cae91215c6f65bed2a124fb9283da6ec0b8dd9
[2/2] io_uring/uring_cmd: switch sqe to async_data on EAGAIN
      commit: e663da62ba8672aaa66843f1af8b20e3bb1a0515

Best regards,

[0/2] uring_cmd SQE corruptions

Message

Comments