Message ID: 1571908688-22488-1-git-send-email-bijan.mottahedeh@oracle.com (mailing list archive)
Series: io_uring: examine request result only after completion
On 10/24/19 3:18 AM, Bijan Mottahedeh wrote:
> Running an fio test consistently crashes the kernel with the trace included
> below. The root cause seems to be the code in __io_submit_sqe() that
> checks the result of a request for -EAGAIN in polled mode, without
> ensuring first that the request has completed:
>
>     if (ctx->flags & IORING_SETUP_IOPOLL) {
>             if (req->result == -EAGAIN)
>                     return -EAGAIN;

I'm a little confused, because we should be holding the submission
reference to the request still at this point. So how is it going away?
I must be missing something...

> The request will be immediately resubmitted in io_sq_wq_submit_work(),
> potentially before the first submission has completed. This creates
> a race where the original completion may set REQ_F_IOPOLL_COMPLETED in
> a freed submission request, overwriting the poisoned bits and causing
> the panic below.
>
>     do {
>             ret = __io_submit_sqe(ctx, req, s, false);
>             /*
>              * We can get EAGAIN for polled IO even though
>              * we're forcing a sync submission from here,
>              * since we can't wait for request slots on the
>              * block side.
>              */
>             if (ret != -EAGAIN)
>                     break;
>             cond_resched();
>     } while (1);
>
> The suggested fix is to move a submitted request to the poll list
> unconditionally in polled mode. The request can then be retried if
> necessary once the original submission has indeed completed.
>
> This bug raises an issue, however: REQ_F_IOPOLL_COMPLETED is set in
> io_complete_rw_iopoll() from interrupt context, yet NVMe polled queues
> are not supposed to generate interrupts, so it is not clear what the
> reason for this apparent inconsistency is.

It's because you're not running with poll queues for NVMe, hence you're
throwing a lot of performance away. Load nvme with poll_queues=X (or boot
with nvme.poll_queues=X, if built in) to have a set of separate queues
for polling. These don't have IRQs enabled, and it'll work much faster
for you.
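[A minimal userspace sketch, not from the thread, of the polled-mode setup
under discussion, using liburing's readv interface; /dev/nvme0n1 and the
4k read are placeholder choices, and IRQ-less polling additionally
requires the nvme poll_queues setup Jens describes above.]

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/uio.h>
    #include <liburing.h>

    int main(void)
    {
            struct io_uring ring;
            struct io_uring_sqe *sqe;
            struct io_uring_cqe *cqe;
            struct iovec iov;
            int fd, ret;

            /* O_DIRECT is required for IOPOLL reads/writes */
            fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            if (posix_memalign(&iov.iov_base, 4096, 4096))
                    return 1;
            iov.iov_len = 4096;

            /* IOPOLL ring: completions are reaped by polling the
             * device, not by waiting for an IRQ */
            ret = io_uring_queue_init(8, &ring, IORING_SETUP_IOPOLL);
            if (ret < 0) {
                    fprintf(stderr, "queue_init: %s\n", strerror(-ret));
                    return 1;
            }

            sqe = io_uring_get_sqe(&ring);
            io_uring_prep_readv(sqe, fd, &iov, 1, 0);
            io_uring_submit(&ring);

            /* for IOPOLL rings this busy-polls for the completion */
            ret = io_uring_wait_cqe(&ring, &cqe);
            if (!ret) {
                    printf("read: %d\n", cqe->res);
                    io_uring_cqe_seen(&ring, cqe);
            }
            io_uring_queue_exit(&ring);
            return 0;
    }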
On 10/24/19 10:09 AM, Jens Axboe wrote:
> I'm a little confused, because we should be holding the submission
> reference to the request still at this point. So how is it going away?
> I must be missing something...

I don't think the submission reference is going away...

I *think* the problem has to do with the fact that io_complete_rw_iopoll(),
which sets REQ_F_IOPOLL_COMPLETED, is being called from interrupt context
in my configuration, and so there is a potential race between updating the
request there and checking it in __io_submit_sqe().

My first workaround was to simply poll for REQ_F_IOPOLL_COMPLETED in the
code snippet above:

    if (req->result == -EAGAIN) {
            poll for REQ_F_IOPOLL_COMPLETED
            return -EAGAIN;
    }

and that got rid of the problem.

> It's because you're not running with poll queues for NVMe, hence you're
> throwing a lot of performance away. Load nvme with poll_queues=X (or boot
> with nvme.poll_queues=X, if built in) to have a set of separate queues
> for polling. These don't have IRQs enabled, and it'll work much faster
> for you.

That's what I did in fact. I booted with nvme.poll_queues=36 (I figured
1 per core, but I'm not sure what a reasonable number is).

I also checked that /sys/block/<nvme>/queue/io_poll = 1.

What's really odd is that the irq/sec numbers from mpstat and perf show
equivalent values with/without polling (with/without the fio "hipri"
option), even though I can see from perf top that we are in fact polling
in one case. I don't know if I'm missing a step or something is off in
my config.

Thanks.

--bijan
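[Rendered as real code against the fs/io_uring.c of that era, the
workaround amounts to roughly the following fragment; this is a sketch of
what Bijan describes, not the RFC patch, and as Jens notes in the next
message the spin can only terminate if another context, such as an IRQ,
completes the request.]

    if (ctx->flags & IORING_SETUP_IOPOLL) {
            if (req->result == -EAGAIN) {
                    /* busy-wait until the completion handler has
                     * finished with the request, so it is safe to
                     * recycle it for resubmission */
                    while (!(req->flags & REQ_F_IOPOLL_COMPLETED))
                            cpu_relax();
                    return -EAGAIN;
            }
    }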
On 10/24/19 1:18 PM, Bijan Mottahedeh wrote:
> I don't think the submission reference is going away...
>
> I *think* the problem has to do with the fact that io_complete_rw_iopoll(),
> which sets REQ_F_IOPOLL_COMPLETED, is being called from interrupt context
> in my configuration, and so there is a potential race between updating the
> request there and checking it in __io_submit_sqe().
>
> My first workaround was to simply poll for REQ_F_IOPOLL_COMPLETED in the
> code snippet above [...] and that got rid of the problem.

But that will not work at all for a proper poll setup, where you don't
trigger any IRQs... It only happens to work for this case because you're
still triggering interrupts. But even in that case, it's not a real
solution, but I don't think that's the argument here ;-)

I see what the race is now, it's specific to IRQ driven polling. We
really should just disallow that, to be honest, it doesn't make any
sense. But let me think about if we can do a reasonable solution to this
that doesn't involve adding overhead for a proper setup.

> That's what I did in fact. I booted with nvme.poll_queues=36 (I figured
> 1 per core, but I'm not sure what a reasonable number is).
>
> I also checked that /sys/block/<nvme>/queue/io_poll = 1.
>
> What's really odd is that the irq/sec numbers from mpstat and perf show
> equivalent values with/without polling (with/without the fio "hipri"
> option), even though I can see from perf top that we are in fact polling
> in one case. I don't know if I'm missing a step or something is off in
> my config.

Doesn't sound right. io_poll just says if polling is enabled, not if
it's IRQ driven or not. Try this, assuming nvme0n1 is the device in
question:

    # cd /sys/kernel/debug/block/nvme0n1
    # grep . hctx*/type
    hctx0/type:default
    hctx1/type:read
    hctx2/type:read
    hctx3/type:read
    hctx4/type:poll
    hctx5/type:poll
    hctx6/type:poll
    hctx7/type:poll

This is just a VM I have, hence just the 8 queues. But this shows a
proper mapping setup - 1 default queue, 3 read, and 4 poll queues.

Is nvme a module on your box? Or is it built-in?
On 10/25/19 7:46 AM, Bijan Mottahedeh wrote:
> On 10/24/19 3:31 PM, Jens Axboe wrote:
>> But that will not work at all for a proper poll setup, where you don't
>> trigger any IRQs... It only happens to work for this case because you're
>> still triggering interrupts. But even in that case, it's not a real
>> solution, but I don't think that's the argument here ;-)
>
> Sure.
>
> I'm just curious though as to how it would break the poll case, because
> io_complete_rw_iopoll() would still be called through polling,
> REQ_F_IOPOLL_COMPLETED would be set, and so io_iopoll_complete()
> should be able to reliably check req->result.

It'd break the poll case because the task doing the submission is
generally also the one that finds and reaps completions. Hence if you
block that task just polling on that completion bit, you are preventing
that very task from going and reaping completions. The condition would
never become true, and you are now looping forever.

> The same poll test seemed to run ok with nvme interrupts not being
> triggered. Anyway, no argument that it's not needed!

A few reasons why it would make progress:

- You eventually trigger a timeout on the nvme side, as blk-mq finds the
  request hasn't been completed by an IRQ. But that's a 30 second ordeal
  before that event occurs.

- There were still interrupts enabled.

- You have two threads, one doing submission and one doing completions.
  Maybe using SQPOLL? If that's the case, then yes, it'd still work, as
  you have separate threads for submission and completion.

For the "generic" case of just using one thread and IRQs disabled, it'd
deadlock.

>> I see what the race is now, it's specific to IRQ driven polling. We
>> really should just disallow that, to be honest, it doesn't make any
>> sense. But let me think about if we can do a reasonable solution to
>> this that doesn't involve adding overhead for a proper setup.
>
> It's a nonsensical config in a way, and so disallowing it would make
> the most sense.

Definitely. The nvme driver should not set .poll() if it doesn't have
non-irq poll queues.
Something like this:

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 869f462e6b6e..ac1b0a9387a4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1594,6 +1594,16 @@ static const struct blk_mq_ops nvme_mq_ops = {
 	.poll		= nvme_poll,
 };
 
+static const struct blk_mq_ops nvme_mq_nopoll_ops = {
+	.queue_rq	= nvme_queue_rq,
+	.complete	= nvme_pci_complete_rq,
+	.commit_rqs	= nvme_commit_rqs,
+	.init_hctx	= nvme_init_hctx,
+	.init_request	= nvme_init_request,
+	.map_queues	= nvme_pci_map_queues,
+	.timeout	= nvme_timeout,
+};
+
 static void nvme_dev_remove_admin(struct nvme_dev *dev)
 {
 	if (dev->ctrl.admin_q && !blk_queue_dying(dev->ctrl.admin_q)) {
@@ -2269,11 +2279,14 @@ static void nvme_dev_add(struct nvme_dev *dev)
 	int ret;
 
 	if (!dev->ctrl.tagset) {
-		dev->tagset.ops = &nvme_mq_ops;
 		dev->tagset.nr_hw_queues = dev->online_queues - 1;
 		dev->tagset.nr_maps = 2; /* default + read */
-		if (dev->io_queues[HCTX_TYPE_POLL])
+		if (dev->io_queues[HCTX_TYPE_POLL]) {
+			dev->tagset.ops = &nvme_mq_ops;
 			dev->tagset.nr_maps++;
+		} else {
+			dev->tagset.ops = &nvme_mq_nopoll_ops;
+		}
 		dev->tagset.timeout = NVME_IO_TIMEOUT;
 		dev->tagset.numa_node = dev_to_node(dev->dev);
 		dev->tagset.queue_depth =
On 10/25/19 8:07 AM, Jens Axboe wrote:
> Definitely. The nvme driver should not set .poll() if it doesn't have
> non-irq poll queues. Something like this:

Actually, we already disable polling if we don't have specific poll
queues:

    if (set->nr_maps > HCTX_TYPE_POLL &&
        set->map[HCTX_TYPE_POLL].nr_queues)
            blk_queue_flag_set(QUEUE_FLAG_POLL, q);

Did you see any timeouts in your tests? I wonder if the use-after-free
triggered when the timeout found the request while you had the busy-spin
logic we discussed previously.
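[That gating means a queue without dedicated poll queues never gets
QUEUE_FLAG_POLL, so the block layer's poll entry point bails out
immediately; a sketch of the guard at the top of the v5.4-era blk_poll(),
reconstructed from memory, so treat the details as approximate.]

    int blk_poll(struct request_queue *q, blk_qc_t cookie, bool spin)
    {
            /* no valid cookie, or no dedicated poll queues: there is
             * nothing to poll and the request completes by IRQ */
            if (!blk_qc_t_valid(cookie) ||
                !test_bit(QUEUE_FLAG_POLL, &q->queue_flags))
                    return 0;

            /* ... map the cookie to an hctx and poll it ... */
    }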
On 10/25/19 8:18 AM, Jens Axboe wrote:
> Actually, we already disable polling if we don't have specific poll
> queues:
>
>     if (set->nr_maps > HCTX_TYPE_POLL &&
>         set->map[HCTX_TYPE_POLL].nr_queues)
>             blk_queue_flag_set(QUEUE_FLAG_POLL, q);
>
> Did you see any timeouts in your tests? I wonder if the use-after-free
> triggered when the timeout found the request while you had the busy-spin
> logic we discussed previously.

Ah, but we still have fops->iopoll() set for that case. So we just won't
poll for it, it'll get completed by IRQ. So I do think we need to handle
this case in io_uring. I'll get back to you.
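[One conceivable shape for that io_uring-side handling, offered purely as
a hedged sketch: the helper name is invented here and this is not
necessarily the fix that eventually landed. The idea is to only honor
IOPOLL semantics when the backing queue actually advertises IRQ-less
polling.]

    /* hypothetical helper, for illustration only */
    static bool io_bdev_supports_iopoll(struct block_device *bdev)
    {
            struct request_queue *q = bdev_get_queue(bdev);

            /* QUEUE_FLAG_POLL is set only when dedicated poll queues exist */
            return q && test_bit(QUEUE_FLAG_POLL, &q->queue_flags);
    }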
On 10/25/19 7:18 AM, Jens Axboe wrote:
> Actually, we already disable polling if we don't have specific poll
> queues:
>
>     if (set->nr_maps > HCTX_TYPE_POLL &&
>         set->map[HCTX_TYPE_POLL].nr_queues)
>             blk_queue_flag_set(QUEUE_FLAG_POLL, q);
>
> Did you see any timeouts in your tests? I wonder if the use-after-free
> triggered when the timeout found the request while you had the busy-spin
> logic we discussed previously.

I didn't notice any timeouts. The error triggered without me making any
changes. The busy-spin workaround I mentioned before actually got rid of
the error.

--bijan
On 10/25/19 7:21 AM, Jens Axboe wrote:
> Ah, but we still have fops->iopoll() set for that case. So we just won't
> poll for it, it'll get completed by IRQ. So I do think we need to handle
> this case in io_uring. I'll get back to you.

I ran the same test on linux-next-20191029 in polled mode and got the
same use-after-free panic:

- I booted with nvme.poll_queues set and verified that all queues
  except default were of type poll

- I added three assertions to verify the following:

  - nvme_timeout() is not called

  - io_complete_rw_iopoll() is not called from interrupt context

  - io_sq_offload_start() is not called with IORING_SETUP_SQPOLL set

Is it possible that the race is there also in polled mode, since a
request submitted by one thread could conceivably be polled for and
completed by a different thread, e.g. in
io_uring_enter()->io_iopoll_check()?

--bijan
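[The second of the assertions Bijan lists above can be sketched as
follows, mirroring a simplified v5.4-era io_complete_rw_iopoll() with the
WARN_ON_ONCE() as the added instrumentation; this is an assumption about
how the check was wired in, not his actual patch.]

    static void io_complete_rw_iopoll(struct kiocb *kiocb, long res, long res2)
    {
            struct io_kiocb *req = container_of(kiocb, struct io_kiocb, rw);

            WARN_ON_ONCE(in_interrupt());   /* must never fire with poll queues */

            req->result = res;
            if (res != -EAGAIN)
                    req->flags |= REQ_F_IOPOLL_COMPLETED;
    }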
On 10/29/19 12:17 PM, Bijan Mottahedeh wrote:
> Is it possible that the race is there also in polled mode, since a
> request submitted by one thread could conceivably be polled for and
> completed by a different thread, e.g. in
> io_uring_enter()->io_iopoll_check()?

I also tested my RFC again with 1 thread and with queue depths of 1 to
1024 in multiples of 8 and didn't see any hangs.

Just to be clear, the busy-spin logic discussed before was only a
workaround and not in the RFC.

--bijan
On 10/29/19 1:23 PM, Bijan Mottahedeh wrote:
> I also tested my RFC again with 1 thread and with queue depths of 1 to
> 1024 in multiples of 8 and didn't see any hangs.
>
> Just to be clear, the busy-spin logic discussed before was only a
> workaround and not in the RFC.

What is your exact test case?
On 10/29/19 12:27 PM, Jens Axboe wrote:
> What is your exact test case?

See original cover letter. I can reproduce the failure with numjobs
between 8 and 32.
On 10/29/19 1:31 PM, Bijan Mottahedeh wrote:
> See original cover letter. I can reproduce the failure with numjobs
> between 8 and 32.

And how many poll queues are you using?
On 10/29/19 12:33 PM, Jens Axboe wrote:
> And how many poll queues are you using?

30
On 10/29/19 1:40 PM, Bijan Mottahedeh wrote:
> On 10/29/19 12:33 PM, Jens Axboe wrote:
>> And how many poll queues are you using?
> 30

And how many threads/cores in the box? Trying to get a sense for how
many CPUs share a single poll queue, if any.
On 10/29/19 12:46 PM, Jens Axboe wrote:
> On 10/29/19 1:40 PM, Bijan Mottahedeh wrote:
>> 30
> And how many threads/cores in the box? Trying to get a sense for how
> many CPUs share a single poll queue, if any.

Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
On 10/29/19 1:51 PM, Bijan Mottahedeh wrote:
> On 10/29/19 12:46 PM, Jens Axboe wrote:
>> And how many threads/cores in the box? Trying to get a sense for how
>> many CPUs share a single poll queue, if any.
>
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             2

OK, so 2*8*2 == 32 threads, hence some threads will share a poll queue.
On 10/29/19 1:52 PM, Jens Axboe wrote:
> On 10/29/19 1:51 PM, Bijan Mottahedeh wrote:
>> Thread(s) per core:    2
>> Core(s) per socket:    8
>> Socket(s):             2
>
> OK, so 2*8*2 == 32 threads, hence some threads will share a poll queue.

OK, so I still don't quite see where the issue is. Your setup has more
than one CPU per poll queue, and I can reproduce the issue quite easily
here with a similar setup.

Below are some things that are given:

1) If we fail to submit the IO, io_complete_rw_iopoll() is ultimately
   invoked _from_ the submission path. This means that the result is
   readily available by the time we go and check:

	if (req->result == -EAGAIN)

   in __io_submit_sqe(). This is a submission time failure, not
   something that should be happening from a completion path after the
   IO has been submitted successfully.

2) If we succeed in submitting the request, given that we have other
   tasks polling, the request can complete any time. It can very well
   be complete by the time we call io_iopoll_req_issued(), and this is
   perfectly fine. We know the request isn't going away, as we're
   holding a reference to it. kiocb has the same lifetime, as it's
   embedded in the io_kiocb request. Note that this isn't the same
   io_ring_ctx at all, some other task with its own io_ring_ctx just
   happens to find our request when doing polling on the same queue
   itself.

We would definitely get in trouble if we submitted the request
successfully, but returned -EAGAIN because we thought we didn't.

In my testing, what I seem to see is double completions on the block
layer side, and double issues. I can't quite get that to match up with
anything...

I'll keep digging, hopefully I'll get some deeper understanding of what
exactly the issue is shortly. I was hoping I'd get that by writing my
thoughts in this email, but alas that didn't happen yet.
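[To make the two givens above concrete, here is a minimal userspace C
model of them. All names (model_req, model_submit, MODEL_EAGAIN) are
invented for illustration; this is a sketch, not the io_uring or
block-layer code.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define MODEL_EAGAIN (-11)

struct model_req {
	atomic_int result;	/* what the submitter checks for -EAGAIN */
};

/* Case 1: a would-block failure completes the request inline from the
 * submit path itself, so ->result is guaranteed visible before the
 * caller's -EAGAIN check. */
static int model_submit(struct model_req *req, bool queue_full)
{
	if (queue_full) {
		atomic_store(&req->result, MODEL_EAGAIN);
		return 0;
	}
	/* Case 2: submitted successfully. From this point, another task
	 * polling the same hardware queue may complete the request at
	 * any moment; only the held reference keeps it alive. */
	return 0;
}

int main(void)
{
	struct model_req req;

	atomic_init(&req.result, 0);
	model_submit(&req, true);
	if (atomic_load(&req.result) == MODEL_EAGAIN)
		printf("inline failure: safe to treat as not submitted\n");
	return 0;
}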
> OK, so I still don't quite see where the issue is. Your setup has more
> than one CPU per poll queue, and I can reproduce the issue quite easily
> here with a similar setup.

That's probably why I couldn't reproduce this in a vm. This time I set
up one poll queue in an 8 cpu vm and reproduced it.

> Below are some things that are given:
>
> 1) If we fail to submit the IO, io_complete_rw_iopoll() is ultimately
>    invoked _from_ the submission path. This means that the result is
>    readily available by the time we go and check:
>
> 	if (req->result == -EAGAIN)
>
>    in __io_submit_sqe().

Is that always true?

Let's say the operation was __io_submit_sqe()->io_read().

By "failing to submit the io", do you mean that
io_read()->call_read_iter() returned success or failure? Are you saying
that req->result was set from kiocb_done() or later in the block layer?

Anyway I assume that io_read() would have to return success since
otherwise __io_submit_sqe() would immediately return and not check
req->result:

	if (ret)
		return ret;

So if io_read() did return success, are we guaranteed that setting
req->result = -EAGAIN would always happen before the check?

Also, is it possible that we can be preempted in __io_submit_sqe() after
the call to io_read() but before the -EAGAIN check?

> 2) If we succeed in submitting the request, given that we have other
>    tasks polling, the request can complete any time. It can very well
>    be complete by the time we call io_iopoll_req_issued(), and this is
>    perfectly fine. We know the request isn't going away, as we're
>    holding a reference to it. kiocb has the same lifetime, as it's
>    embedded in the io_kiocb request. Note that this isn't the same
>    io_ring_ctx at all, some other task with its own io_ring_ctx just
>    happens to find our request when doing polling on the same queue
>    itself.

Ah yes, it's a different io_ring_ctx, different poll list etc. For my
own clarity, I assume all contexts are mapping the same actual sq/cq
rings?
On 10/30/19 8:02 AM, Bijan Mottahedeh wrote:
>> OK, so I still don't quite see where the issue is. Your setup has more
>> than one CPU per poll queue, and I can reproduce the issue quite easily
>> here with a similar setup.
>
> That's probably why I couldn't reproduce this in a vm. This time I set
> up one poll queue in an 8 cpu vm and reproduced it.

You definitely do need poll queue sharing to hit this.

> By "failing to submit the io", do you mean that
> io_read()->call_read_iter() returned success or failure? Are you saying
> that req->result was set from kiocb_done() or later in the block layer?

By "failed to submit" I mean that req->result == -EAGAIN. We set that in
io_complete_rw_iopoll(), which is called off bio_wouldblock_error() in
the block layer. This is different from a ret != 0 return, which would
have been some sort of other failure we encountered, failing to submit
the request. For that error, we just end the request. For the
bio_wouldblock_error(), we need to retry submission.

> Anyway I assume that io_read() would have to return success since
> otherwise __io_submit_sqe() would immediately return and not check
> req->result:
>
> 	if (ret)
> 		return ret;

Right, if ret != 0, then we have a fatal error for that request.
->result will not have been set at all for that case.

> So if io_read() did return success, are we guaranteed that setting
> req->result = -EAGAIN would always happen before the check?

Always, since it happens inline from when you attempt to issue the read.

> Also, is it possible that we can be preempted in __io_submit_sqe() after
> the call to io_read() but before the -EAGAIN check?

Sure, we're not disabling preemption. But that shouldn't have any
bearing on this case.

> Ah yes, it's a different io_ring_ctx, different poll list etc. For my
> own clarity, I assume all contexts are mapping the same actual sq/cq
> rings?

Exactly.

The ring/context isn't tied to a particular hardware queue; a specific
IO is mapped to a specific queue at the time it's being submitted. That
depends on the IO type and where that task happens to be running at the
time the IO is being submitted.
On 10/30/19 8:18 AM, Jens Axboe wrote:
> On 10/30/19 8:02 AM, Bijan Mottahedeh wrote:
>> So if io_read() did return success, are we guaranteed that setting
>> req->result = -EAGAIN would always happen before the check?
>
> Always, since it happens inline from when you attempt to issue the read.

This might just do it, except it looks like there's a bug in sbitmap
where we don't always flush deferred clears. I'll look into that now.
diff --git a/fs/io_uring.c b/fs/io_uring.c
index af1937d66aee..f3ca2ee44dbd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1212,6 +1212,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
 
 		kiocb->ki_flags |= IOCB_HIPRI;
 		kiocb->ki_complete = io_complete_rw_iopoll;
+		req->result = 0;
 	} else {
 		if (kiocb->ki_flags & IOCB_HIPRI)
 			return -EINVAL;
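[The one-liner above initializes req->result when preparing an IOPOLL
request. A hypothetical sketch of the stale-result hazard it guards
against; the sketch_* names are invented, this is not the kernel code.]

#include <errno.h>
#include <stdio.h>

struct sketch_req {
	int result;		/* may hold leftover state if recycled */
};

/* Inline would-block completion, as described earlier in the thread. */
static void sketch_issue(struct sketch_req *req, int would_block)
{
	if (would_block)
		req->result = -EAGAIN;
}

static int sketch_submit(struct sketch_req *req, int would_block)
{
	req->result = 0;	/* mirrors the added line: clear stale state */
	sketch_issue(req, would_block);
	if (req->result == -EAGAIN)
		return -EAGAIN;	/* genuine inline failure, safe to retry */
	return 0;		/* submitted; pollers may complete it anytime */
}

int main(void)
{
	struct sketch_req req = { .result = -EAGAIN };	/* stale value */

	/* Without the reset in sketch_submit(), this successful submit
	 * would be misread as -EAGAIN and issued a second time. */
	printf("submit -> %d\n", sketch_submit(&req, 0));
	return 0;
}

[Without the reset, a recycled request whose previous use ended in
-EAGAIN could be mistaken for a fresh inline failure and submitted
again, i.e. the "submit after -EAGAIN was already received" case that
the patch is said later in the thread to address.]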
On 10/30/19 10:32 AM, Jens Axboe wrote:
> This might just do it, except it looks like there's a bug in sbitmap
> where we don't always flush deferred clears. I'll look into that now.
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index af1937d66aee..f3ca2ee44dbd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1212,6 +1212,7 @@ static int io_prep_rw(struct io_kiocb *req, const struct sqe_submit *s,
>  
>  		kiocb->ki_flags |= IOCB_HIPRI;
>  		kiocb->ki_complete = io_complete_rw_iopoll;
> +		req->result = 0;
>  	} else {
>  		if (kiocb->ki_flags & IOCB_HIPRI)
>  			return -EINVAL;

I checked out a few configs and didn't hit the error. However, with 16
poll queues and 32 fio jobs, I eventually started getting a bunch of
nvme timeout/abort messages and the system seemed to be stuck in a poll
loop. Killing the test didn't work and the system was hung:

...

[  407.978329] nvme nvme0: I/O 679 QID 30 timeout, aborting
[  408.085318] nvme nvme0: I/O 680 QID 30 timeout, aborting
[  408.164145] nvme nvme0: I/O 682 QID 30 timeout, aborting
[  408.238102] nvme nvme0: I/O 683 QID 30 timeout, aborting
[  412.970712] nvme nvme0: Abort status: 0x0
[  413.018696] nvme nvme0: Abort status: 0x0
[  413.066691] nvme nvme0: Abort status: 0x0
[  413.114684] nvme nvme0: Abort status: 0x0
[  438.035324] nvme nvme0: I/O 674 QID 30 timeout, reset controller
[  444.287637] nvme nvme0: 15/0/16 default/read/poll queues
[  637.073111] INFO: task kworker/u66:0:55 blocked for more than 122 seconds.

...

fio: terminating on signal 2
Jobs: 32 (f=32): [r(32)][0.3%][eta 02d:01h:28m:36s]

Couldn't ssh to the machine and started getting hung task timeout
errors.
On 10/30/19 1:21 PM, Bijan Mottahedeh wrote:
> I checked out a few configs and didn't hit the error. However, with 16
> poll queues and 32 fio jobs, I eventually started getting a bunch of
> nvme timeout/abort messages and the system seemed to be stuck in a poll
> loop. Killing the test didn't work and the system was hung:
>
> ...
>
> fio: terminating on signal 2
> Jobs: 32 (f=32): [r(32)][0.3%][eta 02d:01h:28m:36s]

Right, there's some funkiness going on on the nvme side with shared
poll queues, I'm looking into it right now. The patch I sent only takes
care of the "submit after -EAGAIN was already received" case that you
reported.