[v2] tests/nvme: Add admin-passthru+reset race test

Message ID	20221117212210.934-1-jonathan.derrick@linux.dev (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-block-owner@kernel.org> From: Jonathan Derrick <jonathan.derrick@linux.dev> To: <linux-nvme@lists.infradead.org> Cc: <linux-block@vger.kernel.org>, "Shin\\'ichiro Kawasaki" <shinichiro.kawasaki@wdc.com>, Chaitanya Kulkarni <chaitanyak@nvidia.com>, Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>, Sagi Grimberg <sagi@grimberg.me>, Jonathan Derrick <jonathan.derrick@linux.dev> Subject: [PATCH v2] tests/nvme: Add admin-passthru+reset race test Date: Thu, 17 Nov 2022 14:22:10 -0700 Message-Id: <20221117212210.934-1-jonathan.derrick@linux.dev> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	[v2] tests/nvme: Add admin-passthru+reset race test \| expand [v2] tests/nvme: Add admin-passthru+reset race test

Jonathan Derrick Nov. 17, 2022, 9:22 p.m. UTC

Adds a test which runs many formats and reset_controllers in parallel.
The intent is to expose timing holes in the controller state machine
which will lead to hung task timeouts and the controller becoming
unavailable.

Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354

Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
reliably segfaults my QEMU instance (something else to look into) and I don't
have any 'real' hardware to test this on at the moment. It looks like several
passthru commands are able to enqueue prior/during/after resetting/connecting.

The issue seems to be very heavily timing related, so the loop in the header is
a lot more forceful in this approach.

As far as the loop goes, I've noticed it will typically repro immediately or
pass the whole test.

 tests/nvme/047     | 121 +++++++++++++++++++++++++++++++++++++++++++++
 tests/nvme/047.out |   2 +
 2 files changed, 123 insertions(+)
 create mode 100755 tests/nvme/047
 create mode 100644 tests/nvme/047.out

Keith Busch Nov. 21, 2022, 8:55 p.m. UTC | #1

On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
> reliably segfaults my QEMU instance (something else to look into) and I don't
> have any 'real' hardware to test this on at the moment. It looks like several
> passthru commands are able to enqueue prior/during/after resetting/connecting.

I'm not seeing any problem with the latest nvme-qemu after several dozen
iterations of this test case. In that environment, the formats and
resets complete practically synchronously with the call, so everything
proceeds quickly. Is there anything special I need to change?

> The issue seems to be very heavily timing related, so the loop in the header is
> a lot more forceful in this approach.
> 
> As far as the loop goes, I've noticed it will typically repro immediately or
> pass the whole test.

I can only get possible repro in scenarios that have multi-second long,
serialized format times. Even then, it still appears that everything
fixes itself after a waiting. Are you observing the same, or is it stuck
forever in your observations?

> +remove_and_rescan() {
> +	local pdev=$1
> +	echo 1 > /sys/bus/pci/devices/"$pdev"/remove
> +	echo 1 > /sys/bus/pci/rescan
> +}

This function isn't called anywhere.

Jonathan Derrick Nov. 21, 2022, 10:34 p.m. UTC | #2

On 11/21/2022 1:55 PM, Keith Busch wrote:
> On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
>> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
>> reliably segfaults my QEMU instance (something else to look into) and I don't
>> have any 'real' hardware to test this on at the moment. It looks like several
>> passthru commands are able to enqueue prior/during/after resetting/connecting.
> 
> I'm not seeing any problem with the latest nvme-qemu after several dozen
> iterations of this test case. In that environment, the formats and
> resets complete practically synchronously with the call, so everything
> proceeds quickly. Is there anything special I need to change?
>  
I can still repro this with nvme-fixes tag, so I'll have to dig into it myself
Does the tighter loop in the test comment header produce results?


>> The issue seems to be very heavily timing related, so the loop in the header is
>> a lot more forceful in this approach.
>>
>> As far as the loop goes, I've noticed it will typically repro immediately or
>> pass the whole test.
> 
> I can only get possible repro in scenarios that have multi-second long,
> serialized format times. Even then, it still appears that everything
> fixes itself after a waiting. Are you observing the same, or is it stuck
> forever in your observations?
In 5.19, it gets stuck forever with lots of formats outstanding and
controller stuck in resetting. I'll keep digging. Thanks Keith

> 
>> +remove_and_rescan() {
>> +	local pdev=$1
>> +	echo 1 > /sys/bus/pci/devices/"$pdev"/remove
>> +	echo 1 > /sys/bus/pci/rescan
>> +}
> 
> This function isn't called anywhere.

Keith Busch Nov. 21, 2022, 10:47 p.m. UTC | #3

On Mon, Nov 21, 2022 at 03:34:44PM -0700, Jonathan Derrick wrote:
> On 11/21/2022 1:55 PM, Keith Busch wrote:
> > On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
> >> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
> >> reliably segfaults my QEMU instance (something else to look into) and I don't
> >> have any 'real' hardware to test this on at the moment. It looks like several
> >> passthru commands are able to enqueue prior/during/after resetting/connecting.
> > 
> > I'm not seeing any problem with the latest nvme-qemu after several dozen
> > iterations of this test case. In that environment, the formats and
> > resets complete practically synchronously with the call, so everything
> > proceeds quickly. Is there anything special I need to change?
> >  
> I can still repro this with nvme-fixes tag, so I'll have to dig into it myself
> Does the tighter loop in the test comment header produce results?

My qemu's backing storage is a nullblk which makes format a no-op, but I
can try something slower if you think that will have different results.
These kinds of tests are definitely more pleasant to run under
emulation, so having the recipe to recreate there is a boon.
 
> >> The issue seems to be very heavily timing related, so the loop in the header is
> >> a lot more forceful in this approach.
> >>
> >> As far as the loop goes, I've noticed it will typically repro immediately or
> >> pass the whole test.
> > 
> > I can only get possible repro in scenarios that have multi-second long,
> > serialized format times. Even then, it still appears that everything
> > fixes itself after a waiting. Are you observing the same, or is it stuck
> > forever in your observations?
> In 5.19, it gets stuck forever with lots of formats outstanding and
> controller stuck in resetting. I'll keep digging. Thanks Keith

At the moment I'm interested in upstream, so either Linus' latest
6.1-rc, or the nvme-6.2 branch. If you can confirm these are okay (which
appears to be the case on my side), then I can definitely shift focus to
stable back-ports. But if they're not okay, then I want to straighten
that side out first.

Jonathan Derrick Nov. 21, 2022, 10:49 p.m. UTC | #4

On 11/21/2022 3:34 PM, Jonathan Derrick wrote:
> 
> 
> On 11/21/2022 1:55 PM, Keith Busch wrote:
>> On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
>>> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
>>> reliably segfaults my QEMU instance (something else to look into) and I don't
>>> have any 'real' hardware to test this on at the moment. It looks like several
>>> passthru commands are able to enqueue prior/during/after resetting/connecting.
>>
>> I'm not seeing any problem with the latest nvme-qemu after several dozen
>> iterations of this test case. In that environment, the formats and
>> resets complete practically synchronously with the call, so everything
>> proceeds quickly. Is there anything special I need to change?
>>  
> I can still repro this with nvme-fixes tag, so I'll have to dig into it myself
Here's a backtrace:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7554400 (LWP 531154)]
0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
539         return sq->ctrl;
(gdb) backtrace
#0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
#1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852
#2  0x0000555555f4db15 in aio_bh_call (bh=0x7fffec279910) at ../util/async.c:150
#3  0x0000555555f4dc24 in aio_bh_poll (ctx=0x55555688fa00) at ../util/async.c:178
#4  0x0000555555f34df0 in aio_dispatch (ctx=0x55555688fa00) at ../util/aio-posix.c:421
#5  0x0000555555f4e083 in aio_ctx_dispatch (source=0x55555688fa00, callback=0x0, user_data=0x0) at ../util/async.c:320
#6  0x00007ffff7bd717d in g_main_context_dispatch () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7  0x0000555555f600c2 in glib_pollfds_poll () at ../util/main-loop.c:297
#8  0x0000555555f60140 in os_host_main_loop_wait (timeout=0) at ../util/main-loop.c:320
#9  0x0000555555f60251 in main_loop_wait (nonblocking=0) at ../util/main-loop.c:596
#10 0x0000555555a8f27c in qemu_main_loop () at ../softmmu/runstate.c:739
#11 0x000055555582b77a in qemu_default_main () at ../softmmu/main.c:37
#12 0x000055555582b7b4 in main (argc=53, argv=0x7fffffffdf88) at ../softmmu/main.c:48



> Does the tighter loop in the test comment header produce results?
> 
> 
>>> The issue seems to be very heavily timing related, so the loop in the header is
>>> a lot more forceful in this approach.
>>>
>>> As far as the loop goes, I've noticed it will typically repro immediately or
>>> pass the whole test.
>>
>> I can only get possible repro in scenarios that have multi-second long,
>> serialized format times. Even then, it still appears that everything
>> fixes itself after a waiting. Are you observing the same, or is it stuck
>> forever in your observations?
> In 5.19, it gets stuck forever with lots of formats outstanding and
> controller stuck in resetting. I'll keep digging. Thanks Keith
> 
>>
>>> +remove_and_rescan() {
>>> +	local pdev=$1
>>> +	echo 1 > /sys/bus/pci/devices/"$pdev"/remove
>>> +	echo 1 > /sys/bus/pci/rescan
>>> +}
>>
>> This function isn't called anywhere.

Keith Busch Nov. 21, 2022, 11:04 p.m. UTC | #5

[cc'ing Klaus]

On Mon, Nov 21, 2022 at 03:49:45PM -0700, Jonathan Derrick wrote:
> On 11/21/2022 3:34 PM, Jonathan Derrick wrote:
> > On 11/21/2022 1:55 PM, Keith Busch wrote:
> >> On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
> >>> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
> >>> reliably segfaults my QEMU instance (something else to look into) and I don't
> >>> have any 'real' hardware to test this on at the moment. It looks like several
> >>> passthru commands are able to enqueue prior/during/after resetting/connecting.
> >>
> >> I'm not seeing any problem with the latest nvme-qemu after several dozen
> >> iterations of this test case. In that environment, the formats and
> >> resets complete practically synchronously with the call, so everything
> >> proceeds quickly. Is there anything special I need to change?
> >>  
> > I can still repro this with nvme-fixes tag, so I'll have to dig into it myself
> Here's a backtrace:
> 
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff7554400 (LWP 531154)]
> 0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> 540         return sq->ctrl;
> (gdb) backtrace
> #0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> #1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852

Thanks, looks like a race between the admin queue format's bottom half,
and the controller reset tearing down that queue. I'll work with Klaus
on that qemu side (looks like a well placed qemu_bh_cancel() should do
it).

Klaus Jensen Nov. 22, 2022, 8:26 a.m. UTC | #6

On Nov 21 16:04, Keith Busch wrote:
> [cc'ing Klaus]
> 
> On Mon, Nov 21, 2022 at 03:49:45PM -0700, Jonathan Derrick wrote:
> > On 11/21/2022 3:34 PM, Jonathan Derrick wrote:
> > > On 11/21/2022 1:55 PM, Keith Busch wrote:
> > >> On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
> > >>> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
> > >>> reliably segfaults my QEMU instance (something else to look into) and I don't
> > >>> have any 'real' hardware to test this on at the moment. It looks like several
> > >>> passthru commands are able to enqueue prior/during/after resetting/connecting.
> > >>
> > >> I'm not seeing any problem with the latest nvme-qemu after several dozen
> > >> iterations of this test case. In that environment, the formats and
> > >> resets complete practically synchronously with the call, so everything
> > >> proceeds quickly. Is there anything special I need to change?
> > >>  
> > > I can still repro this with nvme-fixes tag, so I'll have to dig into it myself
> > Here's a backtrace:
> > 
> > Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 0x7ffff7554400 (LWP 531154)]
> > 0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> > 540         return sq->ctrl;
> > (gdb) backtrace
> > #0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> > #1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852
> 
> Thanks, looks like a race between the admin queue format's bottom half,
> and the controller reset tearing down that queue. I'll work with Klaus
> on that qemu side (looks like a well placed qemu_bh_cancel() should do
> it).
> 

Yuck. Bug located and quelched I think.

Jonathan, please try

  https://lore.kernel.org/qemu-devel/20221122081348.49963-2-its@irrelevant.dk/

This fixes the qemu crash, but I still see a "nvme still not live after
42 seconds!" resulting from the test. I'm seeing A LOT of invalid
submission queue doorbell writes:

  pci_nvme_ub_db_wr_invalid_sq in nvme_process_db: submission queue doorbell write for nonexistent queue, sqid=0, ignoring

Tested on a 6.1-rc4.

Jonathan Derrick Nov. 22, 2022, 8:30 p.m. UTC | #7

On 11/22/2022 1:26 AM, Klaus Jensen wrote:
> On Nov 21 16:04, Keith Busch wrote:
>> [cc'ing Klaus]
>>
>> On Mon, Nov 21, 2022 at 03:49:45PM -0700, Jonathan Derrick wrote:
>>> On 11/21/2022 3:34 PM, Jonathan Derrick wrote:
>>>> On 11/21/2022 1:55 PM, Keith Busch wrote:
>>>>> On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
>>>>>> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
>>>>>> reliably segfaults my QEMU instance (something else to look into) and I don't
>>>>>> have any 'real' hardware to test this on at the moment. It looks like several
>>>>>> passthru commands are able to enqueue prior/during/after resetting/connecting.
>>>>>
>>>>> I'm not seeing any problem with the latest nvme-qemu after several dozen
>>>>> iterations of this test case. In that environment, the formats and
>>>>> resets complete practically synchronously with the call, so everything
>>>>> proceeds quickly. Is there anything special I need to change?
>>>>>  
>>>> I can still repro this with nvme-fixes tag, so I'll have to dig into it myself
>>> Here's a backtrace:
>>>
>>> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0x7ffff7554400 (LWP 531154)]
>>> 0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
>>> 540         return sq->ctrl;
>>> (gdb) backtrace
>>> #0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
>>> #1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852
>>
>> Thanks, looks like a race between the admin queue format's bottom half,
>> and the controller reset tearing down that queue. I'll work with Klaus
>> on that qemu side (looks like a well placed qemu_bh_cancel() should do
>> it).
>>
> 
> Yuck. Bug located and quelched I think.
> 
> Jonathan, please try
> 
>   https://lore.kernel.org/qemu-devel/20221122081348.49963-2-its@irrelevant.dk/
> 
> This fixes the qemu crash, but I still see a "nvme still not live after
> 42 seconds!" resulting from the test. I'm seeing A LOT of invalid
> submission queue doorbell writes:
> 
>   pci_nvme_ub_db_wr_invalid_sq in nvme_process_db: submission queue doorbell write for nonexistent queue, sqid=0, ignoring
> 
> Tested on a 6.1-rc4.

Good change, just defers it a bit for me:

Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7554400 (LWP 559269)]
0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
1390        assert(cq->cqid == req->sq->cqid);
(gdb) backtrace
#0  0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
#1  0x000055555598a7a7 in nvme_misc_cb (opaque=0x7fffec141310, ret=0) at ../hw/nvme/ctrl.c:2002
#2  0x000055555599448a in nvme_do_format (iocb=0x55555770ccd0) at ../hw/nvme/ctrl.c:5891
#3  0x00005555559942a9 in nvme_format_ns_cb (opaque=0x55555770ccd0, ret=0) at ../hw/nvme/ctrl.c:5828
#4  0x0000555555dda018 in blk_aio_complete (acb=0x7fffec1fccd0) at ../block/block-backend.c:1501
#5  0x0000555555dda2fc in blk_aio_write_entry (opaque=0x7fffec1fccd0) at ../block/block-backend.c:1568
#6  0x0000555555f506b9 in coroutine_trampoline (i0=-331119632, i1=32767) at ../util/coroutine-ucontext.c:177
#7  0x00007ffff77c84e0 in __start_context () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
#8  0x00007ffff4ff2bd0 in  ()
#9  0x0000000000000000 in  ()

Klaus Jensen Nov. 23, 2022, 8:15 a.m. UTC | #8

On Nov 22 13:30, Jonathan Derrick wrote:
> 
> 
> On 11/22/2022 1:26 AM, Klaus Jensen wrote:
> > On Nov 21 16:04, Keith Busch wrote:
> >> [cc'ing Klaus]
> >>
> >> On Mon, Nov 21, 2022 at 03:49:45PM -0700, Jonathan Derrick wrote:
> >>> On 11/21/2022 3:34 PM, Jonathan Derrick wrote:
> >>>> On 11/21/2022 1:55 PM, Keith Busch wrote:
> >>>>> On Thu, Nov 17, 2022 at 02:22:10PM -0700, Jonathan Derrick wrote:
> >>>>>> I seem to have isolated the error mechanism for older kernels, but 6.2.0-rc2
> >>>>>> reliably segfaults my QEMU instance (something else to look into) and I don't
> >>>>>> have any 'real' hardware to test this on at the moment. It looks like several
> >>>>>> passthru commands are able to enqueue prior/during/after resetting/connecting.
> >>>>>
> >>>>> I'm not seeing any problem with the latest nvme-qemu after several dozen
> >>>>> iterations of this test case. In that environment, the formats and
> >>>>> resets complete practically synchronously with the call, so everything
> >>>>> proceeds quickly. Is there anything special I need to change?
> >>>>>  
> >>>> I can still repro this with nvme-fixes tag, so I'll have to dig into it myself
> >>> Here's a backtrace:
> >>>
> >>> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> >>> [Switching to Thread 0x7ffff7554400 (LWP 531154)]
> >>> 0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> >>> 540         return sq->ctrl;
> >>> (gdb) backtrace
> >>> #0  0x000055555597a9d5 in nvme_ctrl (req=0x7fffec892780) at ../hw/nvme/nvme.h:539
> >>> #1  0x0000555555994360 in nvme_format_bh (opaque=0x5555579dd000) at ../hw/nvme/ctrl.c:5852
> >>
> >> Thanks, looks like a race between the admin queue format's bottom half,
> >> and the controller reset tearing down that queue. I'll work with Klaus
> >> on that qemu side (looks like a well placed qemu_bh_cancel() should do
> >> it).
> >>
> > 
> > Yuck. Bug located and quelched I think.
> > 
> > Jonathan, please try
> > 
> >   https://lore.kernel.org/qemu-devel/20221122081348.49963-2-its@irrelevant.dk/
> > 
> > This fixes the qemu crash, but I still see a "nvme still not live after
> > 42 seconds!" resulting from the test. I'm seeing A LOT of invalid
> > submission queue doorbell writes:
> > 
> >   pci_nvme_ub_db_wr_invalid_sq in nvme_process_db: submission queue doorbell write for nonexistent queue, sqid=0, ignoring
> > 
> > Tested on a 6.1-rc4.
> 
> Good change, just defers it a bit for me:
> 
> Thread 1 "qemu-system-x86" received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff7554400 (LWP 559269)]
> 0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
> 1390        assert(cq->cqid == req->sq->cqid);
> (gdb) backtrace
> #0  0x000055555598922e in nvme_enqueue_req_completion (cq=0x0, req=0x7fffec141310) at ../hw/nvme/ctrl.c:1390
> #1  0x000055555598a7a7 in nvme_misc_cb (opaque=0x7fffec141310, ret=0) at ../hw/nvme/ctrl.c:2002
> #2  0x000055555599448a in nvme_do_format (iocb=0x55555770ccd0) at ../hw/nvme/ctrl.c:5891
> #3  0x00005555559942a9 in nvme_format_ns_cb (opaque=0x55555770ccd0, ret=0) at ../hw/nvme/ctrl.c:5828
> #4  0x0000555555dda018 in blk_aio_complete (acb=0x7fffec1fccd0) at ../block/block-backend.c:1501
> #5  0x0000555555dda2fc in blk_aio_write_entry (opaque=0x7fffec1fccd0) at ../block/block-backend.c:1568
> #6  0x0000555555f506b9 in coroutine_trampoline (i0=-331119632, i1=32767) at ../util/coroutine-ucontext.c:177
> #7  0x00007ffff77c84e0 in __start_context () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
> #8  0x00007ffff4ff2bd0 in  ()
> #9  0x0000000000000000 in  ()
> 

Bummer.

I'll keep digging.

[v2] tests/nvme: Add admin-passthru+reset race test

Commit Message

Comments

Patch