
[blktests,v1,0/2] extend nvme/045 to reconnect with invalid key

Message ID 20240304161303.19681-1-dwagner@suse.de (mailing list archive)

Message

Daniel Wagner March 4, 2024, 4:13 p.m. UTC
This is the test case for

https://lore.kernel.org/linux-nvme/20240304161006.19328-1-dwagner@suse.de/


Daniel Wagner (2):
  nvme/rc: add reconnect-delay argument only for fabrics transports
  nvme/048: add reconnect after ctrl key change

 tests/nvme/045     | 72 +++++++++++++++++++++++++++++++++++++++++++++-
 tests/nvme/045.out |  3 +-
 tests/nvme/rc      | 17 +++++++++--
 3 files changed, 88 insertions(+), 4 deletions(-)

Comments

Shin'ichiro Kawasaki March 5, 2024, 9:44 a.m. UTC | #1
On Mar 04, 2024 / 17:13, Daniel Wagner wrote:
> The is the test case for
> 
> https://lore.kernel.org/linux-nvme/20240304161006.19328-1-dwagner@suse.de/
>
> 
> Daniel Wagner (2):
>   nvme/rc: add reconnect-delay argument only for fabrics transports
>   nvme/048: add reconnect after ctrl key change

I applied the kernel patches in the link above to v6.8-rc7, then ran nvme/045
with the blktests patches in this series, and I observed failures of the test
case with various transports [1]. Is this failure expected?

Also, I observed a KASAN double-free [2]. Do you observe it in your environment?
I created a quick fix [3], and it appears to resolve the double-free.

[1]

sudo ./check nvme/045
nvme/045 (Test re-authentication)                            [failed]
    runtime  8.069s  ...  7.639s
    --- tests/nvme/045.out      2024-03-05 18:09:07.267668493 +0900
    +++ /home/shin/Blktests/blktests/results/nodev/nvme/045.out.bad     2024-03-05 18:10:07.735494384 +0900
    @@ -9,5 +9,6 @@
     Change hash to hmac(sha512)
     Re-authenticate with changed hash
     Renew host key on the controller and force reconnect
    -disconnected 0 controller(s)
    +controller "nvme1" not deleted within 5 seconds
    +disconnected 1 controller(s)
     Test complete

[2]

[  938.253184] ==================================================================
[  938.254995] BUG: KASAN: double-free in nuse_show+0x307/0x3c0 [nvme_core]
[  938.256400] Free of addr ffff88812d318000 by task nvme/1564

[  938.258777] CPU: 2 PID: 1564 Comm: nvme Not tainted 6.8.0-rc7+ #155
[  938.260188] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[  938.261695] Call Trace:
[  938.262780]  <TASK>
[  938.263950]  dump_stack_lvl+0x57/0x90
[  938.265157]  print_report+0xcf/0x670
[  938.266372]  ? __virt_addr_valid+0x211/0x400
[  938.267554]  ? nuse_show+0x307/0x3c0 [nvme_core]
[  938.268790]  kasan_report_invalid_free+0x72/0xa0
[  938.270025]  ? nuse_show+0x307/0x3c0 [nvme_core]
[  938.271242]  ? nuse_show+0x307/0x3c0 [nvme_core]
[  938.272447]  poison_slab_object+0x141/0x170
[  938.273574]  ? nuse_show+0x307/0x3c0 [nvme_core]
[  938.274826]  __kasan_slab_free+0x2e/0x50
[  938.276029]  kfree+0x116/0x350
[  938.277133]  nuse_show+0x307/0x3c0 [nvme_core]
[  938.278326]  ? __pfx_lock_acquire+0x10/0x10
[  938.279433]  ? __pfx_nuse_show+0x10/0x10 [nvme_core]
[  938.280669]  dev_attr_show+0x42/0xc0
[  938.281668]  ? sysfs_file_ops+0x11b/0x170
[  938.282733]  sysfs_kf_seq_show+0x1f0/0x3b0
[  938.283818]  seq_read_iter+0x40c/0x11c0
[  938.284888]  ? rw_verify_area+0x179/0x470
[  938.286016]  vfs_read+0x606/0xc70
[  938.287106]  ? __pfx_vfs_read+0x10/0x10
[  938.288153]  ? kasan_quarantine_put+0xd6/0x1e0
[  938.289234]  ? lockdep_hardirqs_on+0x7d/0x100
[  938.290313]  ? __fget_light+0x53/0x1e0
[  938.291267]  ksys_read+0xf7/0x1d0
[  938.292233]  ? __pfx_ksys_read+0x10/0x10
[  938.293301]  ? kasan_quarantine_put+0xd6/0x1e0
[  938.294300]  do_syscall_64+0x9a/0x190
[  938.295253]  ? __x64_sys_openat+0x11f/0x1d0
[  938.296292]  ? lockdep_hardirqs_on+0x7d/0x100
[  938.297277]  ? __pfx___x64_sys_openat+0x10/0x10
[  938.298328]  ? ksys_read+0xf7/0x1d0
[  938.299245]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
[  938.300301]  ? do_syscall_64+0xa7/0x190
[  938.301191]  ? lockdep_hardirqs_on+0x7d/0x100
[  938.302148]  ? do_syscall_64+0xa7/0x190
[  938.303107]  ? do_syscall_64+0xa7/0x190
[  938.304009]  ? do_syscall_64+0xa7/0x190
[  938.304936]  ? lockdep_hardirqs_on_prepare+0x17b/0x410
[  938.306017]  entry_SYSCALL_64_after_hwframe+0x6e/0x76
[  938.307103] RIP: 0033:0x7f57658da121
[  938.308065] Code: 00 48 8b 15 11 fd 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 40 ce 01 00 f3 0f 1e fa 80 3d 45 82 0d 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
[  938.310749] RSP: 002b:00007ffe0fd8ef98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  938.312023] RAX: ffffffffffffffda RBX: 00007ffe0fd908a8 RCX: 00007f57658da121
[  938.313215] RDX: 0000000000000fff RSI: 00007ffe0fd8efb0 RDI: 0000000000000003
[  938.314464] RBP: 00007ffe0fd90820 R08: 0000000000000073 R09: 0000000000000001
[  938.315668] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000004
[  938.316871] R13: 0000000000000000 R14: 00007f5765a4b000 R15: 000000000053bdc0
[  938.318077]  </TASK>

[  938.319688] Allocated by task 1564:
[  938.320623]  kasan_save_stack+0x2f/0x50
[  938.321579]  kasan_save_track+0x10/0x30
[  938.322532]  __kasan_kmalloc+0xa6/0xb0
[  938.323477]  nvme_identify_ns+0xae/0x230 [nvme_core]
[  938.324529]  nuse_show+0x27a/0x3c0 [nvme_core]
[  938.325546]  dev_attr_show+0x42/0xc0
[  938.326485]  sysfs_kf_seq_show+0x1f0/0x3b0
[  938.327429]  seq_read_iter+0x40c/0x11c0
[  938.328483]  vfs_read+0x606/0xc70
[  938.329401]  ksys_read+0xf7/0x1d0
[  938.330441]  do_syscall_64+0x9a/0x190
[  938.331348]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

[  938.333140] Freed by task 1564:
[  938.334143]  kasan_save_stack+0x2f/0x50
[  938.335067]  kasan_save_track+0x10/0x30
[  938.336078]  kasan_save_free_info+0x37/0x60
[  938.337101]  poison_slab_object+0x102/0x170
[  938.338124]  __kasan_slab_free+0x2e/0x50
[  938.339082]  kfree+0x116/0x350
[  938.339965]  nvme_identify_ns+0x1c5/0x230 [nvme_core]
[  938.341006]  nuse_show+0x27a/0x3c0 [nvme_core]
[  938.342003]  dev_attr_show+0x42/0xc0
[  938.342931]  sysfs_kf_seq_show+0x1f0/0x3b0
[  938.343882]  seq_read_iter+0x40c/0x11c0
[  938.344804]  vfs_read+0x606/0xc70
[  938.345708]  ksys_read+0xf7/0x1d0
[  938.346611]  do_syscall_64+0x9a/0x190
[  938.347538]  entry_SYSCALL_64_after_hwframe+0x6e/0x76

[  938.349308] The buggy address belongs to the object at ffff88812d318000
                which belongs to the cache kmalloc-4k of size 4096
[  938.350299] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[  938.350311] The buggy address is located 0 bytes inside of
                4096-byte region [ffff88812d318000, ffff88812d319000)

[  938.350314] The buggy address belongs to the physical page:
[  938.358511] page:00000000389f3330 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x12d318
[  938.360009] head:00000000389f3330 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[  938.361388] flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
[  938.362644] page_type: 0xffffffff()
[  938.363627] raw: 0017ffffc0000840 ffff888100043040 dead000000000122 0000000000000000
[  938.364958] raw: 0000000000000000 0000000080040004 00000001ffffffff 0000000000000000
[  938.366278] page dumped because: kasan: bad access detected

[  938.368303] Memory state around the buggy address:
[  938.369384]  ffff88812d317f00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[  938.370661]  ffff88812d317f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  938.371983] >ffff88812d318000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  938.373295]                    ^
[  938.374311]  ffff88812d318080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  938.375618]  ffff88812d318100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  938.376954] ==================================================================
[  938.378356] Disabling lock debugging due to kernel taint


[3]

diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index f2832f70e7e0..4e161d3cd840 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -221,14 +221,10 @@ static int ns_update_nuse(struct nvme_ns *ns)
 
 	ret = nvme_identify_ns(ns->ctrl, ns->head->ns_id, &id);
 	if (ret)
-		goto out_free_id;
+		return ret;
 
 	ns->head->nuse = le64_to_cpu(id->nuse);
-
-out_free_id:
-	kfree(id);
-
-	return ret;
+	return 0;
 }
 
 static ssize_t nuse_show(struct device *dev, struct device_attribute *attr,
Daniel Wagner March 5, 2024, 11:18 a.m. UTC | #2
On Tue, Mar 05, 2024 at 09:44:45AM +0000, Shinichiro Kawasaki wrote:
> On Mar 04, 2024 / 17:13, Daniel Wagner wrote:
> > The is the test case for
> > 
> > https://lore.kernel.org/linux-nvme/20240304161006.19328-1-dwagner@suse.de/
> >
> > 
> > Daniel Wagner (2):
> >   nvme/rc: add reconnect-delay argument only for fabrics transports
> >   nvme/048: add reconnect after ctrl key change
> 
> I apply the kernel patches in the link above to v6.8-rc7, then ran nvme/045
> with the blktests patches in the series. And I observed failure of the test
> case with various transports [1]. Is this failure expected?

If you have these patches applied, the test should pass, but we might
still have some more things to unify between the transports. The
nvme/045 test passes in my setup, though I have seen runs which hung
for some reason; I haven't figured out yet what's happening there. But
I haven't seen failures, IIRC.

I am not really surprised we are seeing some fallout, though. We are
starting to test the error code paths with this test extension.

> Also, I observed KASAN double-free [2]. Do you observe it in your environment?
> I created a quick fix [3], and it looks resolving the double-free.

No, I haven't seen this.

> sudo ./check nvme/045
> nvme/045 (Test re-authentication)                            [failed]
>     runtime  8.069s  ...  7.639s
>     --- tests/nvme/045.out      2024-03-05 18:09:07.267668493 +0900
>     +++ /home/shin/Blktests/blktests/results/nodev/nvme/045.out.bad     2024-03-05 18:10:07.735494384 +0900
>     @@ -9,5 +9,6 @@
>      Change hash to hmac(sha512)
>      Re-authenticate with changed hash
>      Renew host key on the controller and force reconnect
>     -disconnected 0 controller(s)
>     +controller "nvme1" not deleted within 5 seconds
>     +disconnected 1 controller(s)
>      Test complete

That means the host either successfully reconnected or never
disconnected. We have another test case just for the disconnect path
(number of queue changes), so if that test passes, it must be the
former... That shouldn't really happen; it would mean the auth code has
a bug.

> diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> index f2832f70e7e0..4e161d3cd840 100644
> --- a/drivers/nvme/host/sysfs.c
> +++ b/drivers/nvme/host/sysfs.c
> @@ -221,14 +221,10 @@ static int ns_update_nuse(struct nvme_ns *ns)
>  
>  	ret = nvme_identify_ns(ns->ctrl, ns->head->ns_id, &id);
>  	if (ret)
> -		goto out_free_id;
> +		return ret;

Yes, this is correct.
>  
>  	ns->head->nuse = le64_to_cpu(id->nuse);
> -
> -out_free_id:
> -	kfree(id);
> -
> -	return ret;
> +	return 0;
>  }
>

I think you still need to free the 'id' on the normal exit path, though.
Shin'ichiro Kawasaki March 6, 2024, 8:44 a.m. UTC | #3
On Mar 05, 2024 / 12:18, Daniel Wagner wrote:
> On Tue, Mar 05, 2024 at 09:44:45AM +0000, Shinichiro Kawasaki wrote:
> > On Mar 04, 2024 / 17:13, Daniel Wagner wrote:
> > > The is the test case for
> > > 
> > > https://lore.kernel.org/linux-nvme/20240304161006.19328-1-dwagner@suse.de/
> > >
> > > 
> > > Daniel Wagner (2):
> > >   nvme/rc: add reconnect-delay argument only for fabrics transports
> > >   nvme/048: add reconnect after ctrl key change
> > 
> > I apply the kernel patches in the link above to v6.8-rc7, then ran nvme/045
> > with the blktests patches in the series. And I observed failure of the test
> > case with various transports [1]. Is this failure expected?
> 
> If you have these patches applied, the test should pass. But we might
> have still some more stuff to unify between the transports. The nvme/045
> test passes in my setup. Though I have seen runs which were hang for
> some reason. Haven't figured out yet what's happening there. But I
> haven't seen failures, IIRC.
> 
> I am not really surprised we seeing some fallouts though. We start to
> test the error code paths with this test extension.
> 
> > Also, I observed KASAN double-free [2]. Do you observe it in your environment?
> > I created a quick fix [3], and it looks resolving the double-free.
> 
> No, I haven't seen this.
> 
> > sudo ./check nvme/045
> > nvme/045 (Test re-authentication)                            [failed]
> >     runtime  8.069s  ...  7.639s
> >     --- tests/nvme/045.out      2024-03-05 18:09:07.267668493 +0900
> >     +++ /home/shin/Blktests/blktests/results/nodev/nvme/045.out.bad     2024-03-05 18:10:07.735494384 +0900
> >     @@ -9,5 +9,6 @@
> >      Change hash to hmac(sha512)
> >      Re-authenticate with changed hash
> >      Renew host key on the controller and force reconnect
> >     -disconnected 0 controller(s)
> >     +controller "nvme1" not deleted within 5 seconds
> >     +disconnected 1 controller(s)
> >      Test complete
> 
> That means the host either successfully reconnected or never
> disconnected. We have another test case just for the disconnect test
> (number of queue changes), so if this test passes, it must be the
> former... Shouldn't really happen, this would mean the auth code has bug.

The test case nvme/048 passes, so this looks like a bug.

> 
> > diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
> > index f2832f70e7e0..4e161d3cd840 100644
> > --- a/drivers/nvme/host/sysfs.c
> > +++ b/drivers/nvme/host/sysfs.c
> > @@ -221,14 +221,10 @@ static int ns_update_nuse(struct nvme_ns *ns)
> >  
> >  	ret = nvme_identify_ns(ns->ctrl, ns->head->ns_id, &id);
> >  	if (ret)
> > -		goto out_free_id;
> > +		return ret;
> 
> Yes, this is correct.
> >  
> >  	ns->head->nuse = le64_to_cpu(id->nuse);
> > -
> > -out_free_id:
> > -	kfree(id);
> > -
> > -	return ret;
> > +	return 0;
> >  }
> >
> 
> I think you still need to free the 'id' on the normal exit path though

Thanks, I posted the patch with the fix.

> 
> If you have these patches applied, the test should pass. But we might
> have still some more stuff to unify between the transports. The nvme/045
> test passes in my setup. Though I have seen runs which were hang for
> some reason. Haven't figured out yet what's happening there. But I
> haven't seen failures.

Even with the fix for the double-free, I observe the nvme/045 failure for the
rdma, tcp and fc transports. I wonder where the difference between your system
and mine comes from.

FYI, here I share the kernel messages for the rdma transport. They show that
nvme_rdma_reconnect_or_remove() was called repeatedly and kept trying to
reconnect. The status argument is -111 or 880, so I think the recon flag is
always true and has no effect. I'm interested in the status values in your
environment.


[   59.117607] run blktests nvme/045 at 2024-03-06 17:05:55
[   59.198629] (null): rxe_set_mtu: Set mtu to 1024
[   59.211185] PCLMULQDQ-NI instructions are not detected.
[   59.362952] infiniband ens3_rxe: set active
[   59.363765] infiniband ens3_rxe: added ens3
[   59.540499] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[   59.560541] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[   59.688866] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[   59.701114] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[   59.702195] nvme nvme1: qid 0: controller authenticated
[   59.703310] nvme nvme1: qid 0: authenticated
[   59.707478] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
[   59.709883] nvme nvme1: creating 4 I/O queues.
[   59.745087] nvme nvme1: mapped 4/0/0 default/read/poll queues.
[   59.786869] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.15:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[   59.999761] nvme nvme1: re-authenticating controller
[   60.010902] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[   60.011640] nvme nvme1: qid 0: controller authenticated
[   60.025652] nvme nvme1: re-authenticating controller
[   60.035349] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[   60.036375] nvme nvme1: qid 0: controller authenticated
[   60.050449] nvme nvme1: re-authenticating controller
[   60.060757] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[   60.061460] nvme nvme1: qid 0: controller authenticated
[   62.662430] nvme nvme1: re-authenticating controller
[   62.859510] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
[   62.860502] nvme nvme1: qid 0: controller authenticated
[   63.029182] nvme nvme1: re-authenticating controller
[   63.192844] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
[   63.193900] nvme nvme1: qid 0: controller authenticated
[   63.608561] nvme nvme1: starting error recovery
[   63.653699] nvme nvme1: Reconnecting in 1 seconds...
[   64.712627] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[   64.868896] nvmet: ctrl 1 qid 0 host response mismatch
[   64.870065] nvmet: ctrl 1 qid 0 failure1 (1)
[   64.871152] nvmet: ctrl 1 fatal error occurred!
[   64.871519] nvme nvme1: qid 0: authentication failed
[   64.874330] nvme nvme1: failed to connect queue: 0 ret=-111
[   64.878612] nvme nvme1: Failed reconnect attempt 1
[   64.880472] nvme nvme1: Reconnecting in 1 seconds...
[   66.040957] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[   66.200862] nvmet: ctrl 1 qid 0 host response mismatch
[   66.203005] nvmet: ctrl 1 qid 0 failure1 (1)
[   66.204873] nvmet: ctrl 1 fatal error occurred!
[   66.205148] nvme nvme1: qid 0: authentication failed
[   66.208609] nvme nvme1: failed to connect queue: 0 ret=-111
[   66.212033] nvme nvme1: Failed reconnect attempt 2
[   66.213837] nvme nvme1: Reconnecting in 1 seconds...
[   67.327576] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[   67.485392] nvmet: ctrl 1 qid 0 host response mismatch
[   67.487440] nvmet: ctrl 1 qid 0 failure1 (1)
[   67.489403] nvmet: ctrl 1 fatal error occurred!
[   67.489565] nvme nvme1: qid 0: authentication failed
[   67.493015] nvme nvme1: failed to connect queue: 0 ret=-111
[   67.496909] nvme nvme1: Failed reconnect attempt 3
[   67.498692] nvme nvme1: Reconnecting in 1 seconds...
[   68.610640] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[   68.739298] nvme nvme1: Identify namespace failed (880)
[   68.742833] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
[   68.774125] nvmet: ctrl 1 qid 0 host response mismatch
[   68.776440] nvme nvme1: qid 0 auth_send failed with status 880
[   68.778133] nvme nvme1: qid 0 failed to receive success1, nvme status 880
[   68.780300] nvme nvme1: qid 0: authentication failed
[   68.782490] nvme nvme1: failed to connect queue: 0 ret=880
[   68.785335] nvme nvme1: Failed reconnect attempt 4
[   68.829188] nvme nvme1: Property Set error: 880, offset 0x14
[   69.634482] rdma_rxe: unloaded
Daniel Wagner March 6, 2024, 9:36 a.m. UTC | #4
On Wed, Mar 06, 2024 at 08:44:48AM +0000, Shinichiro Kawasaki wrote:
> > > sudo ./check nvme/045
> > > nvme/045 (Test re-authentication)                            [failed]
> > >     runtime  8.069s  ...  7.639s
> > >     --- tests/nvme/045.out      2024-03-05 18:09:07.267668493 +0900
> > >     +++ /home/shin/Blktests/blktests/results/nodev/nvme/045.out.bad     2024-03-05 18:10:07.735494384 +0900
> > >     @@ -9,5 +9,6 @@
> > >      Change hash to hmac(sha512)
> > >      Re-authenticate with changed hash
> > >      Renew host key on the controller and force reconnect
> > >     -disconnected 0 controller(s)
> > >     +controller "nvme1" not deleted within 5 seconds
> > >     +disconnected 1 controller(s)
> > >      Test complete
> > 
> > That means the host either successfully reconnected or never
> > disconnected. We have another test case just for the disconnect test
> > (number of queue changes), so if this test passes, it must be the
> > former... Shouldn't really happen, this would mean the auth code has bug.
> 
> The test case nvme/048 passes, so this looks a bug.

I'll try to recreate it.

> > If you have these patches applied, the test should pass. But we might
> > have still some more stuff to unify between the transports. The nvme/045
> > test passes in my setup. Though I have seen runs which were hang for
> > some reason. Haven't figured out yet what's happening there. But I
> > haven't seen failures.
> 
> Still with the fix of the double-free, I observe the nvme/045 failure for rdma,
> tcp and fc transports. I wonder where the difference between your system and
> mine comes from.
> 
> FYI, here I share the kernel messages for rdma transport. It shows that
> nvme_rdma_reconnect_or_remove() was called repeatedly and it tried to reconnect.
> The status argument is -111 or 880, so I think the recon flag is always true
> and no effect. I'm interested in the status values in your environment.

Do you have these patches applied:

https://lore.kernel.org/linux-nvme/20240305080005.3638-1-dwagner@suse.de/

?

> [   59.117607] run blktests nvme/045 at 2024-03-06 17:05:55
> [   59.198629] (null): rxe_set_mtu: Set mtu to 1024
> [   59.211185] PCLMULQDQ-NI instructions are not detected.
> [   59.362952] infiniband ens3_rxe: set active
> [   59.363765] infiniband ens3_rxe: added ens3
> [   59.540499] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> [   59.560541] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
> [   59.688866] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
> [   59.701114] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> [   59.702195] nvme nvme1: qid 0: controller authenticated
> [   59.703310] nvme nvme1: qid 0: authenticated
> [   59.707478] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
> [   59.709883] nvme nvme1: creating 4 I/O queues.
> [   59.745087] nvme nvme1: mapped 4/0/0 default/read/poll queues.
> [   59.786869] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.15:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
> [   59.999761] nvme nvme1: re-authenticating controller
> [   60.010902] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> [   60.011640] nvme nvme1: qid 0: controller authenticated
> [   60.025652] nvme nvme1: re-authenticating controller
> [   60.035349] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> [   60.036375] nvme nvme1: qid 0: controller authenticated
> [   60.050449] nvme nvme1: re-authenticating controller
> [   60.060757] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> [   60.061460] nvme nvme1: qid 0: controller authenticated
> [   62.662430] nvme nvme1: re-authenticating controller
> [   62.859510] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
> [   62.860502] nvme nvme1: qid 0: controller authenticated
> [   63.029182] nvme nvme1: re-authenticating controller
> [   63.192844] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
> [   63.193900] nvme nvme1: qid 0: controller authenticated
> [   63.608561] nvme nvme1: starting error recovery
> [   63.653699] nvme nvme1: Reconnecting in 1 seconds...
> [   64.712627] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
> [   64.868896] nvmet: ctrl 1 qid 0 host response mismatch
> [   64.870065] nvmet: ctrl 1 qid 0 failure1 (1)
> [   64.871152] nvmet: ctrl 1 fatal error occurred!
> [   64.871519] nvme nvme1: qid 0: authentication failed
> [   64.874330] nvme nvme1: failed to connect queue: 0 ret=-111
> [   64.878612] nvme nvme1: Failed reconnect attempt 1
> [   64.880472] nvme nvme1: Reconnecting in 1 seconds...

This looks like the DNR bit is not considered in the reconnect_or_delete
function.
Shin'ichiro Kawasaki March 6, 2024, 11:28 a.m. UTC | #5
On Mar 06, 2024 / 10:36, Daniel Wagner wrote:
> On Wed, Mar 06, 2024 at 08:44:48AM +0000, Shinichiro Kawasaki wrote:
> > > > sudo ./check nvme/045
> > > > nvme/045 (Test re-authentication)                            [failed]
> > > >     runtime  8.069s  ...  7.639s
> > > >     --- tests/nvme/045.out      2024-03-05 18:09:07.267668493 +0900
> > > >     +++ /home/shin/Blktests/blktests/results/nodev/nvme/045.out.bad     2024-03-05 18:10:07.735494384 +0900
> > > >     @@ -9,5 +9,6 @@
> > > >      Change hash to hmac(sha512)
> > > >      Re-authenticate with changed hash
> > > >      Renew host key on the controller and force reconnect
> > > >     -disconnected 0 controller(s)
> > > >     +controller "nvme1" not deleted within 5 seconds
> > > >     +disconnected 1 controller(s)
> > > >      Test complete
> > > 
> > > That means the host either successfully reconnected or never
> > > disconnected. We have another test case just for the disconnect test
> > > (number of queue changes), so if this test passes, it must be the
> > > former... Shouldn't really happen, this would mean the auth code has bug.
> > 
> > The test case nvme/048 passes, so this looks a bug.
> 
> I'll try to recreate it.
> 
> > > If you have these patches applied, the test should pass. But we might
> > > have still some more stuff to unify between the transports. The nvme/045
> > > test passes in my setup. Though I have seen runs which were hang for
> > > some reason. Haven't figured out yet what's happening there. But I
> > > haven't seen failures.
> > 
> > Still with the fix of the double-free, I observe the nvme/045 failure for rdma,
> > tcp and fc transports. I wonder where the difference between your system and
> > mine comes from.
> > 
> > FYI, here I share the kernel messages for rdma transport. It shows that
> > nvme_rdma_reconnect_or_remove() was called repeatedly and it tried to reconnect.
> > The status argument is -111 or 880, so I think the recon flag is always true
> > and no effect. I'm interested in the status values in your environment.
> 
> Do you have these patches applied:
> 
> https://lore.kernel.org/linux-nvme/20240305080005.3638-1-dwagner@suse.de/
> 
> ?

Yes. To be precise, I applied the v2 patches, but I believe they have the same
logic as v3. To be sure, I replaced the v2 patches with v3, then saw the same
failure.

> 
> > [   59.117607] run blktests nvme/045 at 2024-03-06 17:05:55
> > [   59.198629] (null): rxe_set_mtu: Set mtu to 1024
> > [   59.211185] PCLMULQDQ-NI instructions are not detected.
> > [   59.362952] infiniband ens3_rxe: set active
> > [   59.363765] infiniband ens3_rxe: added ens3
> > [   59.540499] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
> > [   59.560541] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
> > [   59.688866] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
> > [   59.701114] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> > [   59.702195] nvme nvme1: qid 0: controller authenticated
> > [   59.703310] nvme nvme1: qid 0: authenticated
> > [   59.707478] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
> > [   59.709883] nvme nvme1: creating 4 I/O queues.
> > [   59.745087] nvme nvme1: mapped 4/0/0 default/read/poll queues.
> > [   59.786869] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.15:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
> > [   59.999761] nvme nvme1: re-authenticating controller
> > [   60.010902] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> > [   60.011640] nvme nvme1: qid 0: controller authenticated
> > [   60.025652] nvme nvme1: re-authenticating controller
> > [   60.035349] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> > [   60.036375] nvme nvme1: qid 0: controller authenticated
> > [   60.050449] nvme nvme1: re-authenticating controller
> > [   60.060757] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
> > [   60.061460] nvme nvme1: qid 0: controller authenticated
> > [   62.662430] nvme nvme1: re-authenticating controller
> > [   62.859510] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
> > [   62.860502] nvme nvme1: qid 0: controller authenticated
> > [   63.029182] nvme nvme1: re-authenticating controller
> > [   63.192844] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
> > [   63.193900] nvme nvme1: qid 0: controller authenticated
> > [   63.608561] nvme nvme1: starting error recovery
> > [   63.653699] nvme nvme1: Reconnecting in 1 seconds...
> > [   64.712627] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
> > [   64.868896] nvmet: ctrl 1 qid 0 host response mismatch
> > [   64.870065] nvmet: ctrl 1 qid 0 failure1 (1)
> > [   64.871152] nvmet: ctrl 1 fatal error occurred!
> > [   64.871519] nvme nvme1: qid 0: authentication failed
> > [   64.874330] nvme nvme1: failed to connect queue: 0 ret=-111
> > [   64.878612] nvme nvme1: Failed reconnect attempt 1
> > [   64.880472] nvme nvme1: Reconnecting in 1 seconds...
> 
> This looks like the DNR bit is not considered in the reconnect_or_delete
> function.

Agreed. Let me share the kernel messages with the added printk() below.
We can see recon is true at every nvme_rdma_reconnect_or_remove() call.

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 507cbd6a3312..bfb23b9d5d82 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -988,6 +988,7 @@ static void nvme_rdma_reconnect_or_remove(struct nvme_rdma_ctrl *ctrl,
 	enum nvme_ctrl_state state = nvme_ctrl_state(&ctrl->ctrl);
 	bool recon = status > 0 && (status & NVME_SC_DNR) ? false : true;
 
+	printk("%s: status=%d recon=%d\n", __func__, status, recon);
 	/* If we are resetting/deleting then do nothing */
 	if (state != NVME_CTRL_CONNECTING) {
 		WARN_ON_ONCE(state == NVME_CTRL_NEW || state == NVME_CTRL_LIVE);

[  139.201909] run blktests nvme/045 at 2024-03-06 20:22:44
[  139.277895] (null): rxe_set_mtu: Set mtu to 1024
[  139.291307] PCLMULQDQ-NI instructions are not detected.
[  139.451087] infiniband ens3_rxe: set active
[  139.451856] infiniband ens3_rxe: added ens3
[  139.627815] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[  139.647899] nvmet_rdma: enabling port 0 (10.0.2.15:4420)
[  139.777796] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[  139.789952] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[  139.791053] nvme nvme1: qid 0: controller authenticated
[  139.792039] nvme nvme1: qid 0: authenticated
[  139.796131] nvme nvme1: Please enable CONFIG_NVME_MULTIPATH for full support of multi-port devices.
[  139.798258] nvme nvme1: creating 4 I/O queues.
[  139.832707] nvme nvme1: mapped 4/0/0 default/read/poll queues.
[  139.874555] nvme nvme1: new ctrl: NQN "blktests-subsystem-1", addr 10.0.2.15:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[  140.084928] nvme nvme1: re-authenticating controller
[  140.095934] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[  140.096705] nvme nvme1: qid 0: controller authenticated
[  140.111696] nvme nvme1: re-authenticating controller
[  140.121209] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[  140.121885] nvme nvme1: qid 0: controller authenticated
[  140.138435] nvme nvme1: re-authenticating controller
[  140.148513] nvme nvme1: qid 0: authenticated with hash hmac(sha256) dhgroup ffdhe2048
[  140.149344] nvme nvme1: qid 0: controller authenticated
[  142.748523] nvme nvme1: re-authenticating controller
[  142.944154] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
[  142.945023] nvme nvme1: qid 0: controller authenticated
[  143.127665] nvme nvme1: re-authenticating controller
[  143.293304] nvme nvme1: qid 0: authenticated with hash hmac(sha512) dhgroup ffdhe8192
[  143.294204] nvme nvme1: qid 0: controller authenticated
[  143.732511] nvme nvme1: starting error recovery
[  143.785440] nvme_rdma_reconnect_or_remove: status=-107 recon=1
[  143.786448] nvme nvme1: Reconnecting in 1 seconds...
[  144.834299] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[  144.990883] nvmet: ctrl 1 qid 0 host response mismatch
[  144.991937] nvmet: ctrl 1 qid 0 failure1 (1)
[  144.992655] nvmet: ctrl 1 fatal error occurred!
[  144.992953] nvme nvme1: qid 0: authentication failed
[  144.996695] nvme nvme1: failed to connect queue: 0 ret=-111
[  145.000090] nvme nvme1: Failed reconnect attempt 1
[  145.001717] nvme_rdma_reconnect_or_remove: status=-111 recon=1
[  145.003218] nvme nvme1: Reconnecting in 1 seconds...
[  146.116430] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[  146.272948] nvmet: ctrl 1 qid 0 host response mismatch
[  146.274869] nvmet: ctrl 1 qid 0 failure1 (1)
[  146.276687] nvmet: ctrl 1 fatal error occurred!
[  146.276891] nvme nvme1: qid 0: authentication failed
[  146.280591] nvme nvme1: failed to connect queue: 0 ret=-111
[  146.283783] nvme nvme1: Failed reconnect attempt 2
[  146.285339] nvme_rdma_reconnect_or_remove: status=-111 recon=1
[  146.287206] nvme nvme1: Reconnecting in 1 seconds...
[  147.395546] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[  147.563730] nvmet: ctrl 1 qid 0 host response mismatch
[  147.565628] nvmet: ctrl 1 qid 0 failure1 (1)
[  147.567256] nvmet: ctrl 1 fatal error occurred!
[  147.567411] nvme nvme1: qid 0: authentication failed
[  147.571022] nvme nvme1: failed to connect queue: 0 ret=-111
[  147.574003] nvme nvme1: Failed reconnect attempt 3
[  147.575463] nvme_rdma_reconnect_or_remove: status=-111 recon=1
[  147.576976] nvme nvme1: Reconnecting in 1 seconds...
[  148.676722] nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349 with DH-HMAC-CHAP.
[  148.842117] nvmet: ctrl 1 qid 0 host response mismatch
[  148.844543] nvmet: ctrl 1 qid 0 failure1 (1)
[  148.846369] nvmet: ctrl 1 fatal error occurred!
[  148.846478] nvme nvme1: qid 0: authentication failed
[  148.850217] nvme nvme1: failed to connect queue: 0 ret=-111
[  148.853850] nvme nvme1: Failed reconnect attempt 4
[  148.855425] nvme_rdma_reconnect_or_remove: status=-111 recon=1
[  148.857496] nvme nvme1: Reconnecting in 1 seconds...
[  149.873720] nvme nvme1: Identify namespace failed (880)
[  149.877168] nvme nvme1: Removing ctrl: NQN "blktests-subsystem-1"
[  149.927672] nvme nvme1: Property Set error: 880, offset 0x14
[  150.754847] rdma_rxe: unloaded