
nvmeof rdma regression issue on 4.14.0-rc1 (or maybe mlx4?)

Message ID 62eea88b-caa4-5799-3d8f-8d8789879aa8@grimberg.me (mailing list archive)
State Not Applicable

Commit Message

Sagi Grimberg Oct. 19, 2017, 6:55 a.m. UTC
>> Hi Yi,
>>
>> I was referring to the bug you reported on a simple create_ctrl failed:
>> https://pastebin.com/7z0XSGSd
>>
>> Does it reproduce?
>>
> yes, this issue was reproduced during "git bisect" with the patch below

OK, if this does not reproduce with the latest code, let's put it aside
for now.

So as for the error you see, can you please try the following patch?
--
  }
@@ -739,8 +744,6 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
  static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
                 bool remove)
  {
-       nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
-                       sizeof(struct nvme_command), DMA_TO_DEVICE);
         nvme_rdma_stop_queue(&ctrl->queues[0]);
         if (remove) {
                 blk_cleanup_queue(ctrl->ctrl.admin_q);
--

Comments

Yi Zhang Oct. 19, 2017, 8:23 a.m. UTC | #1
On 10/19/2017 02:55 PM, Sagi Grimberg wrote:
>
>>> Hi Yi,
>>>
>>> I was referring to the bug you reported on a simple create_ctrl failed:
>>> https://pastebin.com/7z0XSGSd
>>>
>>> Does it reproduce?
>>>
>> yes, this issue was reproduced during "git bisect" with the patch below
>
> OK, if this does not reproduce with the latest code, let's put it aside
> for now.
>
> So as for the error you see, can you please try the following patch?
Hi Sagi
With this patch, no such error log is found on the host side, but I found
there is no nvme0n1 device node even after getting "nvme nvme0:
Successfully reconnected" on the host.

Host side:
[   98.181089] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[   98.329464] nvme nvme0: creating 40 I/O queues.
[   98.835409] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  107.873586] nvme nvme0: Reconnecting in 10 seconds...
[  118.505937] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  118.513443] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  118.519875] nvme nvme0: Failed reconnect attempt 1
[  118.525241] nvme nvme0: Reconnecting in 10 seconds...
[  128.733311] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  128.740812] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  128.747247] nvme nvme0: Failed reconnect attempt 2
[  128.752609] nvme nvme0: Reconnecting in 10 seconds...
[  138.973404] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  138.980904] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  138.987329] nvme nvme0: Failed reconnect attempt 3
[  138.992691] nvme nvme0: Reconnecting in 10 seconds...
[  149.232610] nvme nvme0: creating 40 I/O queues.
[  149.831443] nvme nvme0: Successfully reconnected
[  149.831519] nvme nvme0: identifiers changed for nsid 1
[root@rdma-virt-01 linux ((dafb1b2...))]$ lsblk
NAME                           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                              8:0    0 465.8G  0 disk
├─sda2                           8:2    0 464.8G  0 part
│ ├─rhelaa_rdma--virt--01-swap 253:1    0     4G  0 lvm  [SWAP]
│ ├─rhelaa_rdma--virt--01-home 253:2    0 410.8G  0 lvm  /home
│ └─rhelaa_rdma--virt--01-root 253:0    0    50G  0 lvm  /
└─sda1                           8:1    0     1G  0 part /boot

> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 405895b1dff2..916658e010ff 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -572,6 +572,11 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
>         if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
>                 return;
>
> +       if(nvme_rdma_queue_idx(queue) == 0)
> +               nvme_rdma_free_qe(queue->device->dev,
> +                       &queue->ctrl->async_event_sqe,
> +                       sizeof(struct nvme_command), DMA_TO_DEVICE);
> +
>         nvme_rdma_destroy_queue_ib(queue);
>         rdma_destroy_id(queue->cm_id);
>  }
> @@ -739,8 +744,6 @@ static struct blk_mq_tag_set *nvme_rdma_alloc_tagset(struct nvme_ctrl *nctrl,
>  static void nvme_rdma_destroy_admin_queue(struct nvme_rdma_ctrl *ctrl,
>                 bool remove)
>  {
> -       nvme_rdma_free_qe(ctrl->queues[0].device->dev, &ctrl->async_event_sqe,
> -                       sizeof(struct nvme_command), DMA_TO_DEVICE);
>         nvme_rdma_stop_queue(&ctrl->queues[0]);
>         if (remove) {
>                 blk_cleanup_queue(ctrl->ctrl.admin_q);
> -- 
>

Sagi Grimberg Oct. 19, 2017, 9:44 a.m. UTC | #2
> Hi Sagi
> With this patch, no such error log is found on the host side,


Awesome, that's the culprit...

> but I found there is no nvme0n1 device node even after getting
> "nvme nvme0: Successfully reconnected" on the host.

That is expected because you did not persist a namespace UUID
which caused the kernel to generate a random one. That confused
the host as it got the same namespace ID with a different UUID.

Can you please set a uuid when you rerun the test?
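
To make the failure mode concrete: on revalidation after a reconnect, the
host compares the namespace identifiers (EUI-64, NGUID, UUID) it cached at
connect time against the ones the target reports now, and any mismatch makes
it drop the namespace instead of reattaching it, which is why the log above
shows "identifiers changed for nsid 1" but no nvme0n1. A simplified,
standalone C sketch of that comparison (an illustration only, not the actual
drivers/nvme/host/core.c code):
--
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for the identifier tuple the host caches per namespace. */
struct ns_ids {
	unsigned char eui64[8];
	unsigned char nguid[16];
	unsigned char uuid[16];
};

static int ns_ids_equal(const struct ns_ids *a, const struct ns_ids *b)
{
	return !memcmp(a->eui64, b->eui64, sizeof(a->eui64)) &&
	       !memcmp(a->nguid, b->nguid, sizeof(a->nguid)) &&
	       !memcmp(a->uuid, b->uuid, sizeof(a->uuid));
}

int main(void)
{
	/* UUID reported at connect time vs. the random one generated after
	 * the target restarted without a persisted uuid in its config. */
	struct ns_ids cached  = { .uuid = { 0xaa, 0xbb } };
	struct ns_ids current = { .uuid = { 0xcc, 0xdd } };

	if (!ns_ids_equal(&cached, &current))
		printf("identifiers changed for nsid %d\n", 1);
	return 0;
}
--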
Yi Zhang Oct. 19, 2017, 11:13 a.m. UTC | #3
On 10/19/2017 05:44 PM, Sagi Grimberg wrote:
>
>> Hi Sagi
>> With this patch, no such error log is found on the host side,
>
>
> Awesome, that's the culprit...
>
>> but I found there is no nvme0n1 device node even after getting
>> "nvme nvme0: Successfully reconnected" on the host.
>
> That is expected because you did not persist a namespace UUID
> which caused the kernel to generate a random one. That confused
> the host as it got the same namespace ID with a different UUID.
>
> Can you please set a uuid when you rerun the test?
>
I tried adding the uuid field to rdma.json, and it works well now.

[root@rdma-virt-00 ~]$ cat /etc/rdma.json
{
   "hosts": [
     {
       "nqn": "hostnqn"
     }
   ],
   "ports": [
     {
       "addr": {
         "adrfam": "ipv4",
         "traddr": "172.31.0.90",
         "treq": "not specified",
         "trsvcid": "4420",
         "trtype": "rdma"
       },
       "portid": 2,
       "referrals": [],
       "subsystems": [
         "testnqn"
       ]
     }
   ],
   "subsystems": [
     {
       "allowed_hosts": [],
       "attr": {
         "allow_any_host": "1"
       },
       "namespaces": [
         {
           "device": {
             "nguid": "ef90689c-6c46-d44c-89c1-4067801309a8",
             "path": "/dev/nullb0",
             "uuid": "00000000-0000-0000-0000-000000000001"
           },
           "enable": 1,
           "nsid": 1
         }
       ],
       "nqn": "testnqn"
     }
   ]
}
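
For context on why this fixes it: the "uuid" (and "nguid") values in the JSON
pin the namespace identifiers so they stay the same across target restarts,
letting the host see matching identifiers for nsid 1 after a reconnect and
reattach nvme0n1 instead of dropping it. Under the hood, nvmetcli writes these
values into the nvmet configfs attributes for the namespace; a minimal C sketch
of that write is below (the device_uuid attribute path, and the assumption that
it must be set before the namespace is enabled, are based on the 4.14-era nvmet
configfs layout and should be double-checked on the target kernel):
--
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assumed configfs attribute for the namespace UUID; write it before
	 * echoing 1 into the namespace's "enable" attribute. */
	const char *attr =
		"/sys/kernel/config/nvmet/subsystems/testnqn/namespaces/1/device_uuid";
	const char *uuid = "00000000-0000-0000-0000-000000000001\n";
	int fd = open(attr, O_WRONLY);

	if (fd < 0) {
		perror("open device_uuid");
		return 1;
	}
	if (write(fd, uuid, strlen(uuid)) < 0) {
		perror("write device_uuid");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}
--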



Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 405895b1dff2..916658e010ff 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -572,6 +572,11 @@ static void nvme_rdma_free_queue(struct nvme_rdma_queue *queue)
         if (!test_and_clear_bit(NVME_RDMA_Q_ALLOCATED, &queue->flags))
                 return;

+       if(nvme_rdma_queue_idx(queue) == 0)
+               nvme_rdma_free_qe(queue->device->dev,
+                       &queue->ctrl->async_event_sqe,
+                       sizeof(struct nvme_command), DMA_TO_DEVICE);
+
         nvme_rdma_destroy_queue_ib(queue);
         rdma_destroy_id(queue->cm_id);