kernel NULL pointer observed on initiator side after 'nvmetcli clear' on target side

Message ID 6ffda302-02f9-12f0-a112-ea7cd20b9ffa@grimberg.me (mailing list archive)
State Rejected

Commit Message

Sagi Grimberg March 9, 2017, 11:23 a.m. UTC
> Hi Sagi
> With this patch, the NULL pointer issue is fixed now.
> But from the log below, we can see it keeps reconnecting every 10
> seconds and cannot be stopped.
>
> [36288.963890] Broke affinity for irq 16
> [36288.983090] Broke affinity for irq 28
> [36289.003104] Broke affinity for irq 90
> [36289.020488] Broke affinity for irq 93
> [36289.036911] Broke affinity for irq 97
> [36289.053344] Broke affinity for irq 100
> [36289.070166] Broke affinity for irq 104
> [36289.088076] smpboot: CPU 1 is now offline
> [36302.371160] nvme nvme0: reconnecting in 10 seconds
> [36312.953684] blk_mq_reinit_tagset: tag is null, continue
> [36312.983267] nvme nvme0: Connect rejected: status 8 (invalid service ID).
> [36313.017290] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [36313.044937] nvme nvme0: Failed reconnect attempt, requeueing...
> [36323.171983] blk_mq_reinit_tagset: tag is null, continue
> [36323.200733] nvme nvme0: Connect rejected: status 8 (invalid service ID).
> [36323.233820] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [36323.261027] nvme nvme0: Failed reconnect attempt, requeueing...
> [36333.412341] blk_mq_reinit_tagset: tag is null, continue
> [36333.441346] nvme nvme0: Connect rejected: status 8 (invalid service ID).
> [36333.476139] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [36333.502794] nvme nvme0: Failed reconnect attempt, requeueing...
> [36343.652755] blk_mq_reinit_tagset: tag is null, continue
> [36343.682103] nvme nvme0: Connect rejected: status 8 (invalid service ID).
> [36343.716645] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [36343.743581] nvme nvme0: Failed reconnect attempt, requeueing...
> [36353.893103] blk_mq_reinit_tagset: tag is null, continue
> [36353.921041] nvme nvme0: Connect rejected: status 8 (invalid service ID).
> [36353.953541] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [36353.983528] nvme nvme0: Failed reconnect attempt, requeueing...
> [36364.133544] blk_mq_reinit_tagset: tag is null, continue
> [36364.162012] nvme nvme0: Connect rejected: status 8 (invalid service ID).
> [36364.195002] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [36364.221671] nvme nvme0: Failed reconnect attempt, requeueing...
>

Yep... looks like we don't take into account that we can't use all the
queues now...
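
The gist (a rough sketch of the idea only, the real change is the patch
below): re-derive the number of I/O queues every time we (re)initialize
them, clamped to the CPUs that are online right now, instead of trusting
the original opts->nr_io_queues forever:

    /* sketch only, see the actual diff below */
    nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
    ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
    if (ret)
            return ret;
    ctrl->queue_count = nr_io_queues + 1;   /* +1 for the admin queue */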

Does this patch help:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 29ac8fcb8d2c..25af3f75f6f1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -337,8 +337,6 @@ static int __nvme_rdma_init_request(struct nvme_rdma_ctrl *ctrl,
         struct ib_device *ibdev = dev->dev;
         int ret;

-       BUG_ON(queue_idx >= ctrl->queue_count);
-
         ret = nvme_rdma_alloc_qe(ibdev, &req->sqe, sizeof(struct nvme_command),
                         DMA_TO_DEVICE);
         if (ret)
@@ -647,8 +645,22 @@ static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)

  static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)
  {
+       struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
+       unsigned int nr_io_queues;
         int i, ret;

+       nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
+       ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
+       if (ret)
+               return ret;
+
+       ctrl->queue_count = nr_io_queues + 1;
+       if (ctrl->queue_count < 2)
+               return 0;
+
+       dev_info(ctrl->ctrl.device,
+               "creating %d I/O queues.\n", nr_io_queues);
+
         for (i = 1; i < ctrl->queue_count; i++) {
                 ret = nvme_rdma_init_queue(ctrl, i,
                                            ctrl->ctrl.opts->queue_size);
@@ -1793,20 +1805,8 @@ static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {

  static int nvme_rdma_create_io_queues(struct nvme_rdma_ctrl *ctrl)
  {
-       struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
         int ret;

-       ret = nvme_set_queue_count(&ctrl->ctrl, &opts->nr_io_queues);
-       if (ret)
-               return ret;
-
-       ctrl->queue_count = opts->nr_io_queues + 1;
-       if (ctrl->queue_count < 2)
-               return 0;
-
-       dev_info(ctrl->ctrl.device,
-               "creating %d I/O queues.\n", opts->nr_io_queues);
-
--

Comments

Yi Zhang March 10, 2017, 7:59 a.m. UTC | #1
>>
>
> Yep... looks like we don't take into account that we can't use all the
> queues now...
>
> Does this patch help:
I can still reproduce the reconnect-every-10-seconds issue with the patch;
here is the log:

[  193.574183] nvme nvme0: new ctrl: NQN "nvme-subsystem-name", addr 172.31.2.3:1023
[  193.612039] __nvme_rdma_init_request: changing called
[  193.638723] __nvme_rdma_init_request: changing called
[  193.661767] __nvme_rdma_init_request: changing called
[  193.684579] __nvme_rdma_init_request: changing called
[  193.707327] __nvme_rdma_init_request: changing called
[  193.730071] __nvme_rdma_init_request: changing called
[  193.752896] __nvme_rdma_init_request: changing called
[  193.775699] __nvme_rdma_init_request: changing called
[  193.798813] __nvme_rdma_init_request: changing called
[  193.821257] __nvme_rdma_init_request: changing called
[  193.844090] __nvme_rdma_init_request: changing called
[  193.866472] __nvme_rdma_init_request: changing called
[  193.889375] __nvme_rdma_init_request: changing called
[  193.912094] __nvme_rdma_init_request: changing called
[  193.934942] __nvme_rdma_init_request: changing called
[  193.957688] __nvme_rdma_init_request: changing called
[  606.273376] Broke affinity for irq 16
[  606.291940] Broke affinity for irq 28
[  606.310201] Broke affinity for irq 90
[  606.328211] Broke affinity for irq 93
[  606.346263] Broke affinity for irq 97
[  606.364314] Broke affinity for irq 100
[  606.382105] Broke affinity for irq 104
[  606.400727] smpboot: CPU 1 is now offline
[  616.820505] nvme nvme0: reconnecting in 10 seconds
[  626.882747] blk_mq_reinit_tagset: tag is null, continue
[  626.914000] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  626.947965] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  626.974673] nvme nvme0: Failed reconnect attempt, requeueing...
[  637.100252] blk_mq_reinit_tagset: tag is null, continue
[  637.129200] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  637.163578] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  637.190246] nvme nvme0: Failed reconnect attempt, requeueing...
[  647.340147] blk_mq_reinit_tagset: tag is null, continue
[  647.367612] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  647.402527] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  647.430338] nvme nvme0: Failed reconnect attempt, requeueing...
[  657.579993] blk_mq_reinit_tagset: tag is null, continue
[  657.608478] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  657.643947] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  657.670579] nvme nvme0: Failed reconnect attempt, requeueing...
[  667.819897] blk_mq_reinit_tagset: tag is null, continue
[  667.848786] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  667.881951] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  667.908578] nvme nvme0: Failed reconnect attempt, requeueing...
[  678.059821] blk_mq_reinit_tagset: tag is null, continue
[  678.089295] nvme nvme0: Connect rejected: status 8 (invalid service ID).
[  678.123602] nvme nvme0: rdma_resolve_addr wait failed (-104).
[  678.150317] nvme nvme0: Failed reconnect attempt, requeueing...


> -- 
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> index 29ac8fcb8d2c..25af3f75f6f1 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -337,8 +337,6 @@ static int __nvme_rdma_init_request(struct nvme_rdma_ctrl *ctrl,
>         struct ib_device *ibdev = dev->dev;
>         int ret;
>
> -       BUG_ON(queue_idx >= ctrl->queue_count);
> -
>         ret = nvme_rdma_alloc_qe(ibdev, &req->sqe, sizeof(struct nvme_command),
>                         DMA_TO_DEVICE);
>         if (ret)
> @@ -647,8 +645,22 @@ static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)
>
>  static int nvme_rdma_init_io_queues(struct nvme_rdma_ctrl *ctrl)
>  {
> +       struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
> +       unsigned int nr_io_queues;
>         int i, ret;
>
> +       nr_io_queues = min(opts->nr_io_queues, num_online_cpus());
> +       ret = nvme_set_queue_count(&ctrl->ctrl, &nr_io_queues);
> +       if (ret)
> +               return ret;
> +
> +       ctrl->queue_count = nr_io_queues + 1;
> +       if (ctrl->queue_count < 2)
> +               return 0;
> +
> +       dev_info(ctrl->ctrl.device,
> +               "creating %d I/O queues.\n", nr_io_queues);
> +
>         for (i = 1; i < ctrl->queue_count; i++) {
>                 ret = nvme_rdma_init_queue(ctrl, i,
>                                            ctrl->ctrl.opts->queue_size);
> @@ -1793,20 +1805,8 @@ static const struct nvme_ctrl_ops nvme_rdma_ctrl_ops = {
>
>  static int nvme_rdma_create_io_queues(struct nvme_rdma_ctrl *ctrl)
>  {
> -       struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;
>         int ret;
>
> -       ret = nvme_set_queue_count(&ctrl->ctrl, &opts->nr_io_queues);
> -       if (ret)
> -               return ret;
> -
> -       ctrl->queue_count = opts->nr_io_queues + 1;
> -       if (ctrl->queue_count < 2)
> -               return 0;
> -
> -       dev_info(ctrl->ctrl.device,
> -               "creating %d I/O queues.\n", opts->nr_io_queues);
> -
> -- 

Sagi Grimberg March 13, 2017, 8:09 a.m. UTC | #2
>> Yep... looks like we don't take into account that we can't use all the
>> queues now...
>>
>> Does this patch help:
> I can still reproduce the reconnect-every-10-seconds issue with the patch;
> here is the log:
>
> [  193.574183] nvme nvme0: new ctrl: NQN "nvme-subsystem-name", addr
> 172.31.2.3:1023
> [  193.612039] __nvme_rdma_init_request: changing called
> [  193.638723] __nvme_rdma_init_request: changing called
> [  193.661767] __nvme_rdma_init_request: changing called
> [  193.684579] __nvme_rdma_init_request: changing called
> [  193.707327] __nvme_rdma_init_request: changing called
> [  193.730071] __nvme_rdma_init_request: changing called
> [  193.752896] __nvme_rdma_init_request: changing called
> [  193.775699] __nvme_rdma_init_request: changing called
> [  193.798813] __nvme_rdma_init_request: changing called
> [  193.821257] __nvme_rdma_init_request: changing called
> [  193.844090] __nvme_rdma_init_request: changing called
> [  193.866472] __nvme_rdma_init_request: changing called
> [  193.889375] __nvme_rdma_init_request: changing called
> [  193.912094] __nvme_rdma_init_request: changing called
> [  193.934942] __nvme_rdma_init_request: changing called
> [  193.957688] __nvme_rdma_init_request: changing called
> [  606.273376] Broke affinity for irq 16
> [  606.291940] Broke affinity for irq 28
> [  606.310201] Broke affinity for irq 90
> [  606.328211] Broke affinity for irq 93
> [  606.346263] Broke affinity for irq 97
> [  606.364314] Broke affinity for irq 100
> [  606.382105] Broke affinity for irq 104
> [  606.400727] smpboot: CPU 1 is now offline
> [  616.820505] nvme nvme0: reconnecting in 10 seconds
> [  626.882747] blk_mq_reinit_tagset: tag is null, continue
> [  626.914000] nvme nvme0: Connect rejected: status 8 (invalid service ID).
> [  626.947965] nvme nvme0: rdma_resolve_addr wait failed (-104).
> [  626.974673] nvme nvme0: Failed reconnect attempt, requeueing...

This is strange...

Is the target alive? I'm assuming it didn't crash here, correct?

Yi Zhang March 14, 2017, 1:27 p.m. UTC | #3
On 03/13/2017 04:09 PM, Sagi Grimberg wrote:
>
>>> Yep... looks like we don't take into account that we can't use all the
>>> queues now...
>>>
>>> Does this patch help:
>> I can still reproduce the reconnect-every-10-seconds issue with the patch;
>> here is the log:
>>
>> [  193.574183] nvme nvme0: new ctrl: NQN "nvme-subsystem-name", addr
>> 172.31.2.3:1023
>> [  193.612039] __nvme_rdma_init_request: changing called
>> [  193.638723] __nvme_rdma_init_request: changing called
>> [  193.661767] __nvme_rdma_init_request: changing called
>> [  193.684579] __nvme_rdma_init_request: changing called
>> [  193.707327] __nvme_rdma_init_request: changing called
>> [  193.730071] __nvme_rdma_init_request: changing called
>> [  193.752896] __nvme_rdma_init_request: changing called
>> [  193.775699] __nvme_rdma_init_request: changing called
>> [  193.798813] __nvme_rdma_init_request: changing called
>> [  193.821257] __nvme_rdma_init_request: changing called
>> [  193.844090] __nvme_rdma_init_request: changing called
>> [  193.866472] __nvme_rdma_init_request: changing called
>> [  193.889375] __nvme_rdma_init_request: changing called
>> [  193.912094] __nvme_rdma_init_request: changing called
>> [  193.934942] __nvme_rdma_init_request: changing called
>> [  193.957688] __nvme_rdma_init_request: changing called
>> [  606.273376] Broke affinity for irq 16
>> [  606.291940] Broke affinity for irq 28
>> [  606.310201] Broke affinity for irq 90
>> [  606.328211] Broke affinity for irq 93
>> [  606.346263] Broke affinity for irq 97
>> [  606.364314] Broke affinity for irq 100
>> [  606.382105] Broke affinity for irq 104
>> [  606.400727] smpboot: CPU 1 is now offline
>> [  616.820505] nvme nvme0: reconnecting in 10 seconds
>> [  626.882747] blk_mq_reinit_tagset: tag is null, continue
>> [  626.914000] nvme nvme0: Connect rejected: status 8 (invalid 
>> service ID).
>> [  626.947965] nvme nvme0: rdma_resolve_addr wait failed (-104).
>> [  626.974673] nvme nvme0: Failed reconnect attempt, requeueing...
>
> This is strange...
>
> Is the target alive? I'm assuming it didn't crash here, correct?
The target was deleted by the 'nvmetcli clear' command.
Then, on the client side, it seems the host doesn't know the target was deleted
and keeps reconnecting every 10 seconds.

Sagi Grimberg March 16, 2017, 4:40 p.m. UTC | #4
> The target was deleted by the 'nvmetcli clear' command.
> Then, on the client side, it seems the host doesn't know the target was deleted
> and keeps reconnecting every 10 seconds.

Oh, so the target doesn't come back. That makes sense. The host
doesn't know if/when the target will come back, so it attempts to reconnect
periodically forever.
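
(For context, this is roughly what the reconnect path does today, paraphrased
from memory rather than quoted verbatim: the delayed work simply re-arms
itself after reconnect_delay, with no attempt counter or deadline, which is
exactly the endless loop in your log.)

    /* paraphrased sketch, not verbatim kernel code */
    static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
    {
            struct nvme_rdma_ctrl *ctrl = container_of(to_delayed_work(work),
                            struct nvme_rdma_ctrl, reconnect_work);

            if (nvme_rdma_try_reconnect(ctrl)) {    /* hypothetical helper, non-zero on failure */
                    dev_info(ctrl->ctrl.device,
                             "Failed reconnect attempt, requeueing...\n");
                    /* unconditional re-arm: nothing ever makes us give up */
                    queue_delayed_work(nvme_rdma_wq, &ctrl->reconnect_work,
                                       ctrl->ctrl.opts->reconnect_delay * HZ);
            }
    }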

I think what you're asking for is a "dev_loss_tmo" kind of functionality
where the host gives up on the controller, correct?
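
Something along these lines could do it (rough sketch, the option and field
names here are made up, e.g. a "ctrl_loss_tmo" connect option and a
per-controller reconnect counter): keep retrying only while the controller
has been unreachable for less than the user-configured limit, otherwise
delete it.

    /* rough sketch, hypothetical names */
    static bool nvme_rdma_should_reconnect(struct nvme_rdma_ctrl *ctrl)
    {
            struct nvmf_ctrl_options *opts = ctrl->ctrl.opts;

            if (opts->ctrl_loss_tmo < 0)    /* negative: retry forever */
                    return true;

            return ctrl->nr_reconnects * opts->reconnect_delay <
                                            opts->ctrl_loss_tmo;
    }

The requeue path would then check this before re-arming the work, and once
it returns false, schedule controller removal instead of yet another
reconnect attempt.
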
Yi Zhang March 18, 2017, 12:06 p.m. UTC | #5
On 03/17/2017 12:40 AM, Sagi Grimberg wrote:
>
>> The target was deleted by the 'nvmetcli clear' command.
>> Then, on the client side, it seems the host doesn't know the target was deleted
>> and keeps reconnecting every 10 seconds.
>
> Oh, so the target doesn't come back. That makes sense. The host
> doesn't know if/when the target will come back, so it attempts to reconnect
> periodically forever.
>
> I think what you're asking for is a "dev_loss_tmo" kind of functionality
> where the host gives up on the controller, correct?
Hi Sagi
Yes, since the target was deleted, the client doesn't realize it
and keeps trying to reconnect.
I think it's better to stop the reconnect operation at some point. Or is
there any other good idea?
Since I'm a newbie and not familiar with nvme-of, correct me if my
thought isn't reasonable. :)

Thanks
Yi


Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 29ac8fcb8d2c..25af3f75f6f1 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -337,8 +337,6 @@  static int __nvme_rdma_init_request(struct nvme_rdma_ctrl *ctrl,
         struct ib_device *ibdev = dev->dev;
         int ret;

-       BUG_ON(queue_idx >= ctrl->queue_count);
-
         ret = nvme_rdma_alloc_qe(ibdev, &req->sqe, sizeof(struct nvme_command),