
Cannot Connect NVMeoF At Certain NR_IO_Queues Values

Message ID 743c4b33-2bf8-7c76-95af-62043c2aaeeb@mellanox.com (mailing list archive)
State Not Applicable

Commit Message

Max Gurtovoy May 14, 2018, 10:46 p.m. UTC
Hi Joseph,


On 5/14/2018 8:46 PM, Gruher, Joseph R wrote:
> I'm running Ubuntu 18.04 with the included 4.15.0 kernel, Mellanox CX4 NICs, and Intel P4800X SSDs.  I'm using NVMe-CLI v1.5 and nvmetcli v0.6.
> 
> I am getting a connect failure even at a relatively moderate nr_io_queues value such as 8:
> 
> rsa@tppjoe01:~$ sudo nvme connect -t rdma -a 10.6.0.16 -i 8 -n NQN1
> Failed to write to /dev/nvme-fabrics: Invalid cross-device link
> 
> However, it works just fine if I use a smaller value, such as 4:
> 
> rsa@tppjoe01:~$ sudo nvme connect -t rdma -a 10.6.0.16 -i 4 -n NQN1
> rsa@tppjoe01:~$
> 
> Target side dmesg from a failed attempt with -i 8:
> 
> [425470.899691] nvmet: creating controller 1 for subsystem NQN1 for NQN nqn.2014-08.org.nvmexpress:uuid:8d0ac789-9136-4275-a46c-8d1223c8fe84.
> [425471.081358] nvmet: adding queue 1 to ctrl 1.
> [425471.081563] nvmet: adding queue 2 to ctrl 1.
> [425471.081758] nvmet: adding queue 3 to ctrl 1.
> [425471.110059] nvmet_rdma: freeing queue 3
> [425471.110946] nvmet_rdma: freeing queue 1
> [425471.111905] nvmet_rdma: freeing queue 2
> [425471.382128] nvmet_rdma: freeing queue 4
> [425471.522836] nvmet_rdma: freeing queue 5
> [425471.640105] nvmet_rdma: freeing queue 7
> [425471.669427] nvmet_rdma: freeing queue 6
> [425471.670107] nvmet_rdma: freeing queue 0
> [425471.692922] nvmet_rdma: freeing queue 8
> 
> Initiator side dmesg from the same attempt:
> 
> [862316.209664] nvme nvme1: creating 8 I/O queues.
> [862316.391411] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
> [862316.406271] nvme nvme1: failed to connect queue: 4 ret=-18

IMO this issue was fixed in the mlx5_core function mlx5_get_vector_affinity.
There was a long discussion regarding this fix, and it will be fixed again
in 4.17. Once the final fix lands, it should go to the stable kernels as well.
Meanwhile I can suggest a fast workaround for you if needed (or other
solutions as well); the diff is shown in the Patch section at the bottom of
this page.


-Max.
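
For context on the errors quoted above: when the CPU-to-queue mapping built
from the device's IRQ affinity hints leaves an I/O queue with no CPU,
allocating the connect request for that queue fails with -EXDEV (18), which
nvme-cli surfaces as "Failed to write to /dev/nvme-fabrics: Invalid
cross-device link" and the host logs as ret=-18. The following standalone
sketch (not the actual kernel code; the queue count and per-queue CPU counts
are invented for illustration) shows the shape of that failure:

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define NR_IO_QUEUES 8

int main(void)
{
        /* Invented per-queue CPU counts standing in for a broken affinity
         * mapping; queue 4 ends up with no CPU assigned, loosely mirroring
         * the "failed to connect queue: 4" message in the log above. */
        unsigned int cpus_per_queue[NR_IO_QUEUES] = { 2, 2, 2, 2, 0, 2, 2, 2 };

        for (int q = 0; q < NR_IO_QUEUES; q++) {
                if (cpus_per_queue[q] == 0) {
                        int ret = -EXDEV;       /* -18 */
                        printf("failed to connect queue: %d ret=%d (%s)\n",
                               q, ret, strerror(-ret));
                        return 1;
                }
                printf("queue %d mapped to %u CPU(s)\n", q, cpus_per_queue[q]);
        }
        return 0;
}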

Comments

Gruher, Joseph R May 14, 2018, 11:09 p.m. UTC | #1
> IMO this issue was fixed in the mlx5_core function mlx5_get_vector_affinity.
> There was a long discussion regarding this fix, and it will be fixed again in 4.17.
> Once the final fix lands, it should go to the stable kernels as well.

Thanks Max.  So does the current 4.17-rc5 have the fix now?  Or is the fix not in yet?

> Meanwhile I can suggest a fast workaround for you if needed (or other
> solutions as well):
> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index
> 0f840ec..dd92cb9 100644
> --- a/drivers/nvme/host/rdma.c
> +++ b/drivers/nvme/host/rdma.c
> @@ -2236,7 +2236,7 @@ static int nvme_rdma_map_queues(struct
> blk_mq_tag_set *set)
>          .init_hctx      = nvme_rdma_init_hctx,
>          .poll           = nvme_rdma_poll,
>          .timeout        = nvme_rdma_timeout,
> -       .map_queues     = nvme_rdma_map_queues,
>   };

Thanks, I'll give this a shot.
Max Gurtovoy May 15, 2018, 2:04 p.m. UTC | #2
On 5/15/2018 2:09 AM, Gruher, Joseph R wrote:
>> IMO this issue was fixed in the mlx5_core function mlx5_get_vector_affinity.
>> There was a long discussion regarding this fix, and it will be fixed again in 4.17.
>> Once the final fix lands, it should go to the stable kernels as well.
> 
> Thanks Max.  So does the current 4.17-rc5 have the fix now?  Or is the fix not in yet?

The latest fix is not there. SaeedM should push it this week.

> 
>> Meanwhile I can suggest a fast workaround for you if needed (or other
>> solutions as well):
>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c index
>> 0f840ec..dd92cb9 100644
>> --- a/drivers/nvme/host/rdma.c
>> +++ b/drivers/nvme/host/rdma.c
>> @@ -2236,7 +2236,7 @@ static int nvme_rdma_map_queues(struct
>> blk_mq_tag_set *set)
>>           .init_hctx      = nvme_rdma_init_hctx,
>>           .poll           = nvme_rdma_poll,
>>           .timeout        = nvme_rdma_timeout,
>> -       .map_queues     = nvme_rdma_map_queues,
>>    };
> 
> Thanks, I'll give this a shot.
> 

Great.
Gruher, Joseph R May 15, 2018, 8:28 p.m. UTC | #3
> >> Meanwhile I can suggest a fast workaround for you if needed (or other
> >> solutions as well):
> >> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
> >> index
> >> 0f840ec..dd92cb9 100644
> >> --- a/drivers/nvme/host/rdma.c
> >> +++ b/drivers/nvme/host/rdma.c
> >> @@ -2236,7 +2236,7 @@ static int nvme_rdma_map_queues(struct
> >> blk_mq_tag_set *set)
> >>           .init_hctx      = nvme_rdma_init_hctx,
> >>           .poll           = nvme_rdma_poll,
> >>           .timeout        = nvme_rdma_timeout,
> >> -       .map_queues     = nvme_rdma_map_queues,
> >>    };
> >
> > Thanks, I'll give this a shot.
> >
> 
> Great.

Max, I'm not having much luck rebuilding with this code change; to be honest, building the kernel from source is not something I'm used to doing.  You mentioned other workarounds may exist; are there any that don't involve rebuilding the kernel from source?

Thanks!
Max Gurtovoy May 15, 2018, 9:25 p.m. UTC | #4
On 5/15/2018 11:28 PM, Gruher, Joseph R wrote:
> 
>>>> Meanwhile I can suggest a fast workaround for you if needed (or other
>>>> solutions as well):
>>>> diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
>>>> index
>>>> 0f840ec..dd92cb9 100644
>>>> --- a/drivers/nvme/host/rdma.c
>>>> +++ b/drivers/nvme/host/rdma.c
>>>> @@ -2236,7 +2236,7 @@ static int nvme_rdma_map_queues(struct
>>>> blk_mq_tag_set *set)
>>>>            .init_hctx      = nvme_rdma_init_hctx,
>>>>            .poll           = nvme_rdma_poll,
>>>>            .timeout        = nvme_rdma_timeout,
>>>> -       .map_queues     = nvme_rdma_map_queues,
>>>>     };
>>>
>>> Thanks, I'll give this a shot.
>>>
>>
>> Great.
> 
> Max, I'm not having much luck rebuilding with this code change; to be honest, building the kernel from source is not something I'm used to doing.  You mentioned other workarounds may exist; are there any that don't involve rebuilding the kernel from source?

Another workaround is to install the MLNX_OFED package.
If this option works for you, please send me an email.

> 
> Thanks!
> 
Gruher, Joseph R June 6, 2018, 2:29 a.m. UTC | #5
> On 5/15/2018 2:09 AM, Gruher, Joseph R wrote:
> >> IMO this issue was fixed in the mlx5_core function mlx5_get_vector_affinity.
> >> There was a long discussion regarding this fix, and it will be fixed
> >> again in 4.17. Once the final fix lands, it should go to the stable kernels as well.
> >
> > Thanks Max.  So does the current 4.17-rc5 have the fix now?  Or is the fix not in yet?
> 
> The latest fix is not there. SaeedM should push it this week.
> 

Hey Max, I was just hoping to confirm whether this fix made it in.  Do you know which kernel release will be the first to include it?
Max Gurtovoy June 6, 2018, 9:27 a.m. UTC | #6
On 6/6/2018 5:29 AM, Gruher, Joseph R wrote:
>> On 5/15/2018 2:09 AM, Gruher, Joseph R wrote:
>>>> IMO this issue was fixed in the mlx5_core function mlx5_get_vector_affinity.
>>>> There was a long discussion regarding this fix, and it will be fixed
>>>> again in 4.17. Once the final fix lands, it should go to the stable kernels as well.
>>>
>>> Thanks Max.  So does the current 4.17-rc5 have the fix now?  Or is the fix not in yet?
>>
>> The latest fix is not there. SaeedM should push it this week.
>>
> 
> Hey Max, I was just hoping to confirm whether this fix made it in.  Do you know which kernel release will be the first to include it?
> 

Hey,

Commit e3ca34880652250f524022ad89e516f8ba9a805b was pushed to 4.17.

-Max.

Patch

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 0f840ec..dd92cb9 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -2236,7 +2236,7 @@ static int nvme_rdma_map_queues(struct blk_mq_tag_set *set)
         .init_hctx      = nvme_rdma_init_hctx,
         .poll           = nvme_rdma_poll,
         .timeout        = nvme_rdma_timeout,
-       .map_queues     = nvme_rdma_map_queues,
  };
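
For what it's worth, the effect of removing the .map_queues callback above
(assuming nothing is put back in its place) is that blk-mq falls back to its
own default CPU-to-queue spreading rather than the RDMA device's IRQ affinity
hints, so every hardware queue gets at least one CPU and the connect no longer
trips over an empty queue. A rough standalone approximation of that default
spreading, with invented CPU and queue counts:

#include <stdio.h>

#define NR_CPUS      16
#define NR_IO_QUEUES 8

int main(void)
{
        int cpus_in_queue[NR_IO_QUEUES] = { 0 };

        /* Round-robin spread of CPUs over queues, a rough stand-in for the
         * default blk-mq mapping used when no driver callback is set. */
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                cpus_in_queue[cpu % NR_IO_QUEUES]++;

        for (int q = 0; q < NR_IO_QUEUES; q++)
                printf("queue %d: %d CPU(s)\n", q, cpus_in_queue[q]);

        return 0;
}

The trade-off of this workaround is losing the affinity-aware queue placement
the custom mapping was meant to provide; the proper fix referenced in the
thread (commit e3ca34880652250f524022ad89e516f8ba9a805b, merged for 4.17)
addresses the underlying affinity reporting in mlx5_get_vector_affinity
instead.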