Message ID | 3f5da1e4-99f1-8376-a291-e50a7d52c303@suse.com (mailing list archive) |
---|---|
State | New, archived |
Series | xprtrdma: Make sure Send CQ is allocated on an existing CPU |
Hi Nicolas-

> On Jan 23, 2019, at 8:12 AM, Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> wrote:
>
> Make sure host has at least 2 CPU before allocating to CPU#1

The fourth parameter of ib_alloc_cq() is not a CPU number,
it's a completion vector number. What failure did you see
that prompted this patch?

> Fixes: a4699f5647f3 (xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode)
> Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
> ---
>  net/sunrpc/xprtrdma/verbs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index b725911c0f3f..36aa7b2648e4 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -546,7 +546,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
>
>  	sendcq = ib_alloc_cq(ia->ri_device, NULL,
>  			     ep->rep_attr.cap.max_send_wr + 1,
> -			     1, IB_POLL_WORKQUEUE);
> +			     num_online_cpus() > 1 ? 1 : 0, IB_POLL_WORKQUEUE);
>  	if (IS_ERR(sendcq)) {
>  		rc = PTR_ERR(sendcq);
>  		dprintk("RPC: %s: failed to create send CQ: %i\n",
> --
> 2.18.0
>

--
Chuck Lever

On 1/23/19 5:51 PM, Chuck Lever wrote:
> Hi Nicolas-
>
>> On Jan 23, 2019, at 8:12 AM, Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> wrote:
>>
>> Make sure host has at least 2 CPU before allocating to CPU#1
> The fourth parameter of ib_alloc_cq() is not a CPU number,
> it's a completion vector number. What failure did you see
> that prompted this patch?

When trying to mount, I get this:
+ mount -o rdma,port=20049 192.168.20.15:/tmp/RAM /tmp/RAM
mount.nfs: mounting 192.168.20.15:/tmp/RAM failed, reason given by server: No such file or directory

Digging a bit into the code, it appears that the CQ allocation here fails with ENOENT, which comes from mlx5_vector2eqn().
On my system (a VM with an mlx5 card with SR-IOV), the comp_eqs_list contains only one entry, with index == 0.

Nicolas

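To illustrate the failure described above, here is a minimal user-space sketch of a vector-to-EQ lookup. It is not the mlx5 driver code: struct comp_eq, vector_to_eqn() and the EQ number used in the example are hypothetical stand-ins, and the only assumption taken from the report is that the lookup walks a list of completion EQs and returns ENOENT when no entry matches the requested vector.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of a completion EQ list. */
struct comp_eq {
	int index;	/* completion vector served by this EQ */
	int eqn;	/* EQ number handed back to the caller */
};

/*
 * Simplified model of the lookup: scan the list for an entry whose
 * index equals the requested completion vector. Returns 0 and stores
 * the EQ number on a match, or -ENOENT when no entry matches.
 */
int vector_to_eqn(const struct comp_eq *eqs, size_t count, int vector, int *eqn)
{
	assert(eqn != NULL);

	for (size_t i = 0; i < count; i++) {
		if (eqs[i].index == vector) {
			*eqn = eqs[i].eqn;
			return 0;
		}
	}
	return -ENOENT;
}
```

With a list holding a single entry whose index is 0, a request for vector 0 succeeds and a request for vector 1 returns -ENOENT, which matches the "No such file or directory" text in the mount output.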
On 1/23/19 6:06 PM, Nicolas Morey-Chaisemartin wrote:
>
> On 1/23/19 5:51 PM, Chuck Lever wrote:
>> Hi Nicolas-
>>
>>> On Jan 23, 2019, at 8:12 AM, Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> wrote:
>>>
>>> Make sure host has at least 2 CPU before allocating to CPU#1
>> The fourth parameter of ib_alloc_cq() is not a CPU number,
>> it's a completion vector number. What failure did you see
>> that prompted this patch?
> When trying to mount, I get this:
> + mount -o rdma,port=20049 192.168.20.15:/tmp/RAM /tmp/RAM
> mount.nfs: mounting 192.168.20.15:/tmp/RAM failed, reason given by server: No such file or directory
>
> Digging a bit into the code, it appears that the cq allocation here returns a ENOENT which come from mlx5_vector2eqn.
> On my system (VM with a mlx5 card with SRIOV), the comp_eqs_list only contains one entry with index == 0
>
> Nicolas
>

Also, adding a 2nd core to my VM fixes the issue (thus my understanding that it was a CPU number)

> On Jan 23, 2019, at 12:07 PM, Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.de> wrote:
>
>
>
> On 1/23/19 6:06 PM, Nicolas Morey-Chaisemartin wrote:
>>
>> On 1/23/19 5:51 PM, Chuck Lever wrote:
>>> Hi Nicolas-
>>>
>>>> On Jan 23, 2019, at 8:12 AM, Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com> wrote:
>>>>
>>>> Make sure host has at least 2 CPU before allocating to CPU#1
>>> The fourth parameter of ib_alloc_cq() is not a CPU number,
>>> it's a completion vector number. What failure did you see
>>> that prompted this patch?
>> When trying to mount, I get this:
>> + mount -o rdma,port=20049 192.168.20.15:/tmp/RAM /tmp/RAM
>> mount.nfs: mounting 192.168.20.15:/tmp/RAM failed, reason given by server: No such file or directory
>>
>> Digging a bit into the code, it appears that the cq allocation here returns a ENOENT which come from mlx5_vector2eqn.
>> On my system (VM with a mlx5 card with SRIOV), the comp_eqs_list only contains one entry with index == 0
>>
>> Nicolas
>>
>
> Also, adding a 2nd core to my VM fixes the issue (thus my understanding that it was a CPU number)

Fair enough. The 2nd CPU adds a 2nd compvec.

Instead of num_online_cpus() you want ib_device::num_comp_vectors.

I suspect there's a spiffier way to go about this these days
thanks to ib_get_vector_affinity, but you've found a longstanding
bug. So let's get something that can be comfortably backported to
stable.

--
Chuck Lever

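One way to read the suggestion above, as a user-space sketch only: base the choice on the number of completion vectors the device reports rather than on the CPU count. struct fake_ib_device and send_cq_comp_vector() are hypothetical names used for illustration; only the num_comp_vectors field name comes from the discussion, and this is not presented as the patch that was eventually merged.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the one struct ib_device field used here. */
struct fake_ib_device {
	int num_comp_vectors;	/* completion vectors exposed by the device */
};

/*
 * Hypothetical helper: keep using completion vector 1 for the Send CQ
 * when the device has more than one vector, otherwise fall back to
 * vector 0, which every device with at least one vector provides.
 */
int send_cq_comp_vector(const struct fake_ib_device *dev)
{
	assert(dev != NULL);

	return dev->num_comp_vectors > 1 ? 1 : 0;
}
```

In the kernel, the returned value would take the place of the hard-coded 1 passed as the fourth argument of ib_alloc_cq(). A device reporting a single completion vector, as in the SR-IOV VM above, would then get vector 0 regardless of how many CPUs are online.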
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index b725911c0f3f..36aa7b2648e4 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -546,7 +546,7 @@ rpcrdma_ep_create(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
 
 	sendcq = ib_alloc_cq(ia->ri_device, NULL,
 			     ep->rep_attr.cap.max_send_wr + 1,
-			     1, IB_POLL_WORKQUEUE);
+			     num_online_cpus() > 1 ? 1 : 0, IB_POLL_WORKQUEUE);
 	if (IS_ERR(sendcq)) {
 		rc = PTR_ERR(sendcq);
 		dprintk("RPC: %s: failed to create send CQ: %i\n",

Make sure host has at least 2 CPU before allocating to CPU#1

Fixes: a4699f5647f3 (xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode)
Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
---
 net/sunrpc/xprtrdma/verbs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)