
[v1,03/14] svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT

Message ID 20170316155250.4482.49638.stgit@klimt.1015granger.net (mailing list archive)
State New, archived

Commit Message

Chuck Lever March 16, 2017, 3:52 p.m. UTC
The Send Queue depth is temporarily reduced to 1 SQE per credit. The
new rdma_rw API does an internal computation, during QP creation, to
increase the depth of the Send Queue to handle RDMA Read and Write
operations.

This change has to come before the NFSD code paths are updated to
use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
the size of the SQ too much, resulting in memory allocation failures
during QP creation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc_rdma.h          |    1 -
 net/sunrpc/xprtrdma/svc_rdma.c           |    2 --
 net/sunrpc/xprtrdma/svc_rdma_transport.c |    2 +-
 3 files changed, 1 insertion(+), 4 deletions(-)
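
For context, the internal computation mentioned above looks roughly like
this (a simplified sketch of rdma_rw_init_qp() from
drivers/infiniband/core/rw.c; the exact factors are the core's business,
and rdma_rw_can_use_mr() is an internal helper there):

/* Sketch: how rdma_rw_init_qp() grows the SQ during QP creation */
static void sketch_rdma_rw_init_qp(struct ib_device *dev,
				   struct ib_qp_init_attr *attr)
{
	/* Each rdma_rw context posts at least one READ or WRITE WR */
	unsigned int factor = 1;

	/* Devices that need MRs for RDMA READ/WRITE also post a
	 * REG_MR and an invalidate WR per context.
	 */
	if (rdma_rw_can_use_mr(dev, attr->port_num))
		factor += 2;	/* reg + inv */

	attr->cap.max_send_wr += factor * attr->cap.max_rdma_ctxs;

	/* Clamp to what the device actually supports */
	attr->cap.max_send_wr = min_t(u32, attr->cap.max_send_wr,
				      dev->attrs.max_qp_wr);
}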



Comments

Sagi Grimberg March 21, 2017, 5:58 p.m. UTC | #1
> The Send Queue depth is temporarily reduced to 1 SQE per credit. The
> new rdma_rw API does an internal computation, during QP creation, to
> increase the depth of the Send Queue to handle RDMA Read and Write
> operations.
>
> This change has to come before the NFSD code paths are updated to
> use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
> the size of the SQ too much, resulting in memory allocation failures
> during QP creation.

I agree this needs to happen, but it turns out you don't have any
guarantee of the maximum SQ size, depending on your max_sge parameter.
I'd recommend implementing a fall-back allocation with a shrunken SQ,
as srpt does.

We don't have it in nvmet-rdma or iser, but it's a good thing to have...
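
The srpt-style fallback boils down to retrying QP creation with a
shallower SQ (a minimal sketch, modeled on what srpt_create_ch_ib()
does; the function name and the floor parameter are illustrative):

/* Retry QP creation with a halved SQ depth until a floor is hit */
static struct ib_qp *create_qp_with_fallback(struct ib_pd *pd,
					     struct ib_qp_init_attr *attr,
					     u32 min_send_wr)
{
	struct ib_qp *qp;

	do {
		qp = ib_create_qp(pd, attr);
		if (!IS_ERR(qp) || PTR_ERR(qp) != -ENOMEM)
			break;
		attr->cap.max_send_wr /= 2;	/* try a shallower SQ */
	} while (attr->cap.max_send_wr >= min_send_wr);

	return qp;
}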
Chuck Lever March 21, 2017, 6:44 p.m. UTC | #2
> On Mar 21, 2017, at 1:58 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> The Send Queue depth is temporarily reduced to 1 SQE per credit. The
>> new rdma_rw API does an internal computation, during QP creation, to
>> increase the depth of the Send Queue to handle RDMA Read and Write
>> operations.
>> 
>> This change has to come before the NFSD code paths are updated to
>> use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
>> the size of the SQ too much, resulting in memory allocation failures
>> during QP creation.
> 
> I agree this needs to happen, but it turns out you don't have any
> guarantee of the maximum SQ size, depending on your max_sge parameter.

That's true. However, this is meant to be temporary while I'm
working out details of the rdma_rw API conversion. More work
in this area comes in the next series:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-rw-api


> I'd recommend implementing a fall-back allocation with a shrunken SQ,
> as srpt does.

Agree it should be done. Would it be OK to wait until the dust
settles here, or do you think it's a hard requirement for
accepting this series?


> We don't have it in nvmet-rdma or iser, but it's a good thing to have...


--
Chuck Lever



Sagi Grimberg March 22, 2017, 1:09 p.m. UTC | #3
>> I agree this needs to happen, but it turns out you don't have any
>> guarantee of the maximum SQ size, depending on your max_sge parameter.
>
> That's true. However, this is meant to be temporary while I'm
> working out details of the rdma_rw API conversion. More work
> in this area comes in the next series:
>
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-rw-api

Thanks for the pointer...

>> I'd recommend implementing a fall-back allocation with a shrunken SQ,
>> as srpt does.
>
> Agree it should be done. Would it be OK to wait until the dust
> settles here, or do you think it's a hard requirement for
> accepting this series?

It isn't, and it can definitely be added incrementally...
Chuck Lever March 22, 2017, 1:36 p.m. UTC | #4
> On Mar 22, 2017, at 9:09 AM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>>> I agree this needs to happen, but it turns out you don't have any
>>> guarantee of the maximum SQ size, depending on your max_sge parameter.
>> 
>> That's true. However, this is meant to be temporary while I'm
>> working out details of the rdma_rw API conversion. More work
>> in this area comes in the next series:
>> 
>> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-rw-api
> 
> Thanks for the pointer...
> 
>>> I'd recommend implementing a fall-back allocation with a shrunken SQ,
>>> as srpt does.
>> 
>> Agree it should be done. Would it be OK to wait until the dust
>> settles here, or do you think it's a hard requirement for
>> accepting this series?
> 
> It isn't, and it can definitely be added incrementally...

Roughly speaking, I think there needs to be an rdma_rw API that
assists the ULP with setting its CQ and SQ sizes, since rdma_rw
hides the choice of registration mode (at least one of those modes
consumes more SQEs than the others).

I'd like to introduce one new function call that surfaces the
factor used to compute how many additional SQEs rdma_rw will
need. The ULP will invoke it before allocating new Send CQs.

I'll try to provide an RFC in the nfsd-rdma-rw-api topic branch.
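
For illustration, such a call might look like this (names here are
hypothetical; a helper along these lines later landed upstream as
rdma_rw_mr_factor()):

/* How many rdma_rw contexts (and extra SQEs) to budget per maxpages
 * of payload, depending on whether the device needs MRs for R/W.
 */
static unsigned int sketch_rw_mr_factor(struct ib_device *dev,
					u8 port_num, unsigned int maxpages)
{
	unsigned int pages_per_ctx;

	if (rdma_rw_can_use_mr(dev, port_num))
		/* FRWR: bounded by the device's page list depth */
		pages_per_ctx = dev->attrs.max_fast_reg_page_list_len;
	else
		/* No MRs: bounded by SGEs per RDMA READ WR */
		pages_per_ctx = dev->attrs.max_sge_rd;

	return DIV_ROUND_UP(maxpages, pages_per_ctx);
}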


--
Chuck Lever



Sagi Grimberg March 22, 2017, 7:06 p.m. UTC | #5
> Roughly speaking, I think there needs to be an rdma_rw API that
> assists the ULP with setting its CQ and SQ sizes, since rdma_rw
> hides the choice of registration mode (at least one of those modes
> consumes more SQEs than the others).

Hiding the registration mode was largely the motivation for
this... It buys us a simplified implementation and inherently supports
both IB and iWARP (which was annoying, existed only in svcrdma, and
even there was suboptimal).

> I'd like to introduce one new function call that surfaces the
> factor used to compute how many additional SQEs rdma_rw will
> need. The ULP will invoke it before allocating new Send CQs.

I see your point... We should probably get a sense of how to
size the completion queue. I think this issue is solved by
the CQ pool API that Christoph sent a while ago but that was
never pursued.

The basic idea is that the core would create a pool of long-lived
CQs and then assign queue pairs to them depending on the SQ+RQ depth.
If we were to pick it up, would you consider using it?
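
ULP-side usage might look something like this (a sketch; the signatures
follow the ib_cq_pool_get()/ib_cq_pool_put() interface that eventually
merged upstream, but treat them as illustrative here):

/* Ask the core for a pooled CQ sized for both send and receive
 * completions; -1 lets the core pick a completion vector.
 */
static struct ib_cq *sketch_get_pooled_cq(struct ib_device *dev,
					  u32 sq_depth, u32 rq_depth)
{
	return ib_cq_pool_get(dev, sq_depth + rq_depth, -1,
			      IB_POLL_WORKQUEUE);
}

/* on teardown: ib_cq_pool_put(cq, sq_depth + rq_depth); */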

> I'll try to provide an RFC in the nfsd-rdma-rw-api topic branch.

Cool, let's see what you had in mind...
Chuck Lever March 22, 2017, 7:30 p.m. UTC | #6
> On Mar 22, 2017, at 3:06 PM, Sagi Grimberg <sagi@grimberg.me> wrote:
> 
> 
>> Roughly speaking, I think there needs to be an rdma_rw API that
>> assists the ULP with setting its CQ and SQ sizes, since rdma_rw
>> hides the choice of registration mode (at least one of those modes
>> consumes more SQEs than the others).
> 
> Hiding the registration mode was largely the motivation for
> this... It buys us a simplified implementation and inherently supports
> both IB and iWARP (which was annoying, existed only in svcrdma, and
> even there was suboptimal).
> 
>> I'd like to introduce one new function call that surfaces the
>> factor used to compute how many additional SQEs rdma_rw will
>> need. The ULP will invoke it before allocating new Send CQs.
> 
> I see your point... We should probably get a sense of how to
> size the completion queue. I think this issue is solved by
> the CQ pool API that Christoph sent a while ago but that was
> never pursued.
>
> The basic idea is that the core would create a pool of long-lived
> CQs and then assign queue pairs to them depending on the SQ+RQ depth.
> If we were to pick it up, would you consider using it?

I will certainly take a look at it. But I don't think that's
enough.

The ULP is also responsible for managing send queue accounting,
and possibly queuing WRs when a send queue is full. So it still
needs to know the maximum number of send WRs that can be posted
at one time. For svc_rdma, this is sc_sq_avail.
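
A minimal sketch of that accounting (svc_rdma keeps this state in
sc_sq_avail and sc_send_wait; the send completion handler is assumed
to atomic_inc() the counter and wake_up() the waitqueue):

/* Reserve one SQE before posting; sleep if the SQ is full */
static int sketch_post_send(struct svcxprt_rdma *rdma,
			    struct ib_send_wr *wr)
{
	struct ib_send_wr *bad_wr;

	while (atomic_dec_return(&rdma->sc_sq_avail) < 0) {
		atomic_inc(&rdma->sc_sq_avail);	/* give it back */
		wait_event(rdma->sc_send_wait,
			   atomic_read(&rdma->sc_sq_avail) > 0);
	}
	return ib_post_send(rdma->sc_qp, wr, &bad_wr);
}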

I believe that the ULP needs to know the actual number of SQEs,
both for determining CQ size and for knowing when to plug the
send queue.

This maximum depends on the registration mode, the page list
depth capability of the HCA (relative to the maximum ULP data
payload size), and the page size of the platform.

For example, for NFS, the typical maximum rsize and wsize is 1MB.
The CX-3 Pro cards I have allow 511 pages per MR in FRWR mode.
My systems are x64 using 4KB pages.

So I know that one rdma_rw_ctx can handle 256 pages (or 1MB) of
payload on my system.

An HCA with a smaller page list depth, a system with larger pages,
or an rsize/wsize of 4MB might need a different number of MRs for
the same transport, and thus a larger send queue.
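
In code form, the arithmetic above is just (numbers from my setup,
purely illustrative):

/* rdma_rw contexts needed per RPC for a given payload size */
static unsigned int ctxs_per_rpc(unsigned int max_payload_bytes,
				 unsigned int pages_per_mr)
{
	unsigned int maxpages = max_payload_bytes >> PAGE_SHIFT;

	return DIV_ROUND_UP(maxpages, pages_per_mr);
}

/* 1MB rsize, 4KB pages, 511 pages per MR: ctxs_per_rpc(1 << 20, 511) == 1
 * 4MB rsize, same MR depth:               ctxs_per_rpc(4 << 20, 511) == 3
 */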

Alternatively, we could set a fixed, arbitrary send queue size and
force all ULPs and devices to live with that. That would be much
simpler.


>> I'll try to provide an RFC in the nfsd-rdma-rw-api topic branch.
> 
> Cool, let's see what you had in mind...


--
Chuck Lever




Patch

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index ac05495..f066349 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -182,7 +182,6 @@  struct svcxprt_rdma {
 /* The default ORD value is based on two outstanding full-size writes with a
  * page size of 4k, or 32k * 2 ops / 4k = 16 outstanding RDMA_READ.  */
 #define RPCRDMA_ORD             (64/4)
-#define RPCRDMA_SQ_DEPTH_MULT   8
 #define RPCRDMA_MAX_REQUESTS    32
 #define RPCRDMA_MAX_REQ_SIZE    4096
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index c846ca9..9124441 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -247,8 +247,6 @@  int svc_rdma_init(void)
 	dprintk("SVCRDMA Module Init, register RPC RDMA transport\n");
 	dprintk("\tsvcrdma_ord      : %d\n", svcrdma_ord);
 	dprintk("\tmax_requests     : %u\n", svcrdma_max_requests);
-	dprintk("\tsq_depth         : %u\n",
-		svcrdma_max_requests * RPCRDMA_SQ_DEPTH_MULT);
 	dprintk("\tmax_bc_requests  : %u\n", svcrdma_max_bc_requests);
 	dprintk("\tmax_inline       : %d\n", svcrdma_max_req_size);
 
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index c13a5c3..b84cd53 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1013,7 +1013,7 @@  static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
 					    svcrdma_max_bc_requests);
 	newxprt->sc_rq_depth = newxprt->sc_max_requests +
 			       newxprt->sc_max_bc_requests;
-	newxprt->sc_sq_depth = RPCRDMA_SQ_DEPTH_MULT * newxprt->sc_rq_depth;
+	newxprt->sc_sq_depth = newxprt->sc_rq_depth;
 	atomic_set(&newxprt->sc_sq_avail, newxprt->sc_sq_depth);
 
 	if (!svc_rdma_prealloc_ctxts(newxprt))