Message ID | 168607282316.2076.15999420482159168604.stgit@manet.1015granger.net |
---|---|
State | Superseded |
Series | svcrdma: Go back to releasing pages-under-I/O |
On 6/6/2023 1:33 PM, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
>
> Now that we have bulk page allocation and release APIs, it's more
> efficient to use those than it is for nfsd threads to wait for send
> completions. Previous patches have eliminated the calls to
> wait_for_completion() and complete(), in order to avoid scheduler
> overhead.

Makes sense to me. Have you considered changing the send cq arming
to avoid completion notifications and simply poll them from the cq
at future convenient/efficient times?

Reviewed-by: Tom Talpey <tom@talpey.com>

Tom.

> Now release pages-under-I/O in the send completion handler using
> the efficient bulk release API.
>
> I've measured a 7% reduction in cumulative CPU utilization in
> svc_rdma_sendto(), svc_rdma_wc_send(), and svc_xprt_release(). In
> particular, using release_pages() instead of complete() cuts the
> time per svc_rdma_wc_send() call by two-thirds. This helps improve
> scalability because svc_rdma_wc_send() is single-threaded per
> connection.
>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 1ae4236d04a3..24228f3611e8 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -236,8 +236,8 @@ void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
> 	struct ib_device *device = rdma->sc_cm_id->device;
> 	unsigned int i;
>
> -	for (i = 0; i < ctxt->sc_page_count; ++i)
> -		put_page(ctxt->sc_pages[i]);
> +	if (ctxt->sc_page_count)
> +		release_pages(ctxt->sc_pages, ctxt->sc_page_count);
>
> 	/* The first SGE contains the transport header, which
> 	 * remains mapped until @ctxt is destroyed.
> On Jun 9, 2023, at 4:31 PM, Tom Talpey <tom@talpey.com> wrote:
>
> On 6/6/2023 1:33 PM, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever@oracle.com>
>>
>> Now that we have bulk page allocation and release APIs, it's more
>> efficient to use those than it is for nfsd threads to wait for send
>> completions. Previous patches have eliminated the calls to
>> wait_for_completion() and complete(), in order to avoid scheduler
>> overhead.
>
> Makes sense to me. Have you considered changing the send cq arming
> to avoid completion notifications and simply poll them from the cq
> at future convenient/efficient times?

That doesn't work for Read completions: those need to wake a new
nfsd as soon as they complete.

Send and Write completions don't need timely handling unless the
SQ is out of SQEs. It would be nice to move that processing out
of the CQ handler.

Otherwise, no, I haven't considered leaving the Send CQ disarmed.
I assumed that, at least on mlx5 hardware, there is some interrupt
mitigation that helps in this regard.

> Reviewed-by: Tom Talpey <tom@talpey.com>
>
> Tom.

--
Chuck Lever
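For readers following the CQ-arming exchange, here is a minimal sketch of the generic kernel verbs pattern being discussed: leave the CQ disarmed while draining it, then re-arm and re-poll if a completion raced in. This is an illustration only, with a made-up function name; it is not svcrdma code, which allocates its CQs through the kernel's shared ib_alloc_cq() framework rather than open-coding this loop.

#include <rdma/ib_verbs.h>

/* Illustrative only: drain all available send completions without
 * notifications, then request the next completion event. A positive
 * return from ib_req_notify_cq() with IB_CQ_REPORT_MISSED_EVENTS set
 * means completions may have arrived after the last poll, so drain
 * again before trusting the armed state.
 */
static void example_drain_and_rearm_send_cq(struct ib_cq *cq)
{
	struct ib_wc wc;

	do {
		while (ib_poll_cq(cq, 1, &wc) > 0) {
			/* handle wc.wr_cqe here, e.g. release the
			 * pages attached to the send context */
		}
	} while (ib_req_notify_cq(cq, IB_CQ_NEXT_COMP |
				      IB_CQ_REPORT_MISSED_EVENTS) > 0);
}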
On 6/9/2023 4:49 PM, Chuck Lever III wrote:
>> On Jun 9, 2023, at 4:31 PM, Tom Talpey <tom@talpey.com> wrote:
>>
>> On 6/6/2023 1:33 PM, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>
>>> Now that we have bulk page allocation and release APIs, it's more
>>> efficient to use those than it is for nfsd threads to wait for send
>>> completions. Previous patches have eliminated the calls to
>>> wait_for_completion() and complete(), in order to avoid scheduler
>>> overhead.
>>
>> Makes sense to me. Have you considered changing the send cq arming
>> to avoid completion notifications and simply poll them from the cq
>> at future convenient/efficient times?
>
> That doesn't work for Read completions: those need to wake a new
> nfsd as soon as they complete.
>
> Send and Write completions don't need timely handling unless the
> SQ is out of SQEs. It would be nice to move that processing out
> of the CQ handler.

Yes, of course. That's my point: the receive cq and send cq have very
different requirements, and even more so with this change.

> Otherwise, no, I haven't considered leaving the Send CQ disarmed.
> I assumed that, at least on mlx5 hardware, there is some interrupt
> mitigation that helps in this regard.

You definitely do not want interrupt mitigation; it will kill IOPS.
Say the hardware pauses 10us before delivering an interrupt, in case
another completion comes in shortly after. In a single-threaded
workload you now have a maximum throughput of 100K IOPS. Interrupt
management needs to be done from the upper layer, which is what cq
arming is all about.

Food for thought.

Tom.
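A quick back-of-the-envelope check of the IOPS ceiling cited above, as a standalone userspace sketch; the 10us moderation delay is the hypothetical figure from the message, not a measured value.

#include <stdio.h>

int main(void)
{
	const double moderation_delay_us = 10.0;	/* assumed hardware pause */
	const double max_iops = 1e6 / moderation_delay_us;

	/* A thread that waits for one completion interrupt per I/O
	 * cannot exceed 1/delay operations per second:
	 * 1,000,000 us / 10 us = 100,000.
	 */
	printf("max single-threaded IOPS: %.0f\n", max_iops);
	return 0;
}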
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 1ae4236d04a3..24228f3611e8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -236,8 +236,8 @@ void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
 	struct ib_device *device = rdma->sc_cm_id->device;
 	unsigned int i;
 
-	for (i = 0; i < ctxt->sc_page_count; ++i)
-		put_page(ctxt->sc_pages[i]);
+	if (ctxt->sc_page_count)
+		release_pages(ctxt->sc_pages, ctxt->sc_page_count);
 
 	/* The first SGE contains the transport header, which
 	 * remains mapped until @ctxt is destroyed.