Message ID | 20190528181018.19012.61210.stgit@manet.1015granger.net
---|---
Series | for-5.3 NFS/RDMA patches for review
On Tue, May 28, 2019 at 02:20:50PM -0400, Chuck Lever wrote:
> This is a series of fixes and architectural changes that should
> improve robustness and result in better scalability of NFS/RDMA.
> I'm sure one or two of these could be broken down a little more,
> comments welcome.

Just curious, do you have any performance numbers?
> On May 29, 2019, at 2:40 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, May 28, 2019 at 02:20:50PM -0400, Chuck Lever wrote:
>> This is a series of fixes and architectural changes that should
>> improve robustness and result in better scalability of NFS/RDMA.
>> I'm sure one or two of these could be broken down a little more,
>> comments welcome.
>
> Just curious, do you have any performance numbers?

To watch for performance regressions and improvements, I regularly run
several variations of iozone, a fio 70/30 mix, and multi-threaded
software builds. I did not note any change in throughput after applying
these patches. I don't have a precise way to measure context switch
rate during these tests.

Below is a typical result for the fio 8KB random 70/30 mix on FDR
InfiniBand with a NUMA client. Not impressive compared to NVMe, I know,
but much better than NFS/TCP. On a single-socket client, the IOPS
numbers more than double.

Jobs: 12 (f=12): [m(12)][100.0%][r=897MiB/s,w=386MiB/s][r=115k,w=49.5k IOPS][eta 00m:00s]
8k7030test: (groupid=0, jobs=12): err= 0: pid=2107: Fri May 24 15:22:38 2019
  read: IOPS=115k, BW=897MiB/s (941MB/s)(8603MiB/9588msec)
    slat (usec): min=2, max=6203, avg= 7.02, stdev=27.49
    clat (usec): min=33, max=13553, avg=1131.12, stdev=536.34
     lat (usec): min=47, max=13557, avg=1138.37, stdev=537.11
    clat percentiles (usec):
     |  1.00th=[  338],  5.00th=[  515], 10.00th=[  619], 20.00th=[  750],
     | 30.00th=[  857], 40.00th=[  955], 50.00th=[ 1057], 60.00th=[ 1156],
     | 70.00th=[ 1270], 80.00th=[ 1434], 90.00th=[ 1696], 95.00th=[ 1926],
     | 99.00th=[ 2966], 99.50th=[ 3785], 99.90th=[ 5866], 99.95th=[ 6652],
     | 99.99th=[ 8586]
   bw (  KiB/s): min=64624, max=82800, per=8.34%, avg=76631.87, stdev=2877.97, samples=227
   iops        : min= 8078, max=10350, avg=9578.91, stdev=359.76, samples=227
  write: IOPS=49.2k, BW=384MiB/s (403MB/s)(3685MiB/9588msec)
    slat (usec): min=3, max=7226, avg= 7.54, stdev=29.53
    clat (usec): min=64, max=14828, avg=1210.36, stdev=584.82
     lat (usec): min=78, max=14834, avg=1218.11, stdev=585.77
    clat percentiles (usec):
     |  1.00th=[  359],  5.00th=[  545], 10.00th=[  652], 20.00th=[  791],
     | 30.00th=[  906], 40.00th=[ 1004], 50.00th=[ 1106], 60.00th=[ 1221],
     | 70.00th=[ 1369], 80.00th=[ 1549], 90.00th=[ 1844], 95.00th=[ 2147],
     | 99.00th=[ 3163], 99.50th=[ 4015], 99.90th=[ 6194], 99.95th=[ 7308],
     | 99.99th=[ 9372]
   bw (  KiB/s): min=27520, max=36128, per=8.34%, avg=32816.45, stdev=1323.08, samples=227
   iops        : min= 3440, max= 4516, avg=4101.97, stdev=165.38, samples=227
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.31%, 500=3.91%, 750=14.66%
  lat (usec)   : 1000=24.41%
  lat (msec)   : 2=51.69%, 4=4.57%, 10=0.44%, 20=0.01%
  cpu          : usr=3.24%, sys=8.11%, ctx=786935, majf=0, minf=117
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=1101195,471669,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=897MiB/s (941MB/s), 897MiB/s-897MiB/s (941MB/s-941MB/s), io=8603MiB (9021MB), run=9588-9588msec
  WRITE: bw=384MiB/s (403MB/s), 384MiB/s-384MiB/s (403MB/s-403MB/s), io=3685MiB (3864MB), run=9588-9588msec

--
Chuck Lever
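As a rough reproduction guide, the run above is consistent with a fio
invocation along these lines. This is a sketch inferred from the report
(numjobs=12, bs=8k, a 70/30 read/write mix, and iodepth=16 all appear in
the output); the target directory, ioengine, file size, and runtime are
assumptions, not Chuck's actual job file:

    # Hypothetical invocation approximating the 8KB random 70/30 run above;
    # the directory, ioengine, size, and runtime are assumed.
    fio --name=8k7030test --directory=/mnt/nfsrdma \
        --rw=randrw --rwmixread=70 --bs=8k \
        --ioengine=libaio --direct=1 --iodepth=16 --numjobs=12 \
        --size=1g --time_based --runtime=10 --group_reporting

Note that fio's own cpu line gives at least a coarse context-switch
count for the run (ctx=786935 above), even if it cannot isolate the
transport's share of those switches.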
On 5/28/2019 2:20 PM, Chuck Lever wrote:
> This is a series of fixes and architectural changes that should
> improve robustness and result in better scalability of NFS/RDMA.
> I'm sure one or two of these could be broken down a little more,
> comments welcome.
>
> The fundamental observation is that the RPC work queues are BOUND,
> thus rescheduling work in the Receive completion handler to one of
> these work queues just forces it to run later on the same CPU. So
> try to do more work right in the Receive completion handler to
> reduce context switch overhead.
>
> A secondary concern is that the average amount of wall-clock time
> it takes to handle a single Receive completion caps the IOPS rate
> (both per-xprt and per-NIC). In this patch series I've taken a few
> steps to reduce that latency, and I'm looking into a few others.
>
> This series can be fetched from:
>
>   git://git.linux-nfs.org/projects/cel/cel-2.6.git
>
> in topic branch "nfs-for-5.3".
>
> ---
>
> Chuck Lever (12):
>       xprtrdma: Fix use-after-free in rpcrdma_post_recvs
>       xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_req
>       xprtrdma: Fix occasional transport deadlock
>       xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag
>       xprtrdma: Remove fr_state
>       xprtrdma: Add mechanism to place MRs back on the free list
>       xprtrdma: Reduce context switching due to Local Invalidation
>       xprtrdma: Wake RPCs directly in rpcrdma_wc_send path
>       xprtrdma: Simplify rpcrdma_rep_create
>       xprtrdma: Streamline rpcrdma_post_recvs
>       xprtrdma: Refactor chunk encoding
>       xprtrdma: Remove rpcrdma_req::rl_buffer
>
>  include/trace/events/rpcrdma.h  |  47 ++++--
>  net/sunrpc/xprtrdma/frwr_ops.c  | 330 ++++++++++++++++++++++++++-------------
>  net/sunrpc/xprtrdma/rpc_rdma.c  | 146 +++++++----------
>  net/sunrpc/xprtrdma/transport.c |  16 +-
>  net/sunrpc/xprtrdma/verbs.c     | 115 ++++++--------
>  net/sunrpc/xprtrdma/xprt_rdma.h |  43 +----
>  6 files changed, 384 insertions(+), 313 deletions(-)

For hfi1:

Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
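The BOUND observation in the cover letter is the crux: a workqueue
allocated without WQ_UNBOUND executes work items on the CPU that queued
them, so bouncing Receive processing from the completion handler into
such a queue cannot spread load across CPUs; it only adds a scheduler
round trip. A minimal illustrative sketch of that pattern, using
hypothetical names (this is not the actual xprtrdma code):

    #include <linux/workqueue.h>
    #include <rdma/ib_verbs.h>

    /* Allocated without WQ_UNBOUND, e.g. alloc_workqueue("xprt_rcv", 0, 0),
     * so the queue is BOUND: work queued from CPU N runs later on CPU N. */
    static struct workqueue_struct *xprt_wq;

    static void deferred_receive(struct work_struct *work)
    {
            /* Executes on the same CPU that ran the completion handler,
             * but only after a context switch into a kworker. */
    }

    static DECLARE_WORK(recv_work, deferred_receive);

    /* Receive completion handler, typically run in softirq context. */
    static void wc_receive(struct ib_cq *cq, struct ib_wc *wc)
    {
            /* Deferring buys no parallelism: the work still lands on
             * this CPU and pays a context switch to get there. Hence
             * the series does more of the work right here instead. */
            queue_work(xprt_wq, &recv_work);
    }

The secondary concern yields to simple arithmetic: if Receive
completions for a transport are handled serially and each takes t
microseconds of wall-clock time, that transport cannot exceed 10^6/t
IOPS. At 5 us per completion, for example, the ceiling is 200,000 IOPS
no matter how fast the link is.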
> On May 31, 2019, at 10:32 AM, Dennis Dalessandro <dennis.dalessandro@intel.com> wrote:
>
> On 5/28/2019 2:20 PM, Chuck Lever wrote:
>> [cover letter and patch list snipped]
>
> For hfi1:
> Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com>

Thanks!

--
Chuck Lever