Message ID | 20190528181018.19012.61210.stgit@manet.1015granger.net
---|---
Series | for-5.3 NFS/RDMA patches for review
On Tue, May 28, 2019 at 02:20:50PM -0400, Chuck Lever wrote:
> This is a series of fixes and architectural changes that should
> improve robustness and result in better scalability of NFS/RDMA.
> I'm sure one or two of these could be broken down a little more,
> comments welcome.

Just curious, do you have any performance numbers?
> On May 29, 2019, at 2:40 AM, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, May 28, 2019 at 02:20:50PM -0400, Chuck Lever wrote:
>> This is a series of fixes and architectural changes that should
>> improve robustness and result in better scalability of NFS/RDMA.
>> I'm sure one or two of these could be broken down a little more,
>> comments welcome.
>
> Just curious, do you have any performance numbers?

To watch for performance regressions and improvements, I regularly run
several variations of iozone, a fio 70/30 mix, and multi-threaded
software builds. I did not note any change in throughput after applying
these patches. I don't have a precise way to measure context switch
rate during these tests.

Below is a typical result for the fio 8KB random 70/30 mix on FDR
InfiniBand with a NUMA client. Not impressive compared to NVMe, I know,
but much better than NFS/TCP. On a single-socket client, the IOPS
numbers more than double.

Jobs: 12 (f=12): [m(12)][100.0%][r=897MiB/s,w=386MiB/s][r=115k,w=49.5k IOPS][eta 00m:00s]
8k7030test: (groupid=0, jobs=12): err= 0: pid=2107: Fri May 24 15:22:38 2019
  read: IOPS=115k, BW=897MiB/s (941MB/s)(8603MiB/9588msec)
    slat (usec): min=2, max=6203, avg= 7.02, stdev=27.49
    clat (usec): min=33, max=13553, avg=1131.12, stdev=536.34
     lat (usec): min=47, max=13557, avg=1138.37, stdev=537.11
    clat percentiles (usec):
     |  1.00th=[  338],  5.00th=[  515], 10.00th=[  619], 20.00th=[  750],
     | 30.00th=[  857], 40.00th=[  955], 50.00th=[ 1057], 60.00th=[ 1156],
     | 70.00th=[ 1270], 80.00th=[ 1434], 90.00th=[ 1696], 95.00th=[ 1926],
     | 99.00th=[ 2966], 99.50th=[ 3785], 99.90th=[ 5866], 99.95th=[ 6652],
     | 99.99th=[ 8586]
   bw (  KiB/s): min=64624, max=82800, per=8.34%, avg=76631.87, stdev=2877.97, samples=227
   iops        : min= 8078, max=10350, avg=9578.91, stdev=359.76, samples=227
  write: IOPS=49.2k, BW=384MiB/s (403MB/s)(3685MiB/9588msec)
    slat (usec): min=3, max=7226, avg= 7.54, stdev=29.53
    clat (usec): min=64, max=14828, avg=1210.36, stdev=584.82
     lat (usec): min=78, max=14834, avg=1218.11, stdev=585.77
    clat percentiles (usec):
     |  1.00th=[  359],  5.00th=[  545], 10.00th=[  652], 20.00th=[  791],
     | 30.00th=[  906], 40.00th=[ 1004], 50.00th=[ 1106], 60.00th=[ 1221],
     | 70.00th=[ 1369], 80.00th=[ 1549], 90.00th=[ 1844], 95.00th=[ 2147],
     | 99.00th=[ 3163], 99.50th=[ 4015], 99.90th=[ 6194], 99.95th=[ 7308],
     | 99.99th=[ 9372]
   bw (  KiB/s): min=27520, max=36128, per=8.34%, avg=32816.45, stdev=1323.08, samples=227
   iops        : min= 3440, max= 4516, avg=4101.97, stdev=165.38, samples=227
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.31%, 500=3.91%, 750=14.66%
  lat (usec)   : 1000=24.41%
  lat (msec)   : 2=51.69%, 4=4.57%, 10=0.44%, 20=0.01%
  cpu          : usr=3.24%, sys=8.11%, ctx=786935, majf=0, minf=117
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=1101195,471669,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=897MiB/s (941MB/s), 897MiB/s-897MiB/s (941MB/s-941MB/s), io=8603MiB (9021MB), run=9588-9588msec
  WRITE: bw=384MiB/s (403MB/s), 384MiB/s-384MiB/s (403MB/s-403MB/s), io=3685MiB (3864MB), run=9588-9588msec

--
Chuck Lever
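As a rough reproduction guide, the run above is consistent with a fio
invocation along these lines. This is a sketch inferred from the report
(numjobs=12, bs=8k, a 70/30 read/write mix, and iodepth=16 all appear in
the output); the target directory, ioengine, file size, and runtime are
assumptions, not Chuck's actual job file:

    # Hypothetical invocation approximating the 8KB random 70/30 run above;
    # the directory, ioengine, size, and runtime are assumed.
    fio --name=8k7030test --directory=/mnt/nfsrdma \
        --rw=randrw --rwmixread=70 --bs=8k \
        --ioengine=libaio --direct=1 --iodepth=16 --numjobs=12 \
        --size=1g --time_based --runtime=10 --group_reporting

Note that fio's own cpu line gives at least a coarse context-switch
count for the run (ctx=786935 above), even if it cannot isolate the
transport's share of those switches.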
On 5/28/2019 2:20 PM, Chuck Lever wrote:
> This is a series of fixes and architectural changes that should
> improve robustness and result in better scalability of NFS/RDMA.
> I'm sure one or two of these could be broken down a little more,
> comments welcome.
>
> The fundamental observation is that the RPC work queues are BOUND,
> thus rescheduling work in the Receive completion handler to one of
> these work queues just forces it to run later on the same CPU. So
> try to do more work right in the Receive completion handler to
> reduce context switch overhead.
>
> A secondary concern is that the average amount of wall-clock time
> it takes to handle a single Receive completion caps the IOPS rate
> (both per-xprt and per-NIC). In this patch series I've taken a few
> steps to reduce that latency, and I'm looking into a few others.
>
> This series can be fetched from:
>
>   git://git.linux-nfs.org/projects/cel/cel-2.6.git
>
> in topic branch "nfs-for-5.3".
>
> ---
>
> Chuck Lever (12):
>       xprtrdma: Fix use-after-free in rpcrdma_post_recvs
>       xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_req
>       xprtrdma: Fix occasional transport deadlock
>       xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag
>       xprtrdma: Remove fr_state
>       xprtrdma: Add mechanism to place MRs back on the free list
>       xprtrdma: Reduce context switching due to Local Invalidation
>       xprtrdma: Wake RPCs directly in rpcrdma_wc_send path
>       xprtrdma: Simplify rpcrdma_rep_create
>       xprtrdma: Streamline rpcrdma_post_recvs
>       xprtrdma: Refactor chunk encoding
>       xprtrdma: Remove rpcrdma_req::rl_buffer
>
>  include/trace/events/rpcrdma.h  |  47 ++++--
>  net/sunrpc/xprtrdma/frwr_ops.c  | 330 ++++++++++++++++++++++++++-------------
>  net/sunrpc/xprtrdma/rpc_rdma.c  | 146 +++++++----------
>  net/sunrpc/xprtrdma/transport.c |  16 +-
>  net/sunrpc/xprtrdma/verbs.c     | 115 ++++++--------
>  net/sunrpc/xprtrdma/xprt_rdma.h |  43 +----
>  6 files changed, 384 insertions(+), 313 deletions(-)

For hfi1:

Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
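The BOUND observation in the cover letter is the crux: a workqueue
allocated without WQ_UNBOUND executes work items on the CPU that queued
them, so bouncing Receive processing from the completion handler into
such a queue cannot spread load across CPUs; it only adds a scheduler
round trip. A minimal illustrative sketch of that pattern, using
hypothetical names (this is not the actual xprtrdma code):

    #include <linux/workqueue.h>
    #include <rdma/ib_verbs.h>

    /* Allocated without WQ_UNBOUND, e.g. alloc_workqueue("xprt_rcv", 0, 0),
     * so the queue is BOUND: work queued from CPU N runs later on CPU N. */
    static struct workqueue_struct *xprt_wq;

    static void deferred_receive(struct work_struct *work)
    {
            /* Executes on the same CPU that ran the completion handler,
             * but only after a context switch into a kworker. */
    }

    static DECLARE_WORK(recv_work, deferred_receive);

    /* Receive completion handler, typically run in softirq context. */
    static void wc_receive(struct ib_cq *cq, struct ib_wc *wc)
    {
            /* Deferring buys no parallelism: the work still lands on
             * this CPU and pays a context switch to get there. Hence
             * the series does more of the work right here instead. */
            queue_work(xprt_wq, &recv_work);
    }

The secondary concern yields to simple arithmetic: if Receive
completions for a transport are handled serially and each takes t
microseconds of wall-clock time, that transport cannot exceed 10^6/t
IOPS. At 5 us per completion, for example, the ceiling is 200,000 IOPS
no matter how fast the link is.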
> On May 31, 2019, at 10:32 AM, Dennis Dalessandro <dennis.dalessandro@intel.com> wrote:
>
> On 5/28/2019 2:20 PM, Chuck Lever wrote:
>> [cover letter and patch list snipped]
>
> For hfi1:
> Tested-by: Dennis Dalessandro <dennis.dalessandro@intel.com>

Thanks!

--
Chuck Lever