[for-next,0/1] RDMA-rxe: Allow retry sends for rdma read responses

Message ID 20230215224419.9195-1-rpearsonhpe@gmail.com

Bob Pearson Feb. 15, 2023, 10:44 p.m. UTC
If the rxe driver generates packets faster than the IP code can
process them, the IP code starts dropping packets and ip_local_out()
returns NET_XMIT_DROP. The requester side of the driver detects this
and retries the packet. The responder side does not; instead the
requester recovers by waiting out a retry timer delay and resubmitting
the read operation from the last packet it received. This can and does
occur for large (multi-MB) RDMA read responses and causes a steep
drop-off in performance.

This patch modifies read_reply() in rxe_resp.c to retry the send if
err == -EAGAIN. When IP does drop a packet it needs more time to
recover than a simple retry takes, so a subroutine read_retry_delay()
is added that dynamically estimates the time required for this
recovery and inserts a delay before the retry.
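
Purely as an illustration of this retry pattern, here is a minimal
user-space sketch; xmit() and recovery_delay_us are hypothetical
stand-ins invented for this example, not the names or the code in the
patch:

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Toy stand-in for the send path: pretend the stack drops the first
 * try of every fourth packet and succeeds on the retry. */
static int xmit(int seq, int attempt)
{
        return (seq % 4 == 3 && attempt == 0) ? -EAGAIN : 0;
}

int main(void)
{
        const useconds_t recovery_delay_us = 50;

        for (int seq = 0; seq < 8; seq++) {
                int attempt = 0;

                while (xmit(seq, attempt) == -EAGAIN) {
                        /* The IP stack needs time to recover from the
                         * drop; delay before resending rather than
                         * letting the requester hit its retry timer. */
                        usleep(recovery_delay_us);
                        attempt++;
                }
                printf("packet %d sent, retries %d\n", seq, attempt);
        }
        return 0;
}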

With this patch applied the performance of large reads is very
stable. For example, with a 1Gb/sec (112.5 MB/sec) Ethernet link
between two systems, without this patch ib_read_bw shows the
following performance:

                    RDMA_Read BW Test
 Dual-port       : OFF		Device         : rxe0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 2
 Outstand reads  : 128
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
 <snip>
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000             0.56               0.56   		   0.294363
 4          1000             0.66               0.66   		   0.173862
 8          1000             1.32               1.20   		   0.157406
 16         1000             2.66               2.40   		   0.157357
 32         1000             5.54               5.46   		   0.179006
 64         1000             18.22              16.94  		   0.277533
 128        1000             21.61              20.91  		   0.171322
 256        1000             44.02              38.90  		   0.159316
 512        1000             70.39              64.86  		   0.132843
 1024       1000             106.50             100.49 		   0.102904
 2048       1000             106.46             105.29 		   0.053908
 4096       1000             107.85             107.85 		   0.027609
 8192       1000             109.09             109.09 		   0.013963
 16384      1000             110.17             110.17 		   0.007051
 32768      1000             110.27             110.27 		   0.003529
 65536      1000             110.33             110.33 		   0.001765
 131072     1000             110.35             110.35 		   0.000883
 262144     1000             110.36             110.36 		   0.000441
 524288     1000             110.37             110.36 		   0.000221
 1048576    1000             110.37             110.37 		   0.000110
 2097152    1000             24.19              24.10  		   0.000012
 4194304    1000             18.70              18.65  		   0.000005
 8388608    1000             18.09              17.82  		   0.000002

No NET_XMIT_DROP returns are seen up to 1MiB, but at 2MiB and above
they occur constantly.

With the patch applied, ib_read_bw shows the following performance:

                    RDMA_Read BW Test
 Dual-port       : OFF		Device         : rxe0
 Number of qps   : 1		Transport type : IB
 Connection type : RC		Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 2
 Outstand reads  : 128
 rdma_cm QPs	 : OFF
 Data ex. method : Ethernet
 <snip>
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 2          1000             0.34               0.33   		   0.175541
 4          1000             0.69               0.68   		   0.179279
 8          1000             2.02               1.75   		   0.229972
 16         1000             2.72               2.63   		   0.172632
 32         1000             5.42               4.94   		   0.161824
 64         1000             10.63              9.67   		   0.158487
 128        1000             31.06              28.11  		   0.230288
 256        1000             40.48              36.75  		   0.150543
 512        1000             70.00              66.00  		   0.135164
 1024       1000             94.43              89.26  		   0.091402
 2048       1000             106.38             104.34 		   0.053424
 4096       1000             109.48             109.16 		   0.027946
 8192       1000             108.96             108.96 		   0.013946
 16384      1000             110.18             110.18 		   0.007052
 32768      1000             110.28             110.28 		   0.003529
 65536      1000             110.33             110.33 		   0.001765
 131072     1000             110.35             110.35 		   0.000883
 262144     1000             110.36             110.35 		   0.000441
 524288     1000             110.35             110.31 		   0.000221
 1048576    1000             110.37             110.37 		   0.000110
 2097152    1000             110.37             110.37 		   0.000055
 4194304    1000             110.37             110.36 		   0.000028
 8388608    1000             110.37             110.37 		   0.000014

The delay algorithm computes approximately 50 usecs as the correct delay
to insert before retrying a read_reply() send.
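
The cover letter does not spell the estimator out, but one plausible
shape for such a dynamic estimate, given here purely as an
illustration (the patch's actual read_retry_delay() may differ), is
to grow the delay while drops persist and decay it as sends succeed:

#include <stdio.h>

/* Illustrative only: not the actual read_retry_delay() algorithm. */
struct retry_est {
        unsigned int delay_us;  /* current estimate */
        unsigned int min_us;    /* floor */
        unsigned int max_us;    /* ceiling */
};

static void on_retry_failed(struct retry_est *e)
{
        /* Still dropping: the stack needs longer to recover. */
        e->delay_us *= 2;
        if (e->delay_us > e->max_us)
                e->delay_us = e->max_us;
}

static void on_send_ok(struct retry_est *e)
{
        /* Flowing again: decay an eighth at a time toward the floor. */
        e->delay_us -= e->delay_us / 8;
        if (e->delay_us < e->min_us)
                e->delay_us = e->min_us;
}

int main(void)
{
        struct retry_est e = { .delay_us = 10, .min_us = 10,
                               .max_us = 1000 };

        for (int i = 0; i < 3; i++)     /* three consecutive drops */
                on_retry_failed(&e);
        printf("after drops:    %u us\n", e.delay_us);  /* 80 us */

        for (int i = 0; i < 10; i++)    /* then ten clean sends */
                on_send_ok(&e);
        printf("after recovery: %u us\n", e.delay_us);  /* 24 us */
        return 0;
}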

Bob Pearson (1):
  RDMA-rxe: Allow retry sends for rdma read responses

 drivers/infiniband/sw/rxe/rxe_resp.c  | 62 +++++++++++++++++++++++++--
 drivers/infiniband/sw/rxe/rxe_verbs.h |  9 ++++
 2 files changed, 68 insertions(+), 3 deletions(-)


base-commit: 91d088a0304941b88c915cc800617ff4068cdd39