
[RFC,net-next] net/smc: Introduce receive queue flow control support

Message ID 20220120065140.5385-1-guangguan.wang@linux.alibaba.com (mailing list archive)
State RFC
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 0 this patch: 0
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/checkpatch warning WARNING: line length exceeds 80 columns (14 warnings, for line lengths 82-93, 95 and 96)
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Guangguan Wang Jan. 20, 2022, 6:51 a.m. UTC
This implements rq flow control in the smc-r link layer. In the
previous version, QPs communicating without rq flow control may
hit RNR (receiver not ready) errors: the sq sends a message to
the remote qp, but the remote qp's rq has no valid rq entries
to receive it. In the RNR condition, the rdma transport layer
may retransmit the message again and again until the rq has
entries available, which may lower performance, especially
under heavy traffic. Using credits to do rq flow control can
avoid the occurrence of RNR.

Test environment:
- CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4.
- redis benchmark 6.2.3 and redis server 6.2.3.
- redis server: redis-server --save "" --appendonly no
  --protected-mode no --io-threads 7 --io-threads-do-reads yes
- redis client: redis-benchmark -h 192.168.26.36 -q -t set,get
  -P 1 --threads 7 -n 2000000 -c 200 -d 10

 Before:
 SET: 205229.23 requests per second, p50=0.799 msec
 GET: 212278.16 requests per second, p50=0.751 msec

 After:
 SET: 623674.69 requests per second, p50=0.303 msec
 GET: 688326.00 requests per second, p50=0.271 msec

The redis-benchmark test shows a more than 3X rps improvement
after the implementation of rq flow control.
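
As a quick check of the "more than 3X" figure, the ratio can be computed from the Before/After numbers above. This is only a small illustrative helper; the function name is hypothetical and not part of the patch.

```c
#include <assert.h>

/* Ratio of the "after" request rate to the "before" request rate.
 * Hypothetical helper, used only to check the benchmark numbers. */
double speedup(double before_rps, double after_rps)
{
	assert(before_rps > 0.0);
	return after_rps / before_rps;
}
```

Calling speedup(205229.23, 623674.69) for SET and speedup(212278.16, 688326.00) for GET gives ratios slightly above 3.0 and 3.2 respectively.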

Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
---
 net/smc/af_smc.c   | 12 ++++++
 net/smc/smc_cdc.c  | 10 ++++-
 net/smc/smc_cdc.h  |  3 +-
 net/smc/smc_clc.c  |  3 ++
 net/smc/smc_clc.h  |  3 +-
 net/smc/smc_core.h | 17 ++++++++-
 net/smc/smc_ib.c   |  6 ++-
 net/smc/smc_llc.c  | 92 +++++++++++++++++++++++++++++++++++++++++++++-
 net/smc/smc_llc.h  |  5 +++
 net/smc/smc_wr.c   | 30 ++++++++++++---
 net/smc/smc_wr.h   | 54 ++++++++++++++++++++++++++-
 11 files changed, 222 insertions(+), 13 deletions(-)

Comments

Leon Romanovsky Jan. 20, 2022, 8:24 a.m. UTC | #1
On Thu, Jan 20, 2022 at 02:51:40PM +0800, Guangguan Wang wrote:
> This implement rq flow control in smc-r link layer. QPs
> communicating without rq flow control, in the previous
> version, may result in RNR (receiver not ready) error, which
> means when sq sends a message to the remote qp, but the
> remote qp's rq has no valid rq entities to receive the message.
> In RNR condition, the rdma transport layer may retransmit
> the messages again and again until the rq has any entities,
> which may lower the performance, especially in heavy traffic.
> Using credits to do rq flow control can avoid the occurrence
> of RNR.
> 
> Test environment:
> - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4.
> - redis benchmark 6.2.3 and redis server 6.2.3.
> - redis server: redis-server --save "" --appendonly no
>   --protected-mode no --io-threads 7 --io-threads-do-reads yes
> - redis client: redis-benchmark -h 192.168.26.36 -q -t set,get
>   -P 1 --threads 7 -n 2000000 -c 200 -d 10
> 
>  Before:
>  SET: 205229.23 requests per second, p50=0.799 msec
>  GET: 212278.16 requests per second, p50=0.751 msec
> 
>  After:
>  SET: 623674.69 requests per second, p50=0.303 msec
>  GET: 688326.00 requests per second, p50=0.271 msec
> 
> The test of redis-benchmark shows that more than 3X rps
> improvement after the implementation of rq flow control.
> 
> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
> ---
>  net/smc/af_smc.c   | 12 ++++++
>  net/smc/smc_cdc.c  | 10 ++++-
>  net/smc/smc_cdc.h  |  3 +-
>  net/smc/smc_clc.c  |  3 ++
>  net/smc/smc_clc.h  |  3 +-
>  net/smc/smc_core.h | 17 ++++++++-
>  net/smc/smc_ib.c   |  6 ++-
>  net/smc/smc_llc.c  | 92 +++++++++++++++++++++++++++++++++++++++++++++-
>  net/smc/smc_llc.h  |  5 +++
>  net/smc/smc_wr.c   | 30 ++++++++++++---
>  net/smc/smc_wr.h   | 54 ++++++++++++++++++++++++++-
>  11 files changed, 222 insertions(+), 13 deletions(-)

<...>

> +		// set peer rq credits watermark, if less than init_credits * 2/3,
> +		// then credit announcement is needed.

<...>

> +		// set peer rq credits watermark, if less than init_credits * 2/3,
> +		// then credit announcement is needed.

<...>

> +	// credits have already been announced to peer

<...>

> +	// set local rq credits high watermark to lnk->wr_rx_cnt / 3,
> +	// if local rq credits more than high watermark, announcement is needed.

<...>

> +// get one tx credit, and peer rq credits dec

<...>

> +// put tx credits, when some failures occurred after tx credits got
> +// or receive announce credits msgs
> +static inline void smc_wr_tx_put_credits(struct smc_link *link, int credits, bool wakeup)

<...>

> +// to check whether peer rq credits is lower than watermark.
> +static inline int smc_wr_tx_credits_need_announce(struct smc_link *link)

<...>

> +// get local rq credits and set credits to zero.
> +// may called when announcing credits
> +static inline int smc_wr_rx_get_credits(struct smc_link *link)

Please try to use C-style comments.

Thanks
Guangguan Wang Jan. 20, 2022, 9:20 a.m. UTC | #2
On 2022/1/20 16:24, Leon Romanovsky wrote:
> On Thu, Jan 20, 2022 at 02:51:40PM +0800, Guangguan Wang wrote:
>> This implement rq flow control in smc-r link layer. QPs
>> communicating without rq flow control, in the previous
>> version, may result in RNR (receiver not ready) error, which
>> means when sq sends a message to the remote qp, but the
>> remote qp's rq has no valid rq entities to receive the message.
>> In RNR condition, the rdma transport layer may retransmit
>> the messages again and again until the rq has any entities,
>> which may lower the performance, especially in heavy traffic.
>> Using credits to do rq flow control can avoid the occurrence
>> of RNR.
>>
>> Test environment:
>> - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4.
>> - redis benchmark 6.2.3 and redis server 6.2.3.
>> - redis server: redis-server --save "" --appendonly no
>>   --protected-mode no --io-threads 7 --io-threads-do-reads yes
>> - redis client: redis-benchmark -h 192.168.26.36 -q -t set,get
>>   -P 1 --threads 7 -n 2000000 -c 200 -d 10
>>
>>  Before:
>>  SET: 205229.23 requests per second, p50=0.799 msec
>>  GET: 212278.16 requests per second, p50=0.751 msec
>>
>>  After:
>>  SET: 623674.69 requests per second, p50=0.303 msec
>>  GET: 688326.00 requests per second, p50=0.271 msec
>>
>> The test of redis-benchmark shows that more than 3X rps
>> improvement after the implementation of rq flow control.
>>
>> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
>> ---
>>  net/smc/af_smc.c   | 12 ++++++
>>  net/smc/smc_cdc.c  | 10 ++++-
>>  net/smc/smc_cdc.h  |  3 +-
>>  net/smc/smc_clc.c  |  3 ++
>>  net/smc/smc_clc.h  |  3 +-
>>  net/smc/smc_core.h | 17 ++++++++-
>>  net/smc/smc_ib.c   |  6 ++-
>>  net/smc/smc_llc.c  | 92 +++++++++++++++++++++++++++++++++++++++++++++-
>>  net/smc/smc_llc.h  |  5 +++
>>  net/smc/smc_wr.c   | 30 ++++++++++++---
>>  net/smc/smc_wr.h   | 54 ++++++++++++++++++++++++++-
>>  11 files changed, 222 insertions(+), 13 deletions(-)
> 
> <...>
> 
>> +		// set peer rq credits watermark, if less than init_credits * 2/3,
>> +		// then credit announcement is needed.
> 
> <...>
> 
>> +		// set peer rq credits watermark, if less than init_credits * 2/3,
>> +		// then credit announcement is needed.
> 
> <...>
> 
>> +	// credits have already been announced to peer
> 
> <...>
> 
>> +	// set local rq credits high watermark to lnk->wr_rx_cnt / 3,
>> +	// if local rq credits more than high watermark, announcement is needed.
> 
> <...>
> 
>> +// get one tx credit, and peer rq credits dec
> 
> <...>
> 
>> +// put tx credits, when some failures occurred after tx credits got
>> +// or receive announce credits msgs
>> +static inline void smc_wr_tx_put_credits(struct smc_link *link, int credits, bool wakeup)
> 
> <...>
> 
>> +// to check whether peer rq credits is lower than watermark.
>> +static inline int smc_wr_tx_credits_need_announce(struct smc_link *link)
> 
> <...>
> 
>> +// get local rq credits and set credits to zero.
>> +// may called when announcing credits
>> +static inline int smc_wr_rx_get_credits(struct smc_link *link)
> 
> Please try to use C-style comments.
> 
> Thanks

Thanks for your advice, I will modify it in the next version of patch.
Dust Li Jan. 20, 2022, 9:51 a.m. UTC | #3
On Thu, Jan 20, 2022 at 02:51:40PM +0800, Guangguan Wang wrote:
>This implement rq flow control in smc-r link layer. QPs
>communicating without rq flow control, in the previous
>version, may result in RNR (receiver not ready) error, which
>means when sq sends a message to the remote qp, but the
>remote qp's rq has no valid rq entities to receive the message.
>In RNR condition, the rdma transport layer may retransmit
>the messages again and again until the rq has any entities,
>which may lower the performance, especially in heavy traffic.
>Using credits to do rq flow control can avoid the occurrence
>of RNR.

I'm wondering whether SRQ can be used to solve this problem?

One of my concerns with credit-based flow control is that if the RTT is
a bit longer, we may have to wait RTT/2 for the peer to grant us credits
before we can really send more data. That may decrease the maximum
bandwidth we can achieve in this case.
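
To make the concern concrete, here is a rough sketch under a simple window model, which is an assumption for illustration and not something taken from the patch: if at most `credits` messages may be outstanding and a consumed credit only comes back about one round trip later, the message rate is bounded by credits / RTT. The function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on messages per second when at most `credits` messages may be
 * outstanding and each consumed credit returns roughly one RTT later.
 * Assumed window model for illustration only, not the patch's behavior. */
double credit_limited_rate(unsigned int credits, double rtt_seconds)
{
	assert(rtt_seconds > 0.0);
	return (double)credits / rtt_seconds;
}
```

For example, with 16 credits and an RTT of 100 microseconds the bound is about 160000 messages per second; doubling the RTT halves it unless the number of credits grows as well.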
Karsten Graul Jan. 20, 2022, 11:03 a.m. UTC | #4
On 20/01/2022 07:51, Guangguan Wang wrote:
> This implement rq flow control in smc-r link layer. QPs
> communicating without rq flow control, in the previous
> version, may result in RNR (receiver not ready) error, which
> means when sq sends a message to the remote qp, but the
> remote qp's rq has no valid rq entities to receive the message.
> In RNR condition, the rdma transport layer may retransmit
> the messages again and again until the rq has any entities,
> which may lower the performance, especially in heavy traffic.
> Using credits to do rq flow control can avoid the occurrence
> of RNR.
> 
> Test environment:
> - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4.
> - redis benchmark 6.2.3 and redis server 6.2.3.
> - redis server: redis-server --save "" --appendonly no
>   --protected-mode no --io-threads 7 --io-threads-do-reads yes
> - redis client: redis-benchmark -h 192.168.26.36 -q -t set,get
>   -P 1 --threads 7 -n 2000000 -c 200 -d 10
> 
>  Before:
>  SET: 205229.23 requests per second, p50=0.799 msec
>  GET: 212278.16 requests per second, p50=0.751 msec
> 
>  After:
>  SET: 623674.69 requests per second, p50=0.303 msec
>  GET: 688326.00 requests per second, p50=0.271 msec
> 
> The test of redis-benchmark shows that more than 3X rps
> improvement after the implementation of rq flow control.
> 
> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
> ---

I really appreciate your effort to improve the performance and solve existing bottlenecks,
but please keep in mind that the SMC module implements the IBM SMC protocol that is
described here: https://www.ibm.com/support/pages/node/6326337
(you can find these links in the source code, too).

Your patch makes changes that are not described in this design paper and may lead to
future incompatibilities with other platforms that support the IBM SMC protocol.

For example:
- you start using one of the reserved bytes in struct smc_cdc_msg
- you define a new smc_llc message type 0x0A
- you change the maximum number of connections per link group from 255 to 32
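
To illustrate the first point: the patch turns `u8 reserved[18]` in struct smc_cdc_msg into `u8 credits` plus `u8 reserved[17]`. A minimal sketch, using hypothetical stand-in structs that model only this tail of the message, shows that the size stays the same while the first formerly reserved byte gains a meaning that the design paper does not describe:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the tail of struct smc_cdc_msg only;
 * the real struct has more fields in front of these. */
struct cdc_tail_before {
	uint8_t reserved[18];
};

struct cdc_tail_after {
	uint8_t credits;	/* new meaning for a formerly reserved byte */
	uint8_t reserved[17];
};

/* The wire size does not change ... */
static_assert(sizeof(struct cdc_tail_before) == sizeof(struct cdc_tail_after),
	      "tail size must stay the same");
/* ... but byte 0 of the old reserved area is now interpreted as credits. */
static_assert(offsetof(struct cdc_tail_after, credits) == 0,
	      "credits occupies the first formerly reserved byte");
```

A peer that follows the published format still treats that byte as reserved, which is why such a change has to be coordinated.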

We need to start a discussion about your (good!) ideas with the owners of the protocol.
Tony Lu Jan. 20, 2022, 2:22 p.m. UTC | #5
On Thu, Jan 20, 2022 at 02:51:40PM +0800, Guangguan Wang wrote:
> This implement rq flow control in smc-r link layer. QPs
> communicating without rq flow control, in the previous
> version, may result in RNR (receiver not ready) error, which
> means when sq sends a message to the remote qp, but the
> remote qp's rq has no valid rq entities to receive the message.
> In RNR condition, the rdma transport layer may retransmit
> the messages again and again until the rq has any entities,
> which may lower the performance, especially in heavy traffic.
> Using credits to do rq flow control can avoid the occurrence
> of RNR.
> 
> Test environment:
> - CPU Intel Xeon Platinum 8 core, mem 32 GiB, nic Mellanox CX4.
> - redis benchmark 6.2.3 and redis server 6.2.3.
> - redis server: redis-server --save "" --appendonly no
>   --protected-mode no --io-threads 7 --io-threads-do-reads yes
> - redis client: redis-benchmark -h 192.168.26.36 -q -t set,get
>   -P 1 --threads 7 -n 2000000 -c 200 -d 10
> 
>  Before:
>  SET: 205229.23 requests per second, p50=0.799 msec
>  GET: 212278.16 requests per second, p50=0.751 msec
> 
>  After:
>  SET: 623674.69 requests per second, p50=0.303 msec
>  GET: 688326.00 requests per second, p50=0.271 msec
> 
> The test of redis-benchmark shows that more than 3X rps
> improvement after the implementation of rq flow control.

There seem to be lots of RNR retransmissions in your environment. It would be
better to provide more benchmark data for different cases with this
patch. For different scenarios, such as large packets, perhaps we can
use more fine-grained flow control.
 
>  #include "smc_ib.h"
>  
> -#define SMC_RMBS_PER_LGR_MAX	255	/* max. # of RMBs per link group */
> +#define SMC_RMBS_PER_LGR_MAX	32	/* max. # of RMBs per link group. Correspondingly,
> +					 * SMC_WR_BUF_CNT should not be less than 2 *
> +					 * SMC_RMBS_PER_LGR_MAX, since every connection at
> +					 * least has two rq/sq credits in average, otherwise
> +					 * may result in waiting for credits in sending process.
> +					 */

This sets a fixed limit on connections per link group. Using tunable
knobs to control this for different workloads would be better. It also
reduces the competition for free slots in the same link group and link.

Thank you,
Tony Lu
Guangguan Wang Jan. 21, 2022, 4:21 p.m. UTC | #6
On 2022/1/20 17:51, dust.li wrote:
> On Thu, Jan 20, 2022 at 02:51:40PM +0800, Guangguan Wang wrote:
>> This implement rq flow control in smc-r link layer. QPs
>> communicating without rq flow control, in the previous
>> version, may result in RNR (receiver not ready) error, which
>> means when sq sends a message to the remote qp, but the
>> remote qp's rq has no valid rq entities to receive the message.
>> In RNR condition, the rdma transport layer may retransmit
>> the messages again and again until the rq has any entities,
>> which may lower the performance, especially in heavy traffic.
>> Using credits to do rq flow control can avoid the occurrence
>> of RNR.
> 
> I'm wondering if SRQ can be used to solve this problem ?
> 
> One of my concern on credit-base flow control is if the RTT is
> a bit longer, we may have to wait RTT/2 for peer to grant us credit
> before we can really send more data. That may decrease the maximum
> bandwidth we can achieve in this case.

A longer RTT can indeed result in more inflight messages and increase
the announcement latency.

The following measures are used in this patch to reduce the impact
of this situation.
- More rqe. (An average of 2 credits per smc_connection now; a longer RTT is
  a good case for me to check whether an average of 2 is enough. As
  each additional rqe only increases memory use by 104 bytes,
  SRQ may be an icing-on-the-cake option to reduce memory usage.)
- Announce frequently. (Credits are carried by every cdc msg.)
- Avoid credit accumulation. (Announce as soon as the low watermark (1/3 of the rq entities) is reached.)
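
The accounting behind these items can be sketched as a small userspace model. This is not the patch code: the struct and function names are hypothetical, it is single-threaded (the patch uses atomics), and the exact comparisons are assumptions. Only the thresholds follow the patch comments quoted earlier in the thread (max(init_credits * 2 / 3, 1) for the peer watermark, wr_rx_cnt / 3 for the local one).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of the per-link credit state. */
struct credit_link {
	int peer_rq_credits;	/* receive slots we may still use at the peer */
	int local_rq_credits;	/* reposted local slots not yet announced */
	int peer_low;		/* low watermark for the peer's credits */
	int local_high;		/* high watermark for unannounced local credits */
};

void credit_link_init(struct credit_link *l, int init_credits, int wr_rx_cnt)
{
	int low = init_credits * 2 / 3;

	l->peer_rq_credits = init_credits;
	l->local_rq_credits = 0;
	l->peer_low = low > 1 ? low : 1;	/* max(init_credits * 2 / 3, 1) */
	l->local_high = wr_rx_cnt / 3;
}

/* Take one tx credit; false means the sender has to wait. */
bool credit_tx_get(struct credit_link *l)
{
	if (l->peer_rq_credits <= 0)
		return false;
	l->peer_rq_credits--;
	return true;
}

/* Give credits back: either announced by the peer or returned after a
 * failed send. */
void credit_tx_put(struct credit_link *l, int credits)
{
	assert(credits >= 0);
	l->peer_rq_credits += credits;
}

/* True when the peer's remaining credits are below the low watermark
 * (the strict comparison is an assumption). */
bool credit_tx_need_announce(const struct credit_link *l)
{
	return l->peer_rq_credits < l->peer_low;
}

/* One local receive buffer was reposted; returns true once the
 * unannounced credits exceed the high watermark. */
bool credit_rx_reposted(struct credit_link *l)
{
	l->local_rq_credits++;
	return l->local_rq_credits > l->local_high;
}

/* Take all unannounced local credits (for example to put them into the
 * next cdc msg) and reset the counter to zero. */
int credit_rx_take(struct credit_link *l)
{
	int credits = l->local_rq_credits;

	l->local_rq_credits = 0;
	return credits;
}
```

With init_credits = 16 and wr_rx_cnt = 16 the model gives a peer watermark of 10 and a local watermark of 5, so the sixth reposted buffer without an intervening cdc msg would trigger an explicit announcement.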
Guangguan Wang Jan. 21, 2022, 4:36 p.m. UTC | #7
On 2022/1/20 19:03, Karsten Graul wrote:
>
> I really appreciate your effort to improve the performance and solve existing bottle necks,
> but please keep in mind that the SMC module implements the IBM SMC protocol that is
> described here: https://www.ibm.com/support/pages/node/6326337
> (you can find these links in the source code, too).
> 
> Your patch makes changes that are not described in this design paper and may lead to
> future incompatibilities with other platforms that support the IBM SMC protocol.
> 
> For example:
> - you start using one of the reserved bytes in struct smc_cdc_msg
> - you define a new smc_llc message type 0x0A
> - you change the maximum number of connections per link group from 255 to 32
> 
> We need to start a discussion about your (good!) ideas with the owners of the protocol.

Thanks for acknowledging my effort. I look forward
to the conclusion of the protocol discussion.
Guangguan Wang Jan. 21, 2022, 4:48 p.m. UTC | #8
On 2022/1/20 22:22, Tony Lu wrote:
>>  #include "smc_ib.h"
>>  
>> -#define SMC_RMBS_PER_LGR_MAX	255	/* max. # of RMBs per link group */
>> +#define SMC_RMBS_PER_LGR_MAX	32	/* max. # of RMBs per link group. Correspondingly,
>> +					 * SMC_WR_BUF_CNT should not be less than 2 *
>> +					 * SMC_RMBS_PER_LGR_MAX, since every connection at
>> +					 * least has two rq/sq credits in average, otherwise
>> +					 * may result in waiting for credits in sending process.
>> +					 */
> 
> This gives a fixed limit for per link group connections. Using tunable
> knobs to control this for different workload would be better. It also
> reduce the completion of free slots in the same link group and link.
> 

It is a good idea, but I found a patch (https://lore.kernel.org/linux-s390/20220114054852.38058-7-tonylu@linux.alibaba.com/) in which you have already implemented this idea.
Stefan Raspl Jan. 25, 2022, 9:42 a.m. UTC | #9
On 1/20/22 07:51, Guangguan Wang wrote:
> This implement rq flow control in smc-r link layer. QPs
> communicating without rq flow control, in the previous
> version, may result in RNR (receiver not ready) error, which
> means when sq sends a message to the remote qp, but the
> remote qp's rq has no valid rq entities to receive the message.
> In RNR condition, the rdma transport layer may retransmit
> the messages again and again until the rq has any entities,
> which may lower the performance, especially in heavy traffic.
> Using credits to do rq flow control can avoid the occurrence
> of RNR.

That's some truly substantial improvements!
But we need to be careful with protocol-level changes: There are other operating 
systems like z/OS and AIX which have compatible implementations of SMC, too. 
Changes like a reduction of connections per link group or usage of reserved 
fields would need to be coordinated, and likely would have unwanted side-effects 
even when used with older Linux kernel versions.
Changing the protocol is "expensive" insofar as it requires time to thoroughly 
discuss the changes, perform compatibility tests, and so on.
So I would like to urge you to investigate alternative ways that do not require 
protocol-level changes to address this scenario, e.g. by modifying the number of 
completion queue elements, to see if this could yield similar results.

Thx!
Guangguan Wang Jan. 29, 2022, 3:43 a.m. UTC | #10
On 2022/1/25 17:42, Stefan Raspl wrote:
> 
> That's some truly substantial improvements!
> But we need to be careful with protocol-level changes: There are other operating systems like z/OS and AIX which have compatible implementations of SMC, too. Changes like a reduction of connections per link group or usage of reserved fields would need to be coordinated, and likely would have unwanted side-effects even when used with older Linux kernel versions.
> Changing the protocol is "expensive" insofar as it requires time to thoroughly discuss the changes, perform compatibility tests, and so on.
> So I would like to urge you to investigate alternative ways that do not require protocol-level changes to address this scenario, e.g. by modifying the number of completion queue elements, to see if this could yield similar results.
> 
> Thx!
> 

Yes, there are alternative ways. RNR is caused by a mismatch between the send rate and the receive rate, which means sending too fast
or receiving too slowly. What I have done in this patch is to apply backpressure to the sending side when it sends too fast.

Another solution is to process and refill the receive queue as quickly as possible, which requires no protocol-level change.
The following modifications are needed:
- Enqueue cdc msgs to backlog queues instead of processing them in the rx tasklet. llc msgs remain unchanged.
- A mempool is needed as cdc msgs are processed asynchronously. Allocate new receive buffers from the mempool when refilling the receive queue.
- Schedule backlog queues to other cpus, which are chosen by a 4-tuple or 5-tuple hash of the connections, to process the cdc msgs,
  in order to reduce the usage of the cpu on which the rx tasklet runs.
The pseudocode is shown below:
rx_tasklet
    if cdc_msgs
        enqueue to backlog;
        maybe smp_call_function_single_async is needed to wake up the corresponding cpu to process the backlog;
        allocate new buffer and modify the sge in rq_wr;
    else
        process remains unchanged;
    endif

    post_recv rq_wr;
end rx_tasklet

smp_backlog_process in corresponding cpu, called by smp_call_function_single_async
    for connections hashed to this cpu
        for cdc_msgs in backlog
            process cdc msgs;
        end cdc_msgs
    end connections
end smp_backlog_process
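
As a rough userspace sketch of the dispatch step in the pseudocode: a connection tuple is hashed to a cpu index and the cdc msg is put into that cpu's fixed-size backlog. All names, the FNV-1a hash and the ring depth are assumptions for illustration; the real code would use kernel primitives, and something like smp_call_function_single_async would take the place of a direct call to the drain side.

```c
#include <assert.h>
#include <stdint.h>

#define BACKLOG_DEPTH 8		/* hypothetical ring size */

/* Hypothetical 4-tuple that identifies a connection. */
struct conn_tuple {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

/* Fixed-size FIFO of message ids, one per cpu. */
struct backlog {
	int msgs[BACKLOG_DEPTH];
	unsigned int head;	/* next slot to read */
	unsigned int count;	/* number of queued messages */
};

/* Mix one 32-bit value into an FNV-1a hash, byte by byte. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xffu;
		h *= 16777619u;
	}
	return h;
}

/* Map a connection to a cpu; the same tuple always gives the same cpu,
 * so the msgs of one connection stay in order. */
unsigned int tuple_to_cpu(const struct conn_tuple *t, unsigned int nr_cpus)
{
	uint32_t h = 2166136261u;

	assert(nr_cpus > 0);
	h = fnv1a_u32(h, t->saddr);
	h = fnv1a_u32(h, t->daddr);
	h = fnv1a_u32(h, ((uint32_t)t->sport << 16) | t->dport);
	return h % nr_cpus;
}

/* Returns 0 on success, -1 when the backlog is full. */
int backlog_enqueue(struct backlog *b, int msg)
{
	if (b->count == BACKLOG_DEPTH)
		return -1;
	b->msgs[(b->head + b->count) % BACKLOG_DEPTH] = msg;
	b->count++;
	return 0;
}

/* Returns 0 and stores the oldest msg, or -1 when the backlog is empty. */
int backlog_dequeue(struct backlog *b, int *msg)
{
	if (b->count == 0)
		return -1;
	*msg = b->msgs[b->head];
	b->head = (b->head + 1) % BACKLOG_DEPTH;
	b->count--;
	return 0;
}
```

The -1 return of backlog_enqueue marks the point where a policy for a full backlog would have to be decided.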

I'd like to hear your suggestions on this solution.
Thank you.
Tony Lu Jan. 29, 2022, 4:24 a.m. UTC | #11
On Sat, Jan 29, 2022 at 11:43:22AM +0800, Guangguan Wang wrote:
> 
> Yes, there are alternative ways. RNR is caused by a mismatch between the send rate and the receive rate, which means sending too fast
> or receiving too slowly. What I have done in this patch is to apply backpressure to the sending side when it sends too fast.
>
> Another solution is to process and refill the receive queue as quickly as possible, which requires no protocol-level change.
> The following modifications are needed:
> - Enqueue cdc msgs to backlog queues instead of processing them in the rx tasklet. llc msgs remain unchanged.

It's a good idea to use a backlog to offload the work from softirq. An rx backlog
can help move the heavy logic out of softirq and let an extra kthread or
workqueue handle it, and then let the kernel scheduler deal with the
balance between the userspace process and the kthread.

There are two things to be careful about: one is introducing more latency,
which requires a trade-off between latency and throughput; the other is the backlog becoming full.

> - A mempool is needed as cdc msgs are processed asynchronously. Allocate new receive buffers from the mempool when refilling the receive queue.

Yes, we need an elastically expanding RX buffer that also shrinks
elastically. This looks like tcp_mem, with three elements to limit the
memory usage. We also need to free memory automatically; basing this on memcg
pressure is a good idea.

> - Schedule backlog queues to other cpus, which are chosen by a 4-tuple or 5-tuple hash of the connections, to process the cdc msgs,
>   in order to reduce the usage of the cpu on which the rx tasklet runs.

I am wondering whether it is needed for now. In general, it should spread the
CPU usage across different cores. Which one, memory usage or CPU usage,
will reach its limit first before triggering RNR? Maybe there should be some
data to support it?

Thank you,
Tony Lu
Karsten Graul Jan. 31, 2022, 12:56 p.m. UTC | #12
On 29/01/2022 04:43, Guangguan Wang wrote:
> 
> On 2022/1/25 17:42, Stefan Raspl wrote:
>>
>> That's some truly substantial improvements!
>> But we need to be careful with protocol-level changes: There are other operating systems like z/OS and AIX which have compatible implementations of SMC, too. Changes like a reduction of connections per link group or usage of reserved fields would need to be coordinated, and likely would have unwanted side-effects even when used with older Linux kernel versions.
>> Changing the protocol is "expensive" insofar as it requires time to thoroughly discuss the changes, perform compatibility tests, and so on.
>> So I would like to urge you to investigate alternative ways that do not require protocol-level changes to address this scenario, e.g. by modifying the number of completion queue elements, to see if this could yield similar results.
>>
>> Thx!
>>
> 
> Yes, there are alternative ways. RNR is caused by a mismatch between the send rate and the receive rate, which means sending too fast
> or receiving too slowly. What I have done in this patch is to apply backpressure to the sending side when it sends too fast.
>
> Another solution is to process and refill the receive queue as quickly as possible, which requires no protocol-level change.
> The following modifications are needed:
> - Enqueue cdc msgs to backlog queues instead of processing them in the rx tasklet. llc msgs remain unchanged.
> - A mempool is needed as cdc msgs are processed asynchronously. Allocate new receive buffers from the mempool when refilling the receive queue.
> - Schedule backlog queues to other cpus, which are chosen by a 4-tuple or 5-tuple hash of the connections, to process the cdc msgs,
>   in order to reduce the usage of the cpu on which the rx tasklet runs.
> 
> The pseudocode is shown below:
> rx_tasklet
>     if cdc_msgs
>         enqueue to backlog;
> 	maybe smp_call_function_single_async is needed to wakeup the corresponding cpu to process backlog;
>         allocate new buffer and modify the sge in rq_wr;
>     else
>         process remains unchanged;
>     endif
> 
>     post_recv rq_wr;
> end rx_tasklet
> 
> smp_backlog_process in corresponding cpu, called by smp_call_function_single_async
>     for connections hashed to this cpu
>         for cdc_msgs in backlog
>             process cdc msgs;
>         end cdc_msgs
>     end connections
> end smp_backlog_process
> 
> I'd like to hear your suggestions on this solution.
> Thank you.

I like this idea, this should improve the RX handling a lot!

Patch

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index aa3bcaaeabf7..108ac85bd794 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -531,6 +531,13 @@  static void smc_link_save_peer_info(struct smc_link *link,
 	memcpy(link->peer_mac, ini->peer_mac, sizeof(link->peer_mac));
 	link->peer_psn = ntoh24(clc->r0.psn);
 	link->peer_mtu = clc->r0.qp_mtu;
+	link->credits_enable = clc->r0.init_credits ? 1 : 0;
+	if (link->credits_enable) {
+		atomic_set(&link->peer_rq_credits, clc->r0.init_credits);
+		// set peer rq credits watermark, if less than init_credits * 2/3,
+		// then credit announcement is needed.
+		link->peer_cr_watermark_low = max(clc->r0.init_credits * 2 / 3, 1);
+	}
 }
 
 static void smc_stat_inc_fback_rsn_cnt(struct smc_sock *smc,
@@ -945,6 +952,11 @@  static int smc_connect_rdma(struct smc_sock *smc,
 			goto connect_abort;
 		}
 	} else {
+		if (smc_llc_announce_credits(link, SMC_LLC_RESP, true)) {
+			reason_code = SMC_CLC_DECL_CREDITSERR;
+			goto connect_abort;
+		}
+
 		if (smcr_lgr_reg_rmbs(link, smc->conn.rmb_desc)) {
 			reason_code = SMC_CLC_DECL_ERR_REGRMB;
 			goto connect_abort;
diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index 84c8a4374fdd..6c0f71d6dc23 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -101,14 +101,18 @@  int smc_cdc_msg_send(struct smc_connection *conn,
 		     struct smc_cdc_tx_pend *pend)
 {
 	struct smc_link *link = conn->lnk;
+	struct smc_cdc_msg *cdc_msg = (struct smc_cdc_msg *)wr_buf;
 	union smc_host_cursor cfed;
+	u8 saved_credits = 0;
 	int rc;
 
 	smc_cdc_add_pending_send(conn, pend);
 
 	conn->tx_cdc_seq++;
 	conn->local_tx_ctrl.seqno = conn->tx_cdc_seq;
-	smc_host_msg_to_cdc((struct smc_cdc_msg *)wr_buf, conn, &cfed);
+	smc_host_msg_to_cdc(cdc_msg, conn, &cfed);
+	saved_credits = (u8)smc_wr_rx_get_credits(link);
+	cdc_msg->credits = saved_credits;
 
 	atomic_inc(&conn->cdc_pend_tx_wr);
 	smp_mb__after_atomic(); /* Make sure cdc_pend_tx_wr added before post */
@@ -120,6 +124,7 @@  int smc_cdc_msg_send(struct smc_connection *conn,
 	} else {
 		conn->tx_cdc_seq--;
 		conn->local_tx_ctrl.seqno = conn->tx_cdc_seq;
+		smc_wr_rx_put_credits(link, saved_credits);
 		atomic_dec(&conn->cdc_pend_tx_wr);
 	}
 
@@ -430,6 +435,9 @@  static void smc_cdc_rx_handler(struct ib_wc *wc, void *buf)
 	if (cdc->len != SMC_WR_TX_SIZE)
 		return; /* invalid message */
 
+	if (cdc->credits)
+		smc_wr_tx_put_credits(link, cdc->credits, true);
+
 	/* lookup connection */
 	lgr = smc_get_lgr(link);
 	read_lock_bh(&lgr->conns_lock);
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index 696cc11f2303..145ce7997e64 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -47,7 +47,8 @@  struct smc_cdc_msg {
 	union smc_cdc_cursor		cons;	/* piggy backed "ack" */
 	struct smc_cdc_producer_flags	prod_flags;
 	struct smc_cdc_conn_state_flags	conn_state_flags;
-	u8				reserved[18];
+	u8				credits;	/* credits synced by every cdc msg */
+	u8				reserved[17];
 };
 
 /* SMC-D cursor format */
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index 6be95a2a7b25..f477ac24c427 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -1038,9 +1038,12 @@  static int smc_clc_send_confirm_accept(struct smc_sock *smc,
 		switch (clc->hdr.type) {
 		case SMC_CLC_ACCEPT:
 			clc->r0.qp_mtu = link->path_mtu;
+			clc->r0.init_credits = (u8)link->wr_rx_cnt;
 			break;
 		case SMC_CLC_CONFIRM:
 			clc->r0.qp_mtu = min(link->path_mtu, link->peer_mtu);
+			clc->r0.init_credits =
+				link->credits_enable ? (u8)link->wr_rx_cnt : 0;
 			break;
 		}
 		clc->r0.rmbe_size = conn->rmbe_size_short;
diff --git a/net/smc/smc_clc.h b/net/smc/smc_clc.h
index 83f02f131fc0..eb4bba54d6df 100644
--- a/net/smc/smc_clc.h
+++ b/net/smc/smc_clc.h
@@ -63,6 +63,7 @@ 
 #define SMC_CLC_DECL_ERR_RTOK	0x09990001  /*	 rtoken handling failed       */
 #define SMC_CLC_DECL_ERR_RDYLNK	0x09990002  /*	 ib ready link failed	      */
 #define SMC_CLC_DECL_ERR_REGRMB	0x09990003  /*	 reg rmb failed		      */
+#define SMC_CLC_DECL_CREDITSERR	0x09990004  /*	 announce credits failed	  */
 
 #define SMC_FIRST_CONTACT_MASK	0b10	/* first contact bit within typev2 */
 
@@ -190,7 +191,7 @@  struct smcr_clc_msg_accept_confirm {	/* SMCR accept/confirm */
 	u8 qp_mtu   : 4,
 	   rmbe_size : 4;
 #endif
-	u8 reserved;
+	u8 init_credits;		/* QP rq init credits for rq flowctrl */
 	__be64 rmb_dma_addr;	/* RMB virtual address */
 	u8 reserved2;
 	u8 psn[3];		/* packet sequence number */
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 521c64a3d8d3..ce859dfdbe6c 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -21,7 +21,12 @@ 
 #include "smc.h"
 #include "smc_ib.h"
 
-#define SMC_RMBS_PER_LGR_MAX	255	/* max. # of RMBs per link group */
+#define SMC_RMBS_PER_LGR_MAX	32	/* max. # of RMBs per link group. Correspondingly,
+					 * SMC_WR_BUF_CNT should not be less than 2 *
+					 * SMC_RMBS_PER_LGR_MAX, since every connection has
+					 * at least two rq/sq credits on average; otherwise
+					 * senders may have to wait for credits.
+					 */
 
 struct smc_lgr_list {			/* list of link group definition */
 	struct list_head	list;
@@ -80,6 +85,8 @@  struct smc_rdma_wr {				/* work requests per message
 
 #define SMC_LGR_ID_SIZE		4
 
+#define SMC_LINKFLAG_ANNOUNCE_PENDING	0
+
 struct smc_link {
 	struct smc_ib_device	*smcibdev;	/* ib-device */
 	u8			ibport;		/* port - values 1 | 2 */
@@ -123,6 +130,14 @@  struct smc_link {
 	atomic_t		wr_reg_refcnt;	/* reg refs to link */
 	enum smc_wr_reg_state	wr_reg_state;	/* state of wr_reg request */
 
+	atomic_t	peer_rq_credits;	/* credits for peer rq flowctrl */
+	atomic_t	local_rq_credits;	/* credits for local rq flowctrl */
+	u8		credits_enable;		/* credits enable flag, set during negotiation */
+	u8		local_cr_watermark_high;	/* local rq credits watermark */
+	u8		peer_cr_watermark_low;	/* peer rq credits watermark */
+	struct work_struct	credits_announce_work;	/* work for credits announcement */
+	unsigned long	flags;	/* link flags, SMC_LINKFLAG_ANNOUNCE_PENDING etc. */
+
 	u8			gid[SMC_GID_SIZE];/* gid matching used vlan id*/
 	u8			sgid_index;	/* gid index for vlan id      */
 	u32			peer_qpn;	/* QP number of peer */
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index a3e2d3b89568..9c8206da0d2b 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -665,10 +665,12 @@  int smc_ib_create_queue_pair(struct smc_link *lnk)
 		.srq = NULL,
 		.cap = {
 				/* include unsolicited rdma_writes as well,
-				 * there are max. 2 RDMA_WRITE per 1 WR_SEND
+				 * there are max. 2 RDMA_WRITE per 1 WR_SEND.
+				 * RDMA_WRITE consumes send queue entries,
+				 * but no recv queue entries.
 				 */
 			.max_send_wr = SMC_WR_BUF_CNT * 3,
-			.max_recv_wr = SMC_WR_BUF_CNT * 3,
+			.max_recv_wr = SMC_WR_BUF_CNT,
 			.max_send_sge = SMC_IB_MAX_SEND_SGE,
 			.max_recv_sge = sges_per_buf,
 		},
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index c4d057b2941d..10653b4e3d3f 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -75,7 +75,8 @@  struct smc_llc_msg_add_link {		/* type 0x02 */
 	   reserved3 : 4;
 #endif
 	u8 initial_psn[3];
-	u8 reserved[8];
+	u8 init_credits;	/* QP rq init credits for rq flowctrl */
+	u8 reserved[7];
 };
 
 struct smc_llc_msg_add_link_cont_rt {
@@ -170,6 +171,12 @@  struct smc_llc_msg_delete_rkey {	/* type 0x09 */
 	u8 reserved2[4];
 };
 
+struct smc_llc_msg_announce_credits {	/* type 0x0A */
+	struct smc_llc_hdr hd;
+	u8 credits;
+	u8 reserved[39];
+};
+
 struct smc_llc_msg_delete_rkey_v2 {	/* type 0x29 */
 	struct smc_llc_hdr hd;
 	u8 num_rkeys;
@@ -189,6 +196,7 @@  union smc_llc_msg {
 	struct smc_llc_msg_delete_rkey delete_rkey;
 
 	struct smc_llc_msg_test_link test_link;
+	struct smc_llc_msg_announce_credits announce_credits;
 	struct {
 		struct smc_llc_hdr hdr;
 		u8 data[SMC_LLC_DATA_LEN];
@@ -750,6 +758,46 @@  static int smc_llc_send_test_link(struct smc_link *link, u8 user_data[16])
 	return rc;
 }
 
+/* send credits announce request or response */
+int smc_llc_announce_credits(struct smc_link *link,
+			     enum smc_llc_reqresp reqresp, bool force)
+{
+	struct smc_llc_msg_announce_credits *announce_credits;
+	struct smc_wr_tx_pend_priv *pend;
+	struct smc_wr_buf *wr_buf;
+	int rc;
+	u8 saved_credits = 0;
+
+	if (!link->credits_enable ||
+	    (!force && !smc_wr_rx_credits_need_announce(link)))
+		return 0;
+
+	saved_credits = (u8)smc_wr_rx_get_credits(link);
+	if (!saved_credits)
+		/* credits may already have been synced by a cdc msg */
+		return 0;
+
+	rc = smc_llc_add_pending_send(link, &wr_buf, &pend);
+	if (rc) {
+		smc_wr_rx_put_credits(link, saved_credits);
+		return rc;
+	}
+
+	announce_credits = (struct smc_llc_msg_announce_credits *)wr_buf;
+	memset(announce_credits, 0, sizeof(*announce_credits));
+	announce_credits->hd.common.type = SMC_LLC_ANNOUNCE_CREDITS;
+	announce_credits->hd.length = sizeof(struct smc_llc_msg_announce_credits);
+	if (reqresp == SMC_LLC_RESP)
+		announce_credits->hd.flags |= SMC_LLC_FLAG_RESP;
+	announce_credits->credits = saved_credits;
+	/* send llc message */
+	rc = smc_wr_tx_send(link, pend);
+	if (rc)
+		smc_wr_rx_put_credits(link, saved_credits);
+
+	return rc;
+}
+
 /* schedule an llc send on link, may wait for buffers */
 static int smc_llc_send_message(struct smc_link *link, void *llcbuf)
 {
@@ -1012,6 +1060,13 @@  static void smc_llc_save_add_link_info(struct smc_link *link,
 	memcpy(link->peer_mac, add_llc->sender_mac, ETH_ALEN);
 	link->peer_psn = ntoh24(add_llc->initial_psn);
 	link->peer_mtu = add_llc->qp_mtu;
+	link->credits_enable = add_llc->init_credits ? 1 : 0;
+	if (link->credits_enable) {
+		atomic_set(&link->peer_rq_credits, add_llc->init_credits);
+		// set the peer rq credits low watermark: once peer credits drop
+		// to init_credits * 2 / 3 or below, a credit announcement is needed.
+		link->peer_cr_watermark_low = max(add_llc->init_credits * 2 / 3, 1);
+	}
 }
 
 /* as an SMC client, process an add link request */
@@ -1931,6 +1986,10 @@  static void smc_llc_event_handler(struct smc_llc_qentry *qentry)
 			smc_llc_flow_stop(lgr, &lgr->llc_flow_rmt);
 		}
 		return;
+	case SMC_LLC_ANNOUNCE_CREDITS:
+		if (smc_link_active(link))
+			smc_wr_tx_put_credits(link, llc->announce_credits.credits, true);
+		break;
 	case SMC_LLC_REQ_ADD_LINK:
 		/* handle response here, smc_llc_flow_stop() cannot be called
 		 * in tasklet context
@@ -2016,6 +2075,10 @@  static void smc_llc_rx_response(struct smc_link *link,
 	case SMC_LLC_CONFIRM_RKEY_CONT:
 		/* not used because max links is 3 */
 		break;
+	case SMC_LLC_ANNOUNCE_CREDITS:
+		if (smc_link_active(link))
+			smc_wr_tx_put_credits(link, qentry->msg.announce_credits.credits, true);
+		break;
 	default:
 		smc_llc_protocol_violation(link->lgr,
 					   qentry->msg.raw.hdr.common.type);
@@ -2109,6 +2172,27 @@  static void smc_llc_testlink_work(struct work_struct *work)
 	schedule_delayed_work(&link->llc_testlink_wrk, next_interval);
 }
 
+static void smc_llc_announce_credits_work(struct work_struct *work)
+{
+	struct smc_link *link = container_of(work,
+					     struct smc_link, credits_announce_work);
+	int rc, retry = 0, agains = 0;
+
+again:
+	do {
+		rc = smc_llc_announce_credits(link, SMC_LLC_RESP, false);
+	} while ((rc == -EBUSY) && smc_link_sendable(link) &&
+			(retry++ < SMC_LLC_ANNOUNCE_CR_MAX_RETRY));
+
+	if (smc_wr_rx_credits_need_announce(link) &&
+	    smc_link_sendable(link) && agains <= 5 && !rc) {
+		agains++;
+		goto again;
+	}
+
+	clear_bit(SMC_LINKFLAG_ANNOUNCE_PENDING, &link->flags);
+}
+
 void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc)
 {
 	struct net *net = sock_net(smc->clcsock->sk);
@@ -2144,6 +2228,7 @@  int smc_llc_link_init(struct smc_link *link)
 {
 	init_completion(&link->llc_testlink_resp);
 	INIT_DELAYED_WORK(&link->llc_testlink_wrk, smc_llc_testlink_work);
+	INIT_WORK(&link->credits_announce_work, smc_llc_announce_credits_work);
 	return 0;
 }
 
@@ -2177,6 +2262,7 @@  void smc_llc_link_clear(struct smc_link *link, bool log)
 				    link->smcibdev->ibdev->name, link->ibport);
 	complete(&link->llc_testlink_resp);
 	cancel_delayed_work_sync(&link->llc_testlink_wrk);
+	cancel_work_sync(&link->credits_announce_work);
 }
 
 /* register a new rtoken at the remote peer (for all links) */
@@ -2291,6 +2377,10 @@  static struct smc_wr_rx_handler smc_llc_rx_handlers[] = {
 		.handler	= smc_llc_rx_handler,
 		.type		= SMC_LLC_DELETE_RKEY
 	},
+	{
+		.handler	= smc_llc_rx_handler,
+		.type		= SMC_LLC_ANNOUNCE_CREDITS
+	},
 	/* V2 types */
 	{
 		.handler	= smc_llc_rx_handler,
diff --git a/net/smc/smc_llc.h b/net/smc/smc_llc.h
index 4404e52b3346..f8a14643faf4 100644
--- a/net/smc/smc_llc.h
+++ b/net/smc/smc_llc.h
@@ -20,6 +20,8 @@ 
 #define SMC_LLC_WAIT_FIRST_TIME		(5 * HZ)
 #define SMC_LLC_WAIT_TIME		(2 * HZ)
 
+#define SMC_LLC_ANNOUNCE_CR_MAX_RETRY	(1)
+
 enum smc_llc_reqresp {
 	SMC_LLC_REQ,
 	SMC_LLC_RESP
@@ -35,6 +37,7 @@  enum smc_llc_msg_type {
 	SMC_LLC_TEST_LINK		= 0x07,
 	SMC_LLC_CONFIRM_RKEY_CONT	= 0x08,
 	SMC_LLC_DELETE_RKEY		= 0x09,
+	SMC_LLC_ANNOUNCE_CREDITS	= 0x0A,
 	/* V2 types */
 	SMC_LLC_CONFIRM_LINK_V2		= 0x21,
 	SMC_LLC_ADD_LINK_V2		= 0x22,
@@ -86,6 +89,8 @@  int smc_llc_send_add_link(struct smc_link *link, u8 mac[], u8 gid[],
 int smc_llc_send_delete_link(struct smc_link *link, u8 link_del_id,
 			     enum smc_llc_reqresp reqresp, bool orderly,
 			     u32 reason);
+int smc_llc_announce_credits(struct smc_link *link,
+			     enum smc_llc_reqresp reqresp, bool force);
 void smc_llc_srv_delete_link_local(struct smc_link *link, u8 del_link_id);
 void smc_llc_lgr_init(struct smc_link_group *lgr, struct smc_sock *smc);
 void smc_llc_lgr_clear(struct smc_link_group *lgr);
diff --git a/net/smc/smc_wr.c b/net/smc/smc_wr.c
index 24be1d03fef9..7a0136e1f78a 100644
--- a/net/smc/smc_wr.c
+++ b/net/smc/smc_wr.c
@@ -173,11 +173,16 @@  static inline int smc_wr_tx_get_free_slot_index(struct smc_link *link, u32 *idx)
 	*idx = link->wr_tx_cnt;
 	if (!smc_link_sendable(link))
 		return -ENOLINK;
+
+	if (!smc_wr_tx_get_credit(link))
+		return -EBUSY;
+
 	for_each_clear_bit(*idx, link->wr_tx_mask, link->wr_tx_cnt) {
 		if (!test_and_set_bit(*idx, link->wr_tx_mask))
 			return 0;
 	}
 	*idx = link->wr_tx_cnt;
+	smc_wr_tx_put_credits(link, 1, false);
 	return -EBUSY;
 }
 
@@ -283,7 +288,7 @@  int smc_wr_tx_put_slot(struct smc_link *link,
 		memset(&link->wr_tx_bufs[idx], 0,
 		       sizeof(link->wr_tx_bufs[idx]));
 		test_and_clear_bit(idx, link->wr_tx_mask);
-		wake_up(&link->wr_tx_wait);
+		smc_wr_tx_put_credits(link, 1, true);
 		return 1;
 	} else if (link->lgr->smc_version == SMC_V2 &&
 		   pend->idx == link->wr_tx_cnt) {
@@ -471,6 +476,12 @@  static inline void smc_wr_rx_process_cqes(struct ib_wc wc[], int num)
 				break;
 			}
 		}
+
+		if (smc_wr_rx_credits_need_announce(link) &&
+		    !test_bit(SMC_LINKFLAG_ANNOUNCE_PENDING, &link->flags)) {
+			set_bit(SMC_LINKFLAG_ANNOUNCE_PENDING, &link->flags);
+			schedule_work(&link->credits_announce_work);
+		}
 	}
 }
 
@@ -513,6 +524,8 @@  int smc_wr_rx_post_init(struct smc_link *link)
 
 	for (i = 0; i < link->wr_rx_cnt; i++)
 		rc = smc_wr_rx_post(link);
+	// credits have already been announced to peer
+	atomic_set(&link->local_rq_credits, 0);
 	return rc;
 }
 
@@ -547,7 +560,7 @@  void smc_wr_remember_qp_attr(struct smc_link *lnk)
 
 	lnk->wr_tx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
 			       lnk->qp_attr.cap.max_send_wr);
-	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT * 3,
+	lnk->wr_rx_cnt = min_t(size_t, SMC_WR_BUF_CNT,
 			       lnk->qp_attr.cap.max_recv_wr);
 }
 
@@ -736,7 +749,7 @@  int smc_wr_alloc_link_mem(struct smc_link *link)
 	link->wr_tx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE, GFP_KERNEL);
 	if (!link->wr_tx_bufs)
 		goto no_mem;
-	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT * 3, SMC_WR_BUF_SIZE,
+	link->wr_rx_bufs = kcalloc(SMC_WR_BUF_CNT, SMC_WR_BUF_SIZE,
 				   GFP_KERNEL);
 	if (!link->wr_rx_bufs)
 		goto no_mem_wr_tx_bufs;
@@ -744,7 +757,7 @@  int smc_wr_alloc_link_mem(struct smc_link *link)
 				  GFP_KERNEL);
 	if (!link->wr_tx_ibs)
 		goto no_mem_wr_rx_bufs;
-	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_ibs = kcalloc(SMC_WR_BUF_CNT,
 				  sizeof(link->wr_rx_ibs[0]),
 				  GFP_KERNEL);
 	if (!link->wr_rx_ibs)
@@ -763,7 +776,7 @@  int smc_wr_alloc_link_mem(struct smc_link *link)
 				   GFP_KERNEL);
 	if (!link->wr_tx_sges)
 		goto no_mem_wr_tx_rdma_sges;
-	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT * 3,
+	link->wr_rx_sges = kcalloc(SMC_WR_BUF_CNT,
 				   sizeof(link->wr_rx_sges[0]) * sges_per_buf,
 				   GFP_KERNEL);
 	if (!link->wr_rx_sges)
@@ -886,6 +899,13 @@  int smc_wr_create_link(struct smc_link *lnk)
 	atomic_set(&lnk->wr_tx_refcnt, 0);
 	init_waitqueue_head(&lnk->wr_reg_wait);
 	atomic_set(&lnk->wr_reg_refcnt, 0);
+	atomic_set(&lnk->peer_rq_credits, 0);
+	atomic_set(&lnk->local_rq_credits, 0);
+	lnk->flags = 0;
+	// set the local rq credits high watermark to lnk->wr_rx_cnt / 3;
+	// once local rq credits reach it, an announcement is needed.
+	lnk->local_cr_watermark_high = max(lnk->wr_rx_cnt / 3, 1U);
+	lnk->peer_cr_watermark_low = 0;
 	return rc;
 
 dma_unmap:
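
For readers following the accounting in the hunks above, here is a minimal user-space sketch of the credit scheme; it is not the kernel code. It assumes C11 <stdatomic.h> in place of the kernel's atomic_t helpers, and struct credit_link and every function name below are hypothetical; they only mirror the roles of peer_rq_credits, local_rq_credits and local_cr_watermark_high.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of one side of a link. */
struct credit_link {
	atomic_int peer_rq_credits;	/* sends we may still post to the peer */
	atomic_int local_rq_credits;	/* reposted receives not yet announced */
	int local_watermark_high;	/* announce once this many are pending */
};

static void credit_link_init(struct credit_link *l, int peer_init, int rx_cnt)
{
	atomic_init(&l->peer_rq_credits, peer_init);
	atomic_init(&l->local_rq_credits, 0);
	/* same rule as the patch: a third of the receive queue, at least 1 */
	l->local_watermark_high = rx_cnt / 3 > 1 ? rx_cnt / 3 : 1;
}

/* Take one send credit; false means the sender has to wait. */
static bool tx_get_credit(struct credit_link *l)
{
	int old = atomic_load(&l->peer_rq_credits);

	/* decrement only while positive; a failed CAS reloads 'old' */
	while (old > 0) {
		if (atomic_compare_exchange_weak(&l->peer_rq_credits,
						 &old, old - 1))
			return true;
	}
	return false;
}

/* Return credits after a failed send, or add credits the peer announced. */
static void tx_put_credits(struct credit_link *l, int n)
{
	atomic_fetch_add(&l->peer_rq_credits, n);
}

/* One receive buffer was reposted, so one more credit can be announced. */
static void rx_put_credit(struct credit_link *l)
{
	atomic_fetch_add(&l->local_rq_credits, 1);
}

/* True once enough unannounced credits have piled up. */
static bool rx_need_announce(struct credit_link *l)
{
	return atomic_load(&l->local_rq_credits) >= l->local_watermark_high;
}

/* Fetch all unannounced credits and reset the counter in one step. */
static int rx_take_credits(struct credit_link *l)
{
	return atomic_exchange(&l->local_rq_credits, 0);
}
```

The value returned by rx_take_credits() is what an announcement would carry to the peer, which would then feed it to its own tx_put_credits().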
diff --git a/net/smc/smc_wr.h b/net/smc/smc_wr.h
index 47512ccce5ef..1104bcf1040a 100644
--- a/net/smc/smc_wr.h
+++ b/net/smc/smc_wr.h
@@ -19,7 +19,12 @@ 
 #include "smc.h"
 #include "smc_core.h"
 
-#define SMC_WR_BUF_CNT 16	/* # of ctrl buffers per link */
+#define SMC_WR_BUF_CNT 64	/* # of ctrl buffers per link. SMC_WR_BUF_CNT
+				 * should not be less than 2 * SMC_RMBS_PER_LGR_MAX,
+				 * since every connection has at least two rq/sq
+				 * credits on average; otherwise senders may have
+				 * to wait for credits.
+				 */
 
 #define SMC_WR_TX_WAIT_FREE_SLOT_TIME	(10 * HZ)
 
@@ -83,6 +88,51 @@  static inline void smc_wr_wakeup_reg_wait(struct smc_link *lnk)
 	wake_up(&lnk->wr_reg_wait);
 }
 
+// get one tx credit, decrementing the peer rq credits
+static inline int smc_wr_tx_get_credit(struct smc_link *link)
+{
+	return !link->credits_enable || atomic_dec_if_positive(&link->peer_rq_credits) >= 0;
+}
+
+// put tx credits back after a failure that follows taking a tx credit,
+// or when an announce credits msg is received
+static inline void smc_wr_tx_put_credits(struct smc_link *link, int credits, bool wakeup)
+{
+	if (link->credits_enable && credits) {
+		atomic_add(credits, &link->peer_rq_credits);
+		if (wakeup && wq_has_sleeper(&link->wr_tx_wait))
+			wake_up_nr(&link->wr_tx_wait, credits);
+	}
+}
+
+// check whether peer rq credits have dropped to or below the low watermark
+static inline int smc_wr_tx_credits_need_announce(struct smc_link *link)
+{
+	return link->credits_enable &&
+		atomic_read(&link->peer_rq_credits) <= link->peer_cr_watermark_low;
+}
+
+// get local rq credits and reset them to zero;
+// may be called when announcing credits
+static inline int smc_wr_rx_get_credits(struct smc_link *link)
+{
+	return link->credits_enable ? atomic_xchg(&link->local_rq_credits, 0) : 0;
+}
+
+// called after posting an rqe with ib_post_recv()
+static inline void smc_wr_rx_put_credits(struct smc_link *link, int credits)
+{
+	if (link->credits_enable && credits)
+		atomic_add(credits, &link->local_rq_credits);
+}
+
+// check whether local rq credits have reached the high watermark
+static inline int smc_wr_rx_credits_need_announce(struct smc_link *link)
+{
+	return link->credits_enable &&
+		atomic_read(&link->local_rq_credits) >= link->local_cr_watermark_high;
+}
+
 /* post a new receive work request to fill a completed old work request entry */
 static inline int smc_wr_rx_post(struct smc_link *link)
 {
@@ -95,6 +145,8 @@  static inline int smc_wr_rx_post(struct smc_link *link)
 	index = do_div(temp_wr_id, link->wr_rx_cnt);
 	link->wr_rx_ibs[index].wr_id = wr_id;
 	rc = ib_post_recv(link->roce_qp, &link->wr_rx_ibs[index], NULL);
+	if (!rc)
+		smc_wr_rx_put_credits(link, 1);
 	return rc;
 }