
[net-next,v2,01/10] net/smc: remove locks smc_client_lgr_pending and smc_server_lgr_pending

Message ID 688d165fe630989665e5091a28a5b1238123fbdc.1661407821.git.alibuda@linux.alibaba.com (mailing list archive)
State Superseded
Series optimize the parallelism of SMC-R connections

Commit Message

D. Wythe Aug. 26, 2022, 9:51 a.m. UTC
From: "D. Wythe" <alibuda@linux.alibaba.com>

This patch attempts to remove the locks smc_client_lgr_pending and
smc_server_lgr_pending, which serialize the creation of link
groups. Once a link group already exists, those locks are
pointless; worse still, they force incoming connections to be
queued one after the other.

Now, link groups are no longer created through competition, but
allocated according to the following strategy:

1. Try to find a suitable link group. If that succeeds, the current
connection is treated as a non-first-contact connection. Done.

2. Check the number of connections currently waiting for a suitable
link group to be created. If it is not less than the number of link
groups being created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
increase the number of link groups to be created; the current connection
is treated as the first contact connection. Done.

3. Increase the number of connections currently waiting, and wait
to be woken up.

4. Decrease the number of connections currently waiting, then go to 1.

We wake up the connections that were put to sleep in step 3 through
the SMC link state change event. Once the link moves out of the
SMC_LNK_ACTIVATING state, decrease the number of link groups to
be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
connections.

In the implementation, we introduce the concept of a link cluster, which
is a collection of links with the same characteristics (see
smcr_lnk_cluster_cmpfn() for more details); this makes it possible to
wake up waiters efficiently in the N-to-1 scenario.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c   |  13 +-
 net/smc/smc_core.c | 352 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/smc/smc_core.h |  53 ++++++++
 net/smc/smc_llc.c  |   9 +-
 4 files changed, 411 insertions(+), 16 deletions(-)
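
The four-step strategy and the wake-up rule from the commit message can be modelled in a few lines of userspace C. This is only a sketch of the counting logic, not the patch's code: the struct, the enum and both function names are hypothetical, and the value 255 for SMC_RMBS_PER_LGR_MAX is an assumption (the commit message only names the constant). Locking and the actual sleep/wake mechanism are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value; the commit message only names the constant. */
#define SMC_RMBS_PER_LGR_MAX 255

/* Hypothetical per-cluster bookkeeping; field names are illustrative. */
struct lnk_cluster_model {
	unsigned int lgrs_pending;  /* link groups currently being created */
	unsigned int conns_waiting; /* connections sleeping in step 3 */
};

enum conn_role { CONN_REUSE_LGR, CONN_FIRST_CONTACT, CONN_MUST_WAIT };

/* Steps 1-3: decide what an incoming connection should do. */
static enum conn_role classify_conn(struct lnk_cluster_model *c, bool lgr_found)
{
	if (lgr_found)
		return CONN_REUSE_LGR;          /* step 1 */
	if (c->conns_waiting >=
	    c->lgrs_pending * (SMC_RMBS_PER_LGR_MAX - 1)) {
		c->lgrs_pending++;              /* step 2 */
		return CONN_FIRST_CONTACT;
	}
	c->conns_waiting++;                     /* step 3 */
	return CONN_MUST_WAIT;
}

/*
 * A link left SMC_LNK_ACTIVATING: one pending link group is finished.
 * Returns how many sleepers to wake. Each woken connection performs
 * step 4, modelled here by decrementing conns_waiting, and retries.
 */
static unsigned int on_link_left_activating(struct lnk_cluster_model *c)
{
	unsigned int budget = SMC_RMBS_PER_LGR_MAX - 1;
	unsigned int woken;

	assert(c->lgrs_pending > 0);
	c->lgrs_pending--;
	woken = c->conns_waiting < budget ? c->conns_waiting : budget;
	c->conns_waiting -= woken;
	return woken;
}
```

With both counters at zero, the first connection that finds no link group becomes the first contact connection; the next one has to wait, because a single pending link group is expected to serve up to SMC_RMBS_PER_LGR_MAX - 1 further connections.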

Comments

Jan Karcher Aug. 29, 2022, 2:48 p.m. UTC | #1
On 26.08.2022 11:51, D. Wythe wrote:
> [...]

Thank you for the v2.
I'm going to start testing and give you feedback ASAP.

- Jan
Jan Karcher Aug. 31, 2022, 3:04 p.m. UTC | #2
On 26.08.2022 11:51, D. Wythe wrote:
> [...]

Hello D.,

thanks for the v2 and for your patience.
I got to testing and, as with v1, I want to share our findings with you.
If you need more information or want us to look deeper into the findings,
please let us know.

Regarding the SMC-R test suite:
We see a refcount error during one of our stress tests. This leads us to
believe that the smc_link_cluster_put() and smc_link_cluster_hold() calls
are no longer balanced.
The patch provided by yacan does fix this issue, but we did not verify
whether it is the right way to balance the hold and put calls.
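
As a reminder of what an unbalanced hold/put pair looks like, here is a toy userspace model. It is not the SMC code: struct cluster_ref, cluster_hold() and cluster_put() are hypothetical names, and the kernel's refcount_t does considerably more (saturation, atomics). The sketch only shows that one put without a matching hold is enough to trip an underflow check like the one in the log below.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the object guarded by hold()/put(). */
struct cluster_ref {
	int refs;        /* outstanding references */
	bool underflow;  /* set when a put had no matching hold */
};

/* Take one reference. */
static void cluster_hold(struct cluster_ref *r)
{
	r->refs++;
}

/*
 * Drop one reference. Returns true when this put released the last
 * reference; flags an underflow instead of going below zero.
 */
static bool cluster_put(struct cluster_ref *r)
{
	if (r->refs <= 0) {
		r->underflow = true;
		return false;
	}
	return --r->refs == 0;
}
```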

[root@t8345011 ~]# journalctl --dmesg | tail -100
Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown started
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SINGLE, pnetid NET25
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: ASYMMETRIC_PEER, pnetid NET25
Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SYMMETRIC, pnetid NET25
Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here ]------------
Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow; use-after-free.
Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables x_tables ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm: kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events smc_llc_add_link_work [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000 000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
Aug 31 16:17:55 t8345011.lnxne.boe kernel:            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea 0000000000000027 0000000000000026 000000005c3151e0
Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000fee80000 0000038000000001 000000008e0e9a00 000000008de79c24
Aug 31 16:17:55 t8345011.lnxne.boe kernel:            0000038000000000 000003ff803f05ac 0000000095038360 000000008de79c00
Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000828ca100 0000000095038360 000000005b31f328 0000038000943b50
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code: 000000005b31f31c: c02000466122        larl        %r2,000000005bbeb560
                                                       000000005b31f322: c0e500232e53        brasl        %r14,000000005b784fc8
                                                      #000000005b31f328: af000000                mc        0,0
                                                      >000000005b31f32c: a7280001                lhi        %r2,1
                                                       000000005b31f330: ebeff0a00004        lmg        %r14,%r15,160(%r15)
                                                       000000005b31f336: ec223fbf0055        risbg        %r2,%r2,63,191,0
                                                       000000005b31f33c: 07fe                bcr        15,%r14
                                                       000000005b31f33e: 47000700                bc        0,1792
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b31f32c>] refcount_dec_not_one+0x8c/0xa8
Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>] refcount_dec_not_one+0x88/0xa8)
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803ef16a>] smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803f05ac>] smcr_link_clear+0x5c/0x1b0 [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803fadf4>] smc_llc_add_link_work+0x43c/0x470 [smc]
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f0e2>] process_one_work+0x1fa/0x478
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f88c>] worker_thread+0x64/0x468
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac28580>] kthread+0x108/0x110
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005abaf2dc>] __ret_from_fork+0x3c/0x58
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b7a4d6a>] ret_from_fork+0xa/0x40
Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b785028>] __warn_printk+0x60/0x68
Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace 0000000000000000 ]---
Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
[root@t8345011 ~]#



Regarding the SMC-D test suite:
For SMC-D we also see errors during another stress test. While we expect
connections to fall back to TCP due to the limit on parallel connections,
your patch introduces TCP fallbacks with a new reason.

[root@t8345011 ~]# journalctl --dmesg | tail -10
Aug 31 16:30:07 t8345011.lnxne.boe smc-tests: test_oob7_send_multi_urg_at_start started
Aug 31 16:30:16 t8345011.lnxne.boe smc-tests: test_oob8_ignore_some_urg_data started
Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second started
Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test started
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SINGLE, pnetid NET25
Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP: Possible SYN flooding on port 51897. Sending cookies.  Check SNMP counters.
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SYMMETRIC, pnetid NET25

^
I am wondering why we see SMC-R dmesg output even though we communicate
over SMC-D. We have to verify that; it may be an error on our side.

[root@t8345011 ~]#
[root@t8345011 ~]# smcss
ACTIVE         00000 0067005 10.25.45.10:48096       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0067001 10.25.45.10:48060       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0066999 10.25.45.10:48054       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0068762 10.25.45.10:48046       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0066997 10.25.45.10:48044       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0068760 10.25.45.10:48036       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0066995 10.25.45.10:48026       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0068758 10.25.45.10:48024       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0066993 10.25.45.10:48022       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0068756 10.25.45.10:48006       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0066991 10.25.45.10:47998       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0068754 10.25.45.10:47984       10.25.45.11:51897     0000 SMCD
ACTIVE         00000 0067124 10.25.45.11:51897       10.25.45.10:48314     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0067121 10.25.45.11:51897       10.25.45.10:48302     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0067120 10.25.45.11:51897       10.25.45.10:48284     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0067114 10.25.45.11:51897       10.25.45.10:48282     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0067115 10.25.45.11:51897       10.25.45.10:48254     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0067111 10.25.45.11:51897       10.25.45.10:48250     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0066415 10.25.45.11:51897       10.25.45.10:48242     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0067113 10.25.45.11:51897       10.25.45.10:48230     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0066409 10.25.45.11:51897       10.25.45.10:48202     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0066413 10.25.45.11:51897       10.25.45.10:48214     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0066414 10.25.45.11:51897       10.25.45.10:48204     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0066397 10.25.45.11:51897       10.25.45.10:48120     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0066399 10.25.45.11:51897       10.25.45.10:48084     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0066396 10.25.45.11:51897       10.25.45.10:48078     0000 TCP 0x05000000/0x030d0000
ACTIVE         00000 0062632 10.25.45.11:51897       10.25.45.10:43120     0000 TCP 0x03010000
ACTIVE         00000 0062631 10.25.45.11:51897       10.25.45.10:43134     0000 TCP 0x03010000
ACTIVE         00000 0062626 10.25.45.11:51897       10.25.45.10:43106     0000 TCP 0x03010000
ACTIVE         00000 0062625 10.25.45.11:51897       10.25.45.10:43138     0000 TCP 0x03010000
ACTIVE         00000 0062621 10.25.45.11:51897       10.25.45.10:43160     0000 TCP 0x03010000
ACTIVE         00000 0061580 10.25.45.11:51897       10.25.45.10:42820     0000 TCP 0x03010000
ACTIVE         00000 0061558 10.25.45.11:51897       10.25.45.10:42792     0000 TCP 0x03010000
ACTIVE         00000 0061549 10.25.45.11:51897       10.25.45.10:42816     0000 TCP 0x03010000
ACTIVE         00000 0061548 10.25.45.11:51897       10.25.45.10:42764     0000 TCP 0x03010000
ACTIVE         00000 0061544 10.25.45.11:51897       10.25.45.10:42804     0000 TCP 0x03010000
ACTIVE         00000 0061543 10.25.45.11:51897       10.25.45.10:42856     0000 TCP 0x03010000
ACTIVE         00000 0061542 10.25.45.11:51897       10.25.45.10:42756     0000 TCP 0x03010000
ACTIVE         00000 0062554 10.25.45.11:51897       10.25.45.10:42852     0000 TCP 0x03010000
ACTIVE         00000 0062553 10.25.45.11:51897       10.25.45.10:42844     0000 TCP 0x03010000
ACTIVE         00000 0062549 10.25.45.11:51897       10.25.45.10:42836     0000 TCP 0x03010000

^
Here SMCD and 0x05000000/0x030d0000 are expected. But:
   [353] smcss confirmed connection of type SMCD
   [353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
We also observe that the lsmod use count stays above 2 even after the
testcase has finished, and that it takes quite a while before it goes down
again (we send a kill signal at the end of our testcase).

During test (which is fine)

[root@t8345011 ~]# lsmod | grep smc
smc_diag               16384  0
smc                   225280  2981 ism,smc_diag
ib_core               413696  3 smc,ib_uverbs,mlx5_ib

Count > 2 even after tests finish!

[root@t8345011 ~]# lsmod | grep smc
smc_diag               16384  0
smc                   225280  40 ism,smc_diag
ib_core               413696  3 smc,ib_uverbs,mlx5_ib

Let us know if you need any more information.
Thanks, Jan
D. Wythe Sept. 2, 2022, 11:25 a.m. UTC | #3
On 8/31/22 11:04 PM, Jan Karcher wrote:
> 
> 
> On 26.08.2022 11:51, D. Wythe wrote:
>> [...]
> 
> Hello D.,
> 
> thanks for the v2 and the patience.
> I got to testing and as with v1 I want to share our findings with you. If you need more information or want us to look deeper into the findings please let us know.
> 
> Regarding SMC-R test-suite:
> We see a refcount error during one of our stress tests. This leads us to believe that the smc_link_cluster_put() and smc_link_cluster_hold() calls are no longer balanced.
> The patch provided by yacan does fix this issue, but we did not verify whether it is the right way to balance the hold and put calls.
> 
> [root@t8345011 ~]# journalctl --dmesg | tail -100
> Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown started
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SINGLE, pnetid NET25
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: ASYMMETRIC_PEER, pnetid NET25
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SYMMETRIC, pnetid NET25
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here ]------------
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow; use-after-free.
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables x_tables ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm: kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events smc_llc_add_link_work [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000 000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea 0000000000000027 0000000000000026 000000005c3151e0
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000fee80000 0000038000000001 000000008e0e9a00 000000008de79c24
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            0000038000000000 000003ff803f05ac 0000000095038360 000000008de79c00
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000828ca100 0000000095038360 000000005b31f328 0000038000943b50
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code: 000000005b31f31c: c02000466122        larl        %r2,000000005bbeb560
>                                                        000000005b31f322: c0e500232e53        brasl        %r14,000000005b784fc8
>                                                       #000000005b31f328: af000000                mc        0,0
>                                                       >000000005b31f32c: a7280001                lhi        %r2,1
>                                                        000000005b31f330: ebeff0a00004        lmg        %r14,%r15,160(%r15)
>                                                        000000005b31f336: ec223fbf0055        risbg        %r2,%r2,63,191,0
>                                                        000000005b31f33c: 07fe                bcr        15,%r14
>                                                        000000005b31f33e: 47000700                bc        0,1792
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b31f32c>] refcount_dec_not_one+0x8c/0xa8
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>] refcount_dec_not_one+0x88/0xa8)
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803ef16a>] smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803f05ac>] smcr_link_clear+0x5c/0x1b0 [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803fadf4>] smc_llc_add_link_work+0x43c/0x470 [smc]
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f0e2>] process_one_work+0x1fa/0x478
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f88c>] worker_thread+0x64/0x468
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac28580>] kthread+0x108/0x110
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005abaf2dc>] __ret_from_fork+0x3c/0x58
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b7a4d6a>] ret_from_fork+0xa/0x40
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b785028>] __warn_printk+0x60/0x68

Thank you for testing. I need to think about this; please give me some time.


> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace 0000000000000000 ]---
> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
> [root@t8345011 ~]#
> 
> 
> 
> Regarding SMC-D test-suite:
> For SMC-D we also see errors during another stress test. While we expect connections to fall back to TCP due to the limit of parallel connections your patch introduces TCP fallbacks with a new reason.
> 
> [root@t8345011 ~]# journalctl --dmesg | tail -10
> Aug 31 16:30:07 t8345011.lnxne.boe smc-tests: test_oob7_send_multi_urg_at_start started
> Aug 31 16:30:16 t8345011.lnxne.boe smc-tests: test_oob8_ignore_some_urg_data started
> Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second started
> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test started
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SINGLE, pnetid NET25
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP: Possible SYN flooding on port 51897. Sending cookies.  Check SNMP counters.
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SYMMETRIC, pnetid NET25
> 
> ^
> I am wondering why we see SMC-R dmesgs even if we communicate with SMC-D. Gotta verify that. Can be an error on our side.

This is very weird. Were there no such SMC-R dmesg messages before applying my patch?

I am not sure whether there is logic to downgrade SMC-D to SMC-R; maybe it is related to 0x03010000.
I need to check the code and will send out the reason as soon as possible.


> [root@t8345011 ~]#
> [root@t8345011 ~]# smcss
> ACTIVE         00000 0067005 10.25.45.10:48096       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0067001 10.25.45.10:48060       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0066999 10.25.45.10:48054       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0068762 10.25.45.10:48046       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0066997 10.25.45.10:48044       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0068760 10.25.45.10:48036       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0066995 10.25.45.10:48026       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0068758 10.25.45.10:48024       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0066993 10.25.45.10:48022       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0068756 10.25.45.10:48006       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0066991 10.25.45.10:47998       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0068754 10.25.45.10:47984       10.25.45.11:51897     0000 SMCD
> ACTIVE         00000 0067124 10.25.45.11:51897       10.25.45.10:48314     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0067121 10.25.45.11:51897       10.25.45.10:48302     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0067120 10.25.45.11:51897       10.25.45.10:48284     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0067114 10.25.45.11:51897       10.25.45.10:48282     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0067115 10.25.45.11:51897       10.25.45.10:48254     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0067111 10.25.45.11:51897       10.25.45.10:48250     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0066415 10.25.45.11:51897       10.25.45.10:48242     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0067113 10.25.45.11:51897       10.25.45.10:48230     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0066409 10.25.45.11:51897       10.25.45.10:48202     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0066413 10.25.45.11:51897       10.25.45.10:48214     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0066414 10.25.45.11:51897       10.25.45.10:48204     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0066397 10.25.45.11:51897       10.25.45.10:48120     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0066399 10.25.45.11:51897       10.25.45.10:48084     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0066396 10.25.45.11:51897       10.25.45.10:48078     0000 TCP 0x05000000/0x030d0000
> ACTIVE         00000 0062632 10.25.45.11:51897       10.25.45.10:43120     0000 TCP 0x03010000
> ACTIVE         00000 0062631 10.25.45.11:51897       10.25.45.10:43134     0000 TCP 0x03010000
> ACTIVE         00000 0062626 10.25.45.11:51897       10.25.45.10:43106     0000 TCP 0x03010000
> ACTIVE         00000 0062625 10.25.45.11:51897       10.25.45.10:43138     0000 TCP 0x03010000
> ACTIVE         00000 0062621 10.25.45.11:51897       10.25.45.10:43160     0000 TCP 0x03010000
> ACTIVE         00000 0061580 10.25.45.11:51897       10.25.45.10:42820     0000 TCP 0x03010000
> ACTIVE         00000 0061558 10.25.45.11:51897       10.25.45.10:42792     0000 TCP 0x03010000
> ACTIVE         00000 0061549 10.25.45.11:51897       10.25.45.10:42816     0000 TCP 0x03010000
> ACTIVE         00000 0061548 10.25.45.11:51897       10.25.45.10:42764     0000 TCP 0x03010000
> ACTIVE         00000 0061544 10.25.45.11:51897       10.25.45.10:42804     0000 TCP 0x03010000
> ACTIVE         00000 0061543 10.25.45.11:51897       10.25.45.10:42856     0000 TCP 0x03010000
> ACTIVE         00000 0061542 10.25.45.11:51897       10.25.45.10:42756     0000 TCP 0x03010000
> ACTIVE         00000 0062554 10.25.45.11:51897       10.25.45.10:42852     0000 TCP 0x03010000
> ACTIVE         00000 0062553 10.25.45.11:51897       10.25.45.10:42844     0000 TCP 0x03010000
> ACTIVE         00000 0062549 10.25.45.11:51897       10.25.45.10:42836     0000 TCP 0x03010000
> 
> ^
> Here SMCD and 0x05000000/0x030d0000 are expected. But:
>    [353] smcss confirmed connection of type SMCD
>    [353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
sysctl -w net.ipv4.tcp_syncookies=0

Can you retry your test with the setting above applied? When TCP detects a potential SYN flood,
it starts using SYN cookies to verify traffic. In that case SMC cannot work, which triggers a fallback with
error code 0x03010000.

This doesn't look like a problem that my patch causes directly. However, my patch removes the lock in
the handshake phase, which may increase the rate at which your test initiates connections.
But I can't be sure ...
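
For reading the smcss output, the three reason codes that appear in this thread can be decoded roughly as below. This is only a sketch: the values are the ones I recall from net/smc/smc_clc.h and should be checked against the tree under test, and both helper functions are hypothetical names made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decline codes as recalled from net/smc/smc_clc.h (assumption: verify
 * the values against the kernel tree you are testing). */
#define SMC_CLC_DECL_PEERNOSMC 0x03010000u /* peer did not indicate SMC */
#define SMC_CLC_DECL_MAX_DMB   0x030d0000u /* SMC-D DMB limit exceeded */
#define SMC_CLC_DECL_PEERDECL  0x05000000u /* peer declined during handshake */

/* Hypothetical helper: readable text for a fallback reason code. */
static const char *fallback_reason_str(uint32_t code)
{
	switch (code) {
	case SMC_CLC_DECL_PEERNOSMC:
		return "peer did not indicate SMC";
	case SMC_CLC_DECL_MAX_DMB:
		return "SMC-D DMB limit exceeded";
	case SMC_CLC_DECL_PEERDECL:
		return "peer declined during handshake";
	default:
		return "unknown";
	}
}

/* Hypothetical helper: true for the reasons the stress test expects,
 * i.e. the 0x05000000/0x030d0000 pair seen in the smcss listing. */
static bool fallback_reason_expected(uint32_t code)
{
	return code == SMC_CLC_DECL_MAX_DMB || code == SMC_CLC_DECL_PEERDECL;
}
```

With this mapping, 0x03010000 is the only code in the listing that the test does not expect, which matches the error reported at [353].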


> We also exeperience that the lsmod count stays above 2 even after the testcase finished and takes quite a while before it goes down again (we send a kill signal at the end of our testcase).

> 
> During test (which is fine)
> 
> [root@t8345011 ~]# lsmod | grep smc
> smc_diag               16384  0
> smc                   225280  2981 ism,smc_diag
> ib_core               413696  3 smc,ib_uverbs,mlx5_ib
> 
> Count > 2 even after tests finish!
> 
> [root@t8345011 ~]# lsmod | grep smc
> smc_diag               16384  0
> smc                   225280  40 ism,smc_diag
> ib_core               413696  3 smc,ib_uverbs,mlx5_ib

> Let us know if you need any more information.
> Thanks, Jan


This usually means that some connections have not actually been destroyed yet.
Can you try the following to see whether any connections remain?

smcd linkgroup; # or smcr, depending on the mode; if any remain, please show us the connection state (smcss -r or -d)

ps aux | grep D; # check whether any worker thread is hung; if so, please show us its /proc/$PID/stack.


D. Wythe
Thanks.
Jan Karcher Sept. 7, 2022, 8:10 a.m. UTC | #4
On 02.09.2022 13:25, D. Wythe wrote:
> 
> 
> On 8/31/22 11:04 PM, Jan Karcher wrote:
>>
>>
>> On 26.08.2022 11:51, D. Wythe wrote:
>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>
>>> This patch attempts to remove locks named smc_client_lgr_pending and
>>> smc_server_lgr_pending, which aim to serialize the creation of link
>>> group. However, once link group existed already, those locks are
>>> meaningless, worse still, they make incoming connections have to be
>>> queued one after the other.
>>>
>>> Now, the creation of link group is no longer generated by competition,
>>> but allocated through following strategy.
>>>
>>> 1. Try to find a suitable link group, if successd, current connection
>>> is considered as NON first contact connection. ends.
>>>
>>> 2. Check the number of connections currently waiting for a suitable
>>> link group to be created, if it is not less that the number of link
>>> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
>>> increase the number of link groups to be created, current connection
>>> is considered as the first contact connection. ends.
>>>
>>> 3. Increase the number of connections currently waiting, and wait
>>> for woken up.
>>>
>>> 4. Decrease the number of connections currently waiting, goto 1.
>>>
>>> We wake up the connection that was put to sleep in stage 3 through
>>> the SMC link state change event. Once the link moves out of the
>>> SMC_LNK_ACTIVATING state, decrease the number of link groups to
>>> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
>>> connections.
>>>
>>> In the iplementation, we introduce the concept of lnk cluster, which is
>>> a collection of links with the same characteristics (see
>>> smcr_lnk_cluster_cmpfn() with more details), which makes it possible to
>>> wake up efficiently in the scenario of N v.s 1.
>>>
>>> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>>
>> Hello D.,
>>
>> thanks for the v2 and the patience.
>> I got to testing and as with v1 I want to share our findings with you. 
>> If you need more information or want us to look deeper into the 
>> findings please let us know.
>>
>> Regarding SMC-R test-suite:
>> We see a refcount error during one of our stress tests. This lets us 
>> believe that the smc_link_cluster_put() to smc_link_cluster_hold() 
>> ratio is not right anymore.
>> The patch provided by yacan does fix this issue but we did not verify 
>> if it is the right way to balance the hold and put calls.
>>
>> [root@t8345011 ~]# journalctl --dmesg | tail -100
>> Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown 
>> started
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 
>> 1 link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 
>> 1 state changed: SINGLE, pnetid NET25
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 
>> 1 link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 
>> 1 state changed: ASYMMETRIC_PEER, pnetid NET25
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 
>> 1 link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 
>> 1 state changed: SYMMETRIC, pnetid NET25
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here 
>> ]------------
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow; 
>> use-after-free.
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at 
>> lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag 
>> tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib 
>> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct 
>> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set 
>> nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev 
>> s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables 
>> x_tables ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core 
>> des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 
>> sha1_s390 sha_common pkey zcrypt rng_core autofs4
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm: 
>> kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01 
>> 701 (z/VM 7.2.0)
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events 
>> smc_llc_add_link_work [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000 
>> 000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            R:0 T:1 IO:1 
>> EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea 
>> 0000000000000027 0000000000000026 000000005c3151e0
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000fee80000 
>> 0000038000000001 000000008e0e9a00 000000008de79c24
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            0000038000000000 
>> 000003ff803f05ac 0000000095038360 000000008de79c00
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000828ca100 
>> 0000000095038360 000000005b31f328 0000038000943b50
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code: 
>> 000000005b31f31c: c02000466122        larl        %r2,000000005bbeb560
>>                                                        
>> 000000005b31f322: c0e500232e53        brasl        %r14,000000005b784fc8
>>                                                       
>> #000000005b31f328: af000000                mc        0,0
>>                                                       
>> >000000005b31f32c: a7280001                lhi        %r2,1
>>                                                        
>> 000000005b31f330: ebeff0a00004        lmg        %r14,%r15,160(%r15)
>>                                                        
>> 000000005b31f336: ec223fbf0055        risbg        %r2,%r2,63,191,0
>>                                                        
>> 000000005b31f33c: 07fe                bcr        15,%r14
>>                                                        
>> 000000005b31f33e: 47000700                bc        0,1792
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b31f32c>] 
>> refcount_dec_not_one+0x8c/0xa8
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>] 
>> refcount_dec_not_one+0x88/0xa8)
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803ef16a>] 
>> smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803f05ac>] 
>> smcr_link_clear+0x5c/0x1b0 [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803fadf4>] 
>> smc_llc_add_link_work+0x43c/0x470 [smc]
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f0e2>] 
>> process_one_work+0x1fa/0x478
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f88c>] 
>> worker_thread+0x64/0x468
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac28580>] 
>> kthread+0x108/0x110
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005abaf2dc>] 
>> __ret_from_fork+0x3c/0x58
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b7a4d6a>] 
>> ret_from_fork+0xa/0x40
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b785028>] 
>> __warn_printk+0x60/0x68
> 
> Thank you for your test, I need to think about it, please give me some 
> time.
> 
> 
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace 
>> 0000000000000000 ]---
>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 
>> 1 link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>> [root@t8345011 ~]#
>>
>>
>>
>> Regarding SMC-D test-suite:
>> For SMC-D we also see errors during another stress test. While we 
>> expect connections to fall back to TCP due to the limit of parallel 
>> connections your patch introduces TCP fallbacks with a new reason.
>>
>> [root@t8345011 ~]# journalctl --dmesg | tail -10
>> Aug 31 16:30:07 t8345011.lnxne.boe smc-tests: 
>> test_oob7_send_multi_urg_at_start started
>> Aug 31 16:30:16 t8345011.lnxne.boe smc-tests: 
>> test_oob8_ignore_some_urg_data started
>> Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second 
>> started
>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test 
>> started
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 
>> 1 link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 
>> 1 state changed: SINGLE, pnetid NET25
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP: 
>> Possible SYN flooding on port 51897. Sending cookies.  Check SNMP 
>> counters.
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 
>> 1 link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 
>> 1 state changed: SYMMETRIC, pnetid NET25
>>
>> ^
>> I am wondering why we see SMC-R dmesgs even if we communicate with 
>> SMC-D. Gotta verify that. Can be an error on our side.
> 
> This is very weird, is there no such SMC-R dmesgs before apply my PATCH?
> 
> I am not sure if there is logic to downgrade SMC-D to SMC-R, maybe it's 
> has related to 0x03010000.
> I need to check the code, the reason will be sent out as soon as possible
> 
> 
>> [root@t8345011 ~]#
>> [root@t8345011 ~]# smcss
>> ACTIVE         00000 0067005 10.25.45.10:48096       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0067001 10.25.45.10:48060       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0066999 10.25.45.10:48054       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0068762 10.25.45.10:48046       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0066997 10.25.45.10:48044       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0068760 10.25.45.10:48036       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0066995 10.25.45.10:48026       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0068758 10.25.45.10:48024       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0066993 10.25.45.10:48022       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0068756 10.25.45.10:48006       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0066991 10.25.45.10:47998       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0068754 10.25.45.10:47984       10.25.45.11:51897 
>>     0000 SMCD
>> ACTIVE         00000 0067124 10.25.45.11:51897       10.25.45.10:48314 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0067121 10.25.45.11:51897       10.25.45.10:48302 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0067120 10.25.45.11:51897       10.25.45.10:48284 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0067114 10.25.45.11:51897       10.25.45.10:48282 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0067115 10.25.45.11:51897       10.25.45.10:48254 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0067111 10.25.45.11:51897       10.25.45.10:48250 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0066415 10.25.45.11:51897       10.25.45.10:48242 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0067113 10.25.45.11:51897       10.25.45.10:48230 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0066409 10.25.45.11:51897       10.25.45.10:48202 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0066413 10.25.45.11:51897       10.25.45.10:48214 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0066414 10.25.45.11:51897       10.25.45.10:48204 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0066397 10.25.45.11:51897       10.25.45.10:48120 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0066399 10.25.45.11:51897       10.25.45.10:48084 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0066396 10.25.45.11:51897       10.25.45.10:48078 
>>     0000 TCP 0x05000000/0x030d0000
>> ACTIVE         00000 0062632 10.25.45.11:51897       10.25.45.10:43120 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0062631 10.25.45.11:51897       10.25.45.10:43134 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0062626 10.25.45.11:51897       10.25.45.10:43106 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0062625 10.25.45.11:51897       10.25.45.10:43138 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0062621 10.25.45.11:51897       10.25.45.10:43160 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0061580 10.25.45.11:51897       10.25.45.10:42820 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0061558 10.25.45.11:51897       10.25.45.10:42792 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0061549 10.25.45.11:51897       10.25.45.10:42816 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0061548 10.25.45.11:51897       10.25.45.10:42764 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0061544 10.25.45.11:51897       10.25.45.10:42804 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0061543 10.25.45.11:51897       10.25.45.10:42856 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0061542 10.25.45.11:51897       10.25.45.10:42756 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0062554 10.25.45.11:51897       10.25.45.10:42852 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0062553 10.25.45.11:51897       10.25.45.10:42844 
>>     0000 TCP 0x03010000
>> ACTIVE         00000 0062549 10.25.45.11:51897       10.25.45.10:42836 
>>     0000 TCP 0x03010000
>>
>> ^
>> Here SMCD and 0x05000000/0x030d0000 are expected. But:
>>    [353] smcss confirmed connection of type SMCD
>>    [353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
> sysctl -w net.ipv4.tcp_syncookies=0
> 
> Can you retry your test after set above configure? When TCP detects a 
> potential flooding attack,
> it will starts syn-cookies to verify traffic. In this case, SMC can't 
> work, and then triggering a fallback with
> error code 0x03010000.
> 
> This doesn't seem to be the problem that my PATCH can cause, but my 
> PATCH removes the lock in
> the handshake phase, which may speed up the frequency of your test 
> initiating connections,
> But I can't be sure ...
> 
> 
>> We also exeperience that the lsmod count stays above 2 even after the 
>> testcase finished and takes quite a while before it goes down again 
>> (we send a kill signal at the end of our testcase).
> 
>>
>> During test (which is fine)
>>
>> [root@t8345011 ~]# lsmod | grep smc
>> smc_diag               16384  0
>> smc                   225280  2981 ism,smc_diag
>> ib_core               413696  3 smc,ib_uverbs,mlx5_ib
>>
>> Count > 2 even after tests finish!
>>
>> [root@t8345011 ~]# lsmod | grep smc
>> smc_diag               16384  0
>> smc                   225280  40 ism,smc_diag
>> ib_core               413696  3 smc,ib_uverbs,mlx5_ib
> 
>> Let us know if you need any more information.
>> Thanks, Jan
> 
> 
> This usually means that there are still connections that are not really 
> destroyed,
> can you try this and to see if there are any remaining connections?
> 
> smcd linkgroup; #or smcr, it depends, if any, can you show us the 
> connection state (smcss -r or -d)
> 
> ps aux | grep D; # check if there is work thread hungs, if any, please 
> show us the /proc/$PID/stack.
> 
> 
> D. Wythe
> Thanks.
> 

Thank you for the tip about the syncookies. We disabled them on both 
systems; here is the new output, which should be less confusing.

For your understanding of the output:
t8345010 is the system driving the tests and is acting as the client in 
this testcase.
t8345011 is the pair system and the server of this test.
We spawn a large number of connections between the two systems to see 
what happens when the system is under stress (in terms of connection 
handling).

We see the following:
- The driver falls back to SMC-R on many occasions. This should not 
happen. Also note the mismatch in the number of connections handled. 
There were no other connections besides those of the test.

   T8345010
   > SMC-D Connections Summary
   >   Total connections handled          1012
   > SMC-R Connections Summary
   >   Total connections handled          1512

   T8345011
   > SMC-D Connections Summary
   >   Total connections handled          1190
   > SMC-R Connections Summary
   >   Total connections handled          1513


- Linkgroups for the SMCD & SMCR connections are being built up.

   T8345011
   > [root@t8345011 ~]# smcd linkgroup
   > LG-ID    VLAN  #Conns  PNET-ID
   > 00000300    0      37  NET25
   > [root@t8345011 ~]# smcr linkgroup
   > LG-ID    LG-Role  LG-Type  VLAN  #Conns  PNET-ID
   > 00000400 SERV     SYM         0       0  NET25
   > [ 5 more LG 0500-0900]


- Linkgroups for the SMCD & SMCR connections are being torn down once 
the clients finish.
- ALL SMCR linkgroups are being cleared completely as expected. They 
remain, empty, for a while, which is fine.
- The SMCD linkgroups are NOT cleared all the way. A few connections 
stay in there (see output above).
- If we run smcss on the server side, those connections are listed 
there as ACTIVE, while the smcss list on the client side is empty.

   T8345011
   > [root@t8345011 ~]# smcss
   > State          UID   Inode   Local Address           Peer Address            Intf Mode
   > ACTIVE         00000 0100758 10.25.45.11:40237       10.25.45.10:55790       0000 SMCD
   > [ 36 more ACTIVE connections ]


- The remaining ACTIVE connections on the server are reflected in the 
smcd linkgroup #Conns as well.
- On the client the lsmod count for the smc module is 39, also 
reflecting the leftover connections.

   T8345010
   > [root@t8345010 tela-kernel]# lsmod |grep smc
   > smc                   225280  39 ism,smc_diag


- On the server the lsmod count for the smc module is 79.

   T8345011
   > [root@t8345011 ~]# lsmod | grep smc
   > smc                   225280  79 ism,smc_diag


- The most important smc_dbg outputs are provided and are showing that 
the client is pretty clean and the server is still handling ghost 
connections.

   T8345011
   > [root@t8345011 ~]# smc_dbg
   > State          UID   Inode   Local Address           Peer Address            Intf Mode GID              Token            Peer-GID         Peer-Token       Linkid
   > ACTIVE         00000 0100758 10.25.45.11:40237       10.25.45.10:55790       0000 SMCD 120014a12e488561 0000890fd0000000 3e0014a32e488561 00008a0bd0000000 00000300
   > State          UID   Inode   Local Address           Peer Address            Intf Mode Shutd Token    Sndbuf   Rcvbuf   Peerbuf  rxprod-Cursor rxcons-Cursor rxFlags txprod-Cursor txcons-Cursor txFlags txprep-Cursor txsent-Cursor txfin-Cursor
   > ACTIVE         00000 0100758 10.25.45.11:40237       10.25.45.10:55790       0000 SMCD  <->  00001611 00004000 0000ffe0 0000ffe0 0000:00000000 0000:00000000 00:00   0000:00000000 0000:00000000 00:00   0000:00000000 0000:00000000 0000:00000000


- Via netstat we see that the server is in the CLOSE_WAIT state for 
these connections and the client is in FIN_WAIT2.

   T8345010
   > [root@t8345010 tela-kernel]# netstat -nta
   > Proto Recv-Q Send-Q Local Address           Foreign Address         State
   > tcp        0      0 10.25.45.10:55790       10.25.45.11:40237       FIN_WAIT2

   T8345011
   > [root@t8345011 ~]# netstat -nta | grep "40237"
   > tcp        1      0 10.25.45.11:40237       10.25.45.10:55790       CLOSE_WAIT


As I'm pretty new to the mailing list, we had a discussion about how 
to provide the log data in a reasonable way.
To avoid flooding the list with information we decided to go for the 
short output above. If that is not enough for you, send me a message 
and I can send you the full output outside the mailing list.
If you have any ideas on how to provide larger output in a reasonable 
way, feel free to share your opinion.

I hope the new output helps you locate the error.
Feel free to contact us in case you have questions.
- Jan
D. Wythe Sept. 16, 2022, 5:16 a.m. UTC | #5
Hi, Jan

Thanks a lot for your testing, your information is very important!
Sorry for the late reply. This issue was well hidden, but the cause is a silly one.

Here's the problem:

In our patch we use ini->is_smcd to decide whether the lock can be skipped.
However, this value is only valid after smc_listen_find_device() has been invoked.
Before that, the value is always false (kzalloc). Unfortunately, we read it before that point.

In other words, we effectively removed the lock for SMC-D as well, but did not implement a link cluster for SMC-D.
This leads to a large number of critical-section problems, including the abnormal downgrade to SMC-R,
leftover connection state, etc.
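
The ordering problem can be reproduced outside the kernel with a small sketch. Everything here is a stand-in chosen for illustration: struct init_info is a hypothetical one-field cut-down of the real init info, find_device() only mimics the effect of smc_listen_find_device() on is_smcd, and calloc() plays the role of kzalloc().

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the init info; only the relevant field. */
struct init_info {
	bool is_smcd;
};

/* Stand-in for smc_listen_find_device(): is_smcd is valid only after it. */
static void find_device(struct init_info *ini, bool smcd_available)
{
	ini->is_smcd = smcd_available;
}

/* Buggy order: the locking decision reads is_smcd before the lookup. */
static bool needs_lock_buggy(bool smcd_available)
{
	struct init_info *ini = calloc(1, sizeof(*ini)); /* zeroed, like kzalloc() */
	bool take_lock;

	if (!ini)
		return true; /* be conservative if allocation fails */
	take_lock = ini->is_smcd; /* still the zero value: always false */
	find_device(ini, smcd_available);
	free(ini);
	return take_lock;
}

/* Fixed order: look up the device first, then decide about the lock. */
static bool needs_lock_fixed(bool smcd_available)
{
	struct init_info *ini = calloc(1, sizeof(*ini));
	bool take_lock;

	if (!ini)
		return true;
	find_device(ini, smcd_available);
	take_lock = ini->is_smcd; /* now reflects the selected device type */
	free(ini);
	return take_lock;
}
```

In the buggy order an SMC-D connection (smcd_available == true) is told it needs no lock, which is the unprotected SMC-D path described above.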

Considering that, it seems we have to implement the link cluster for SMC-D too. It won't take much time;
I expect to send a new version next week.

Thanks
D. Wythe

On 9/7/22 4:10 PM, Jan Karcher wrote:
> 
> 
> On 02.09.2022 13:25, D. Wythe wrote:
>>
>>
>> On 8/31/22 11:04 PM, Jan Karcher wrote:
>>>
>>>
>>> On 26.08.2022 11:51, D. Wythe wrote:
>>>> From: "D. Wythe" <alibuda@linux.alibaba.com>
>>>>
>>>> This patch attempts to remove locks named smc_client_lgr_pending and
>>>> smc_server_lgr_pending, which aim to serialize the creation of link
>>>> group. However, once link group existed already, those locks are
>>>> meaningless, worse still, they make incoming connections have to be
>>>> queued one after the other.
>>>>
>>>> Now, the creation of link group is no longer generated by competition,
>>>> but allocated through following strategy.
>>>>
>>>> 1. Try to find a suitable link group, if successd, current connection
>>>> is considered as NON first contact connection. ends.
>>>>
>>>> 2. Check the number of connections currently waiting for a suitable
>>>> link group to be created, if it is not less that the number of link
>>>> groups to be created multiplied by (SMC_RMBS_PER_LGR_MAX - 1), then
>>>> increase the number of link groups to be created, current connection
>>>> is considered as the first contact connection. ends.
>>>>
>>>> 3. Increase the number of connections currently waiting, and wait
>>>> for woken up.
>>>>
>>>> 4. Decrease the number of connections currently waiting, goto 1.
>>>>
>>>> We wake up the connection that was put to sleep in stage 3 through
>>>> the SMC link state change event. Once the link moves out of the
>>>> SMC_LNK_ACTIVATING state, decrease the number of link groups to
>>>> be created, and then wake up at most (SMC_RMBS_PER_LGR_MAX - 1)
>>>> connections.
>>>>
>>>> In the iplementation, we introduce the concept of lnk cluster, which is
>>>> a collection of links with the same characteristics (see
>>>> smcr_lnk_cluster_cmpfn() with more details), which makes it possible to
>>>> wake up efficiently in the scenario of N v.s 1.
>>>>
>>>> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
>>>
>>> Hello D.,
>>>
>>> thanks for the v2 and the patience.
>>> I got to testing and as with v1 I want to share our findings with you. If you need more information or want us to look deeper into the findings please let us know.
>>>
>>> Regarding SMC-R test-suite:
>>> We see a refcount error during one of our stress tests. This lets us believe that the smc_link_cluster_put() to smc_link_cluster_hold() ratio is not right anymore.
>>> The patch provided by yacan does fix this issue but we did not verify if it is the right way to balance the hold and put calls.
>>>
>>> [root@t8345011 ~]# journalctl --dmesg | tail -100
>>> Aug 31 16:17:36 t8345011.lnxne.boe smc-tests: test_smcapp_50x_ifdown started
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000101, peerid 00000101, ibdev mlx5_0, ibport 1
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SINGLE, pnetid NET25
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>>> Aug 31 16:17:46 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: ASYMMETRIC_PEER, pnetid NET25
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link added: id 00000104, peerid 00000104, ibdev mlx5_0, ibport 1
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 state changed: SYMMETRIC, pnetid NET25
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ------------[ cut here ]------------
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: refcount_t: underflow; use-after-free.
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: WARNING: CPU: 1 PID: 150 at lib/refcount.c:87 refcount_dec_not_one+0x88/0xa8
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Modules linked in: smc_diag tcp_diag inet_diag nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink mlx5_ib ism smc ib_uverbs ib_core vfio_ccw mdev s390_trng vfio_iommu_type1 vfio sch_fq_codel configfs ip_tables x_tables ghash_s390 prng chacha_s390 libchacha aes_s390 mlx5_core des_s390 libdes sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 sha_common pkey zcrypt rng_core autofs4
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: CPU: 1 PID: 150 Comm: kworker/1:2 Not tainted 6.0.0-rc2-00493-g91ecd751199f #8
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Hardware name: IBM 8561 T01 701 (z/VM 7.2.0)
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Workqueue: events smc_llc_add_link_work [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl PSW : 0704c00180000000 000000005b31f32c (refcount_dec_not_one+0x8c/0xa8)
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl GPRS: 00000000ffffffea 0000000000000027 0000000000000026 000000005c3151e0
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000fee80000 0000038000000001 000000008e0e9a00 000000008de79c24
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            0000038000000000 000003ff803f05ac 0000000095038360 000000008de79c00
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:            00000000828ca100 0000000095038360 000000005b31f328 0000038000943b50
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Krnl Code: 000000005b31f31c: c02000466122        larl        %r2,000000005bbeb560
>>> 000000005b31f322: c0e500232e53        brasl        %r14,000000005b784fc8
>>> #000000005b31f328: af000000                mc        0,0
>>> >000000005b31f32c: a7280001                lhi        %r2,1
>>> 000000005b31f330: ebeff0a00004        lmg        %r14,%r15,160(%r15)
>>> 000000005b31f336: ec223fbf0055        risbg        %r2,%r2,63,191,0
>>> 000000005b31f33c: 07fe                bcr        15,%r14
>>> 000000005b31f33e: 47000700                bc        0,1792
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Call Trace:
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b31f32c>] refcount_dec_not_one+0x8c/0xa8
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ([<000000005b31f328>] refcount_dec_not_one+0x88/0xa8)
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803ef16a>] smcr_link_cluster_on_link_state.part.0+0x1ba/0x440 [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803f05ac>] smcr_link_clear+0x5c/0x1b0 [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000003ff803fadf4>] smc_llc_add_link_work+0x43c/0x470 [smc]
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f0e2>] process_one_work+0x1fa/0x478
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac1f88c>] worker_thread+0x64/0x468
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005ac28580>] kthread+0x108/0x110
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005abaf2dc>] __ret_from_fork+0x3c/0x58
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b7a4d6a>] ret_from_fork+0xa/0x40
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: Last Breaking-Event-Address:
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel:  [<000000005b785028>] __warn_printk+0x60/0x68
>>
>> Thank you for your test, I need to think about it, please give me some time.
>>
>>
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: ---[ end trace 0000000000000000 ]---
>>> Aug 31 16:17:55 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000100 net 1 link removed: id 00000103, peerid 00000103, ibdev mlx5_0, ibport 1
>>> [root@t8345011 ~]#
>>>
>>>
>>>
>>> Regarding SMC-D test-suite:
>>> For SMC-D we also see errors during another stress test. While we expect connections to fall back to TCP due to the limit of parallel connections your patch introduces TCP fallbacks with a new reason.
>>>
>>> [root@t8345011 ~]# journalctl --dmesg | tail -10
>>> Aug 31 16:30:07 t8345011.lnxne.boe smc-tests: test_oob7_send_multi_urg_at_start started
>>> Aug 31 16:30:16 t8345011.lnxne.boe smc-tests: test_oob8_ignore_some_urg_data started
>>> Aug 31 16:30:30 t8345011.lnxne.boe smc-tests: test_smc_tool_second started
>>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_tshark started
>>> Aug 31 16:30:34 t8345011.lnxne.boe smc-tests: test_smcapp_torture_test started
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000401, peerid 00000401, ibdev mlx5_0, ibport 1
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SINGLE, pnetid NET25
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: TCP: request_sock_TCP: Possible SYN flooding on port 51897. Sending cookies.  Check SNMP counters.
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 link added: id 00000402, peerid 00000402, ibdev mlx5_1, ibport 1
>>> Aug 31 16:30:49 t8345011.lnxne.boe kernel: smc: SMC-R lg 00000400 net 1 state changed: SYMMETRIC, pnetid NET25
>>>
>>> ^
>>> I am wondering why we see SMC-R dmesgs even if we communicate with SMC-D. Gotta verify that. Can be an error on our side.
>>
>> This is very weird. Were there no such SMC-R dmesg messages before applying my patch?
>>
>> I am not sure whether there is logic to downgrade SMC-D to SMC-R; maybe it is related to 0x03010000.
>> I need to check the code and will send out the reason as soon as possible.
>>
>>
>>> [root@t8345011 ~]#
>>> [root@t8345011 ~]# smcss
>>> ACTIVE         00000 0067005 10.25.45.10:48096       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0067001 10.25.45.10:48060       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0066999 10.25.45.10:48054       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0068762 10.25.45.10:48046       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0066997 10.25.45.10:48044       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0068760 10.25.45.10:48036       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0066995 10.25.45.10:48026       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0068758 10.25.45.10:48024       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0066993 10.25.45.10:48022       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0068756 10.25.45.10:48006       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0066991 10.25.45.10:47998       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0068754 10.25.45.10:47984       10.25.45.11:51897     0000 SMCD
>>> ACTIVE         00000 0067124 10.25.45.11:51897       10.25.45.10:48314     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0067121 10.25.45.11:51897       10.25.45.10:48302     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0067120 10.25.45.11:51897       10.25.45.10:48284     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0067114 10.25.45.11:51897       10.25.45.10:48282     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0067115 10.25.45.11:51897       10.25.45.10:48254     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0067111 10.25.45.11:51897       10.25.45.10:48250     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0066415 10.25.45.11:51897       10.25.45.10:48242     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0067113 10.25.45.11:51897       10.25.45.10:48230     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0066409 10.25.45.11:51897       10.25.45.10:48202     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0066413 10.25.45.11:51897       10.25.45.10:48214     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0066414 10.25.45.11:51897       10.25.45.10:48204     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0066397 10.25.45.11:51897       10.25.45.10:48120     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0066399 10.25.45.11:51897       10.25.45.10:48084     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0066396 10.25.45.11:51897       10.25.45.10:48078     0000 TCP 0x05000000/0x030d0000
>>> ACTIVE         00000 0062632 10.25.45.11:51897       10.25.45.10:43120     0000 TCP 0x03010000
>>> ACTIVE         00000 0062631 10.25.45.11:51897       10.25.45.10:43134     0000 TCP 0x03010000
>>> ACTIVE         00000 0062626 10.25.45.11:51897       10.25.45.10:43106     0000 TCP 0x03010000
>>> ACTIVE         00000 0062625 10.25.45.11:51897       10.25.45.10:43138     0000 TCP 0x03010000
>>> ACTIVE         00000 0062621 10.25.45.11:51897       10.25.45.10:43160     0000 TCP 0x03010000
>>> ACTIVE         00000 0061580 10.25.45.11:51897       10.25.45.10:42820     0000 TCP 0x03010000
>>> ACTIVE         00000 0061558 10.25.45.11:51897       10.25.45.10:42792     0000 TCP 0x03010000
>>> ACTIVE         00000 0061549 10.25.45.11:51897       10.25.45.10:42816     0000 TCP 0x03010000
>>> ACTIVE         00000 0061548 10.25.45.11:51897       10.25.45.10:42764     0000 TCP 0x03010000
>>> ACTIVE         00000 0061544 10.25.45.11:51897       10.25.45.10:42804     0000 TCP 0x03010000
>>> ACTIVE         00000 0061543 10.25.45.11:51897       10.25.45.10:42856     0000 TCP 0x03010000
>>> ACTIVE         00000 0061542 10.25.45.11:51897       10.25.45.10:42756     0000 TCP 0x03010000
>>> ACTIVE         00000 0062554 10.25.45.11:51897       10.25.45.10:42852     0000 TCP 0x03010000
>>> ACTIVE         00000 0062553 10.25.45.11:51897       10.25.45.10:42844     0000 TCP 0x03010000
>>> ACTIVE         00000 0062549 10.25.45.11:51897       10.25.45.10:42836     0000 TCP 0x03010000
>>>
>>> ^
>>> Here SMCD and 0x05000000/0x030d0000 are expected. But:
>>>    [353] smcss confirmed connection of type SMCD
>>>    [353] Error: Found TCP fallback due to unexpected reasons: 0x03010000
>> sysctl -w net.ipv4.tcp_syncookies=0
>>
>> Can you retry your test after applying the setting above? When TCP detects a potential SYN flooding attack,
>> it starts using SYN cookies to verify traffic. In this case SMC cannot work, which triggers a fallback with
>> error code 0x03010000.
>>
>> This does not look like a problem that my patch itself can cause, but my patch removes the lock in
>> the handshake phase, which may increase the rate at which your test initiates connections.
>> But I can't be sure ...
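
For readers decoding the smcss output quoted above, the fallback reason codes can be mapped to names. The following is a minimal sketch, assuming the values match the SMC_CLC_DECL_* definitions in net/smc/smc_clc.h of the tree under test (worth verifying there) and assuming that the first value in a pair such as 0x05000000/0x030d0000 is the local reason and the second the peer diagnosis. The helpers smc_decl_reason_str() and fallback_is_expected() are hypothetical names for illustration and are not part of the kernel or smc-tools.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values as recalled from net/smc/smc_clc.h; treat them as an assumption. */
#define SMC_CLC_DECL_PEERNOSMC	0x03010000u	/* peer did not indicate SMC */
#define SMC_CLC_DECL_MAX_DMB	0x030d0000u	/* SMC-D DMB limit exceeded */
#define SMC_CLC_DECL_SYNCERR	0x04000000u	/* synchronization error */
#define SMC_CLC_DECL_PEERDECL	0x05000000u	/* peer declined during handshake */

/* Hypothetical helper: translate a reason code into a short description. */
const char *smc_decl_reason_str(uint32_t code)
{
	switch (code) {
	case SMC_CLC_DECL_PEERNOSMC:
		return "peer did not indicate SMC";
	case SMC_CLC_DECL_MAX_DMB:
		return "SMC-D DMB limit exceeded";
	case SMC_CLC_DECL_SYNCERR:
		return "synchronization error";
	case SMC_CLC_DECL_PEERDECL:
		return "peer declined during handshake";
	default:
		return "unknown";
	}
}

/* Hypothetical helper: the stress test in this thread only expects
 * fallbacks where the peer declined because its DMB limit was hit.
 */
int fallback_is_expected(uint32_t local, uint32_t peer)
{
	return local == SMC_CLC_DECL_PEERDECL && peer == SMC_CLC_DECL_MAX_DMB;
}
```

With these definitions, the pair 0x05000000/0x030d0000 is classified as expected, while a bare 0x03010000 is not, which matches the test's complaint.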
>>
>>
>>> We also experience that the lsmod count stays above 2 even after the testcase has finished, and it takes quite a while before it goes down again (we send a kill signal at the end of our testcase).
>>
>>>
>>> During test (which is fine)
>>>
>>> [root@t8345011 ~]# lsmod | grep smc
>>> smc_diag               16384  0
>>> smc                   225280  2981 ism,smc_diag
>>> ib_core               413696  3 smc,ib_uverbs,mlx5_ib
>>>
>>> Count > 2 even after tests finish!
>>>
>>> [root@t8345011 ~]# lsmod | grep smc
>>> smc_diag               16384  0
>>> smc                   225280  40 ism,smc_diag
>>> ib_core               413696  3 smc,ib_uverbs,mlx5_ib
>>
>>> Let us know if you need any more information.
>>> Thanks, Jan
>>
>>
>> This usually means that there are still connections that have not really been destroyed.
>> Can you try the following to see whether any connections remain?
>>
>> smcd linkgroup; # or smcr, depending on the mode; if any remain, please show us the connection state (smcss -r or -d)
>>
>> ps aux | grep D; # check whether any worker thread hangs; if so, please show us its /proc/$PID/stack.
>>
>>
>> D. Wythe
>> Thanks.
>>
> 
> Thank you for the tip about the syncookies. We disabled them on both systems, and here is the new output, which should not be as confusing.
> 
> For your understanding of the output:
> t8345010 is the system driving the tests and is acting as the client in this testcase.
> t8345011 is the pair system and the server of this test.
> What we are doing is spawning a lot of connections between the two systems to see what happens when the system is under stress (in terms of connection handling).
> 
> We see the following:
> - The driver falls back to SMCR on many occasions. This should not happen. Also note the mismatch in the number of connections handled. There were no other connections besides the test.
> 
>    T8345010
>    > SMC-D Connections Summary
>    >   Total connections handled          1012
>    > SMC-R Connections Summary
>    >   Total connections handled          1512
> 
>    T8345011
>    > SMC-D Connections Summary
>    >   Total connections handled          1190
>    > SMC-R Connections Summary
>    >   Total connections handled          1513
> 
> 
> - Linkgroups for the SMCD & SMCR connections are being built up.
> 
>    T8345011
>    > [root@t8345011 ~]# smcd linkgroup
>    > LG-ID    VLAN  #Conns  PNET-ID
>    > 00000300    0      37  NET25
>    > [root@t8345011 ~]# smcr linkgroup
>    > LG-ID    LG-Role  LG-Type  VLAN  #Conns  PNET-ID
>    > 00000400 SERV     SYM         0       0  NET25
>    > [ 5 more LG 0500-0900]
> 
> 
> - Linkgroups for the SMCD & SMCR connections are being torn down once the clients finish.
> - ALL SMCR linkgroups are being cleared completely as expected. They still reside empty for a while, which is fine.
> - The SMCD linkgroups are NOT cleared all the way. A few connections stay in there (see output above).
> - If we run smcss on the server side, those connections are listed there as ACTIVE, while the smcss list on the client side is empty.
> 
>    T8345011
>    > [root@t8345011 ~]# smcss
>    > State          UID   Inode   Local Address           Peer Address          Intf Mode
>    > ACTIVE         00000 0100758 10.25.45.11:40237 10.25.45.10:55790       0000 SMCD
>    > [ 36 more ACTIVE connections ]
> 
> 
> - The remaining ACTIVE connections on the server are reflected in the smcd linkgroup #Conns as well.
> - On the client the lsmod count for the smc module is 39, also reflecting the leftover connections.
> 
>    T8345010
>    > [root@t8345010 tela-kernel]# lsmod |grep smc
>    > smc                   225280  39 ism,smc_diag
> 
> 
> - On the server the lsmod count for the smc module is 79.
> 
>    T8345011
>    > [root@t8345011 ~]# lsmod | grep smc
>    > smc                   225280  79 ism,smc_diag
> 
> 
> - The most important smc_dbg outputs are provided; they show that the client is pretty clean and the server is still handling ghost connections.
> 
>    T8345011
>    > [root@t8345011 ~]# smc_dbg
>    > State          UID   Inode   Local Address           Peer Address          Intf Mode GID              Token            Peer-GID Peer-Token       Linkid
>    > ACTIVE         00000 0100758 10.25.45.11:40237 10.25.45.10:55790       0000 SMCD 120014a12e488561 0000890fd0000000 3e0014a32e488561 00008a0bd0000000 00000300
>    > State          UID   Inode   Local Address           Peer Address          Intf Mode Shutd Token    Sndbuf   Rcvbuf   Peerbuf rxprod-Cursor rxcons-Cursor rxFlags txprod-Cursor txcons-Cursor txFlags txprep-Cursor txsent-Cursor txfin-Cursor
>    > ACTIVE         00000 0100758 10.25.45.11:40237 10.25.45.10:55790       0000 SMCD  <->  00001611 00004000 0000ffe0 0000ffe0 0000:00000000 0000:00000000 00:00   0000:00000000 0000:00000000 00:00   0000:00000000 0000:00000000 0000:00000000
> 
> 
> - Via netstat we see that the server is in CLOSE_WAIT state for the connections and the client is in FIN_WAIT2.
> 
>    T8345010
>    > [root@t8345010 tela-kernel]# netstat -nta
>    > Proto Recv-Q Send-Q Local Address           Foreign Address State
>    > tcp        0      0 10.25.45.10:55790       10.25.45.11:40237 FIN_WAIT2
>    T8345011
>    > [root@t8345011 ~]# netstat -nta | grep "40237"
>    > tcp        1      0 10.25.45.11:40237       10.25.45.10:55790 CLOSE_WAIT
> 
> 
> While I'm pretty new to the mailing list, we had a discussion about how to provide the log data in a reasonable way.
> To avoid too much information, we decided to go for the short output above. If that is not enough for you, send me a message and I can send you the full output outside the mailing list.
> If you have any ideas on how to provide larger output in a reasonable way, feel free to share your opinion.
> 
> I hope the new output helps you locate the error.
> Feel free to contact us in case you have questions.
> - Jan

Patch

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 79c1318..d0e6bec 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1194,10 +1194,8 @@  static int smc_connect_rdma(struct smc_sock *smc,
 	if (reason_code)
 		return reason_code;
 
-	mutex_lock(&smc_client_lgr_pending);
 	reason_code = smc_conn_create(smc, ini);
 	if (reason_code) {
-		mutex_unlock(&smc_client_lgr_pending);
 		return reason_code;
 	}
 
@@ -1289,7 +1287,6 @@  static int smc_connect_rdma(struct smc_sock *smc,
 		if (reason_code)
 			goto connect_abort;
 	}
-	mutex_unlock(&smc_client_lgr_pending);
 
 	smc_copy_sock_settings_to_clc(smc);
 	smc->connect_nonblock = 0;
@@ -1299,7 +1296,6 @@  static int smc_connect_rdma(struct smc_sock *smc,
 	return 0;
 connect_abort:
 	smc_conn_abort(smc, ini->first_contact_local);
-	mutex_unlock(&smc_client_lgr_pending);
 	smc->connect_nonblock = 0;
 
 	return reason_code;
@@ -2377,7 +2373,8 @@  static void smc_listen_work(struct work_struct *work)
 	if (rc)
 		goto out_decl;
 
-	mutex_lock(&smc_server_lgr_pending);
+	if (ini->is_smcd)
+		mutex_lock(&smc_server_lgr_pending);
 	smc_close_init(new_smc);
 	smc_rx_init(new_smc);
 	smc_tx_init(new_smc);
@@ -2404,8 +2401,6 @@  static void smc_listen_work(struct work_struct *work)
 	rc = smc_clc_wait_msg(new_smc, cclc, sizeof(*buf),
 			      SMC_CLC_CONFIRM, CLC_WAIT_TIME);
 	if (rc) {
-		if (!ini->is_smcd)
-			goto out_unlock;
 		goto out_decl;
 	}
 
@@ -2415,7 +2410,6 @@  static void smc_listen_work(struct work_struct *work)
 					    ini->first_contact_local, ini);
 		if (rc)
 			goto out_unlock;
-		mutex_unlock(&smc_server_lgr_pending);
 	}
 	smc_conn_save_peer_info(new_smc, cclc);
 	smc_listen_out_connected(new_smc);
@@ -2423,7 +2417,8 @@  static void smc_listen_work(struct work_struct *work)
 	goto out_free;
 
 out_unlock:
-	mutex_unlock(&smc_server_lgr_pending);
+	if (ini->is_smcd)
+		mutex_unlock(&smc_server_lgr_pending);
 out_decl:
 	smc_listen_decline(new_smc, rc, ini ? ini->first_contact_local : 0,
 			   proposal_version);
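
The mutexes removed from the SMC-R paths in af_smc.c are replaced by counter-based admission in smc_conn_create(): a connection that finds no reusable link group either waits for a first contact that is already in flight or becomes a first contact itself. The sketch below is a simplified, single-threaded model of that accounting for the server side only. RMBS_PER_LGR_MAX is an illustrative small value standing in for SMC_RMBS_PER_LGR_MAX, and the struct and function names are hypothetical; locking, timeouts, and the lacking_first_contact path are left out.

```c
#include <assert.h>

#define RMBS_PER_LGR_MAX 4	/* illustrative stand-in, not the kernel value */

struct cluster {
	unsigned long pending_capability;	/* waiters the pending lgrs can absorb */
	unsigned long conns_pending;		/* connections currently waiting */
};

enum decision { DO_WAIT, DO_FIRST_CONTACT };

/* Server side, no reusable link group found: wait if a pending link
 * group still has room, otherwise start a new first contact.
 */
enum decision cluster_admit(struct cluster *c)
{
	if (c->pending_capability > c->conns_pending) {
		c->conns_pending++;
		return DO_WAIT;
	}
	c->pending_capability += RMBS_PER_LGR_MAX - 1;
	return DO_FIRST_CONTACT;
}

/* The first link left ACTIVATING: give back its capacity and return
 * the maximum number of waiters to wake.
 */
unsigned long cluster_link_ready(struct cluster *c)
{
	c->pending_capability -= RMBS_PER_LGR_MAX - 1;
	return RMBS_PER_LGR_MAX - 1;
}

/* A woken waiter stops counting as pending before it retries the lookup. */
void cluster_waiter_woken(struct cluster *c)
{
	c->conns_pending--;
}
```

In this model the first caller becomes first contact, the next RMBS_PER_LGR_MAX - 1 callers wait, and the caller after that starts a second link group instead of queueing behind the first one.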
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index ff49a11..cfaddf2 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -46,6 +46,10 @@  struct smc_lgr_list smc_lgr_list = {	/* established link groups */
 	.num = 0,
 };
 
+static struct smc_lgr_manager smc_lgr_manager = {
+	.lock = __SPIN_LOCK_UNLOCKED(smc_lgr_manager.lock),
+};
+
 static atomic_t lgr_cnt = ATOMIC_INIT(0); /* number of existing link groups */
 static DECLARE_WAIT_QUEUE_HEAD(lgrs_deleted);
 
@@ -55,6 +59,255 @@  static void smc_buf_free(struct smc_link_group *lgr, bool is_rmb,
 
 static void smc_link_down_work(struct work_struct *work);
 
+/* SMC-R lnk cluster compare func
+ * All lnks that meet the description conditions of this function
+ * are logically aggregated, called lnk cluster.
+ * For the server side, lnk cluster is used to determine whether
+ * a new group needs to be created when processing new incoming connections.
+ * For the client side, lnk cluster is used to determine whether
+ * to wait for link ready (in other words, first contact ready).
+ */
+static int smcr_link_cluster_cmpfn(struct rhashtable_compare_arg *arg, const void *obj)
+{
+	const struct smc_link_cluster_compare_arg *key = arg->key;
+	const struct smc_link_cluster *lnkc = obj;
+
+	if (memcmp(key->peer_systemid, lnkc->peer_systemid, SMC_SYSTEMID_LEN))
+		return 1;
+
+	if (memcmp(key->peer_gid, lnkc->peer_gid, SMC_GID_SIZE))
+		return 1;
+
+	if ((key->role == SMC_SERV || key->clcqpn == lnkc->clcqpn) &&
+	    (key->smcr_version == SMC_V2 ||
+	    !memcmp(key->peer_mac, lnkc->peer_mac, ETH_ALEN)))
+		return 0;
+
+	return 1;
+}
+
+/* SMC-R lnk cluster hash func */
+static u32 smcr_link_cluster_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct smc_link_cluster *lnkc = data;
+
+	return jhash2((u32 *)lnkc->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
+		+ ((lnkc->role == SMC_SERV) ? 0 : lnkc->clcqpn);
+}
+
+/* SMC-R lnk cluster compare arg hash func */
+static u32 smcr_link_cluster_compare_arg_hashfn(const void *data, u32 len, u32 seed)
+{
+	const struct smc_link_cluster_compare_arg *key = data;
+
+	return jhash2((u32 *)key->peer_systemid, SMC_SYSTEMID_LEN / sizeof(u32), seed)
+		+ ((key->role == SMC_SERV) ? 0 : key->clcqpn);
+}
+
+static const struct rhashtable_params smcr_link_cluster_rhl_params = {
+	.head_offset = offsetof(struct smc_link_cluster, rnode),
+	.key_len = sizeof(struct smc_link_cluster_compare_arg),
+	.obj_cmpfn = smcr_link_cluster_cmpfn,
+	.obj_hashfn = smcr_link_cluster_hashfn,
+	.hashfn = smcr_link_cluster_compare_arg_hashfn,
+	.automatic_shrinking = true,
+};
+
+/* hold a reference for smc_link_cluster */
+static inline void smc_link_cluster_hold(struct smc_link_cluster *lnkc)
+{
+	if (likely(lnkc))
+		refcount_inc(&lnkc->ref);
+}
+
+/* release a reference for smc_link_cluster */
+static inline void smc_link_cluster_put(struct smc_link_cluster *lnkc)
+{
+	bool do_free = false;
+
+	if (!lnkc)
+		return;
+
+	if (refcount_dec_not_one(&lnkc->ref))
+		return;
+
+	spin_lock_bh(&smc_lgr_manager.lock);
+	/* last ref */
+	if (refcount_dec_and_test(&lnkc->ref)) {
+		do_free = true;
+		rhashtable_remove_fast(&smc_lgr_manager.link_cluster_maps, &lnkc->rnode,
+				       smcr_link_cluster_rhl_params);
+	}
+	spin_unlock_bh(&smc_lgr_manager.lock);
+	if (do_free)
+		kfree(lnkc);
+}
+
+/* Get or create smc_link_cluster by key
+ * This function will hold a reference of returned smc_link_cluster
+ * or create a new smc_link_cluster with the reference initialized to 1.
+ * caller MUST call smc_link_cluster_put after this.
+ */
+static inline struct smc_link_cluster *
+smcr_link_get_or_create_cluster(struct smc_link_cluster_compare_arg *key)
+{
+	struct smc_link_cluster *lnkc;
+	int err;
+
+	spin_lock_bh(&smc_lgr_manager.lock);
+	lnkc = rhashtable_lookup_fast(&smc_lgr_manager.link_cluster_maps, key,
+				      smcr_link_cluster_rhl_params);
+	if (!lnkc) {
+		lnkc = kzalloc(sizeof(*lnkc), GFP_ATOMIC);
+		if (unlikely(!lnkc))
+			goto fail;
+
+		/* init cluster */
+		spin_lock_init(&lnkc->lock);
+		lnkc->role = key->role;
+		if (key->role == SMC_CLNT)
+			lnkc->clcqpn = key->clcqpn;
+		init_waitqueue_head(&lnkc->first_contact_waitqueue);
+		memcpy(lnkc->peer_systemid, key->peer_systemid, SMC_SYSTEMID_LEN);
+		memcpy(lnkc->peer_gid, key->peer_gid, SMC_GID_SIZE);
+		memcpy(lnkc->peer_mac, key->peer_mac, ETH_ALEN);
+		refcount_set(&lnkc->ref, 1);
+
+		err = rhashtable_insert_fast(&smc_lgr_manager.link_cluster_maps,
+					     &lnkc->rnode, smcr_link_cluster_rhl_params);
+		if (unlikely(err)) {
+			pr_warn_ratelimited("smc: rhashtable_insert_fast failed (%d)\n", err);
+			kfree(lnkc);
+			lnkc = NULL;
+		}
+	} else {
+		smc_link_cluster_hold(lnkc);
+	}
+fail:
+	spin_unlock_bh(&smc_lgr_manager.lock);
+	return lnkc;
+}
+
+/* Get or create a smc_link_cluster by lnk
+ * caller MUST call smc_link_cluster_put after this.
+ */
+static inline struct smc_link_cluster *smcr_link_get_cluster(struct smc_link *lnk)
+{
+	struct smc_link_cluster_compare_arg key;
+	struct smc_link_group *lgr;
+
+	lgr = lnk->lgr;
+	if (!lgr || lgr->is_smcd)
+		return NULL;
+
+	key.smcr_version = lgr->smc_version;
+	key.peer_systemid = lgr->peer_systemid;
+	key.peer_gid = lnk->peer_gid;
+	key.peer_mac = lnk->peer_mac;
+	key.role	 = lgr->role;
+	if (key.role == SMC_CLNT)
+		key.clcqpn = lnk->peer_qpn;
+
+	return smcr_link_get_or_create_cluster(&key);
+}
+
+/* Get or create a smc_link_cluster by ini
+ * caller MUST call smc_link_cluster_put after this.
+ */
+static inline struct smc_link_cluster *
+smcr_link_get_cluster_by_ini(struct smc_init_info *ini, int role)
+{
+	struct smc_link_cluster_compare_arg key;
+
+	if (ini->is_smcd)
+		return NULL;
+
+	key.smcr_version = ini->smcr_version;
+	key.peer_systemid = ini->peer_systemid;
+	key.peer_gid = ini->peer_gid;
+	key.peer_mac = ini->peer_mac;
+	key.role	= role;
+	if (role == SMC_CLNT)
+		key.clcqpn	= ini->ib_clcqpn;
+
+	return smcr_link_get_or_create_cluster(&key);
+}
+
+/* callback when smc link state change */
+void smcr_link_cluster_on_link_state(struct smc_link *lnk)
+{
+	struct smc_link_cluster *lnkc;
+	int nr = 0;
+
+	/* barrier for lnk->state */
+	smp_mb();
+
+	/* only the first link can make connections block on
+	 * first_contact_waitqueue
+	 */
+	if (lnk->link_idx != SMC_SINGLE_LINK)
+		return;
+
+	/* state already seen  */
+	if (lnk->state_record & SMC_LNK_STATE_BIT(lnk->state))
+		return;
+
+	lnkc = smcr_link_get_cluster(lnk);
+
+	if (unlikely(!lnkc))
+		return;
+
+	spin_lock_bh(&lnkc->lock);
+
+	/* all lnk state change should be
+	 * 1. SMC_LNK_UNUSED -> SMC_LNK_TEAR_DOWN (link init failed)
+	 * 2. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_TEAR_DOWN
+	 * 3. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DOWN
+	 * 4. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_INACTIVE -> SMC_LNK_TEAR_DOWN
+	 * 5. SMC_LNK_UNUSED -> SMC_LNK_ACTIVATING -> SMC_LNK_ACTIVE -> SMC_LNK_INACTIVE
+	 * -> SMC_LNK_TEAR_DOWN
+	 */
+	switch (lnk->state) {
+	case SMC_LNK_ACTIVATING:
+		/* It's safe to take a reference without the lock
+		 * because smcr_link_get_cluster already holds one
+		 */
+		smc_link_cluster_hold(lnkc);
+		break;
+	case SMC_LNK_TEAR_DOWN:
+		if (lnk->state_record & SMC_LNK_STATE_BIT(SMC_LNK_ACTIVATING))
+			/* smc_link_cluster_hold in SMC_LNK_ACTIVATING */
+			smc_link_cluster_put(lnkc);
+		fallthrough;
+	case SMC_LNK_ACTIVE:
+	case SMC_LNK_INACTIVE:
+		if (!(lnk->state_record &
+			(SMC_LNK_STATE_BIT(SMC_LNK_ACTIVE)
+			| SMC_LNK_STATE_BIT(SMC_LNK_INACTIVE)))) {
+			lnkc->pending_capability -= (SMC_RMBS_PER_LGR_MAX - 1);
+			nr = SMC_RMBS_PER_LGR_MAX - 1;
+			if (unlikely(lnk->state != SMC_LNK_ACTIVE)) {
+				lnkc->lacking_first_contact++;
+				/* only to wake up one connection to perform
+				 * first contact in server side, client MUST wake up
+				 * all to decline.
+				 */
+				if (lnkc->role == SMC_SERV)
+					nr = 1;
+			}
+		}
+		break;
+	case SMC_LNK_UNUSED:
+		pr_warn_ratelimited("net/smc: invalid lnk state\n");
+		break;
+	}
+	SMC_LNK_STATE_RECORD(lnk, lnk->state);
+	spin_unlock_bh(&lnkc->lock);
+	if (nr)
+		wake_up_nr(&lnkc->first_contact_waitqueue, nr);
+	smc_link_cluster_put(lnkc);	/* smc_link_cluster_hold in smcr_link_get_cluster */
+}
+
 /* return head of link group list and its lock for a given link group */
 static inline struct list_head *smc_lgr_list_head(struct smc_link_group *lgr,
 						  spinlock_t **lgr_lock)
@@ -651,8 +904,10 @@  static void smcr_lgr_link_deactivate_all(struct smc_link_group *lgr)
 	for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
 		struct smc_link *lnk = &lgr->lnk[i];
 
-		if (smc_link_sendable(lnk))
+		if (smc_link_sendable(lnk)) {
 			lnk->state = SMC_LNK_INACTIVE;
+			smcr_link_cluster_on_link_state(lnk);
+		}
 	}
 	wake_up_all(&lgr->llc_msg_waiter);
 	wake_up_all(&lgr->llc_flow_waiter);
@@ -756,12 +1011,16 @@  int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	lnk->link_id = smcr_next_link_id(lgr);
 	lnk->lgr = lgr;
 	smc_lgr_hold(lgr); /* lgr_put in smcr_link_clear() */
+	rwlock_init(&lnk->rtokens_lock);
 	lnk->link_idx = link_idx;
 	smc_ibdev_cnt_inc(lnk);
 	smcr_copy_dev_info_to_link(lnk);
 	atomic_set(&lnk->conn_cnt, 0);
 	smc_llc_link_set_uid(lnk);
 	INIT_WORK(&lnk->link_down_wrk, smc_link_down_work);
+	lnk->peer_qpn = ini->ib_clcqpn;
+	memcpy(lnk->peer_gid, ini->peer_gid, SMC_GID_SIZE);
+	memcpy(lnk->peer_mac, ini->peer_mac, sizeof(lnk->peer_mac));
 	if (!lnk->smcibdev->initialized) {
 		rc = (int)smc_ib_setup_per_ibdev(lnk->smcibdev);
 		if (rc)
@@ -792,6 +1051,7 @@  int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	if (rc)
 		goto destroy_qp;
 	lnk->state = SMC_LNK_ACTIVATING;
+	smcr_link_cluster_on_link_state(lnk);
 	return 0;
 
 destroy_qp:
@@ -806,6 +1066,8 @@  int smcr_link_init(struct smc_link_group *lgr, struct smc_link *lnk,
 	smc_ibdev_cnt_dec(lnk);
 	put_device(&lnk->smcibdev->ibdev->dev);
 	smcibdev = lnk->smcibdev;
+	lnk->state = SMC_LNK_TEAR_DOWN;
+	smcr_link_cluster_on_link_state(lnk);
 	memset(lnk, 0, sizeof(struct smc_link));
 	lnk->state = SMC_LNK_UNUSED;
 	if (!atomic_dec_return(&smcibdev->lnk_cnt))
@@ -1263,6 +1525,8 @@  void smcr_link_clear(struct smc_link *lnk, bool log)
 	if (!lnk->lgr || lnk->clearing ||
 	    lnk->state == SMC_LNK_UNUSED)
 		return;
+	lnk->state = SMC_LNK_TEAR_DOWN;
+	smcr_link_cluster_on_link_state(lnk);
 	lnk->clearing = 1;
 	lnk->peer_qpn = 0;
 	smc_llc_link_clear(lnk, log);
@@ -1712,6 +1976,7 @@  void smcr_link_down_cond(struct smc_link *lnk)
 {
 	if (smc_link_downing(&lnk->state)) {
 		trace_smcr_link_down(lnk, __builtin_return_address(0));
+		smcr_link_cluster_on_link_state(lnk);
 		smcr_link_down(lnk);
 	}
 }
@@ -1721,6 +1986,7 @@  void smcr_link_down_cond_sched(struct smc_link *lnk)
 {
 	if (smc_link_downing(&lnk->state)) {
 		trace_smcr_link_down(lnk, __builtin_return_address(0));
+		smcr_link_cluster_on_link_state(lnk);
 		schedule_work(&lnk->link_down_wrk);
 	}
 }
@@ -1850,11 +2116,13 @@  int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 {
 	struct smc_connection *conn = &smc->conn;
 	struct net *net = sock_net(&smc->sk);
+	DECLARE_WAITQUEUE(wait, current);
+	struct smc_link_cluster *lnkc = NULL;
 	struct list_head *lgr_list;
 	struct smc_link_group *lgr;
 	enum smc_lgr_role role;
 	spinlock_t *lgr_lock;
-	int rc = 0;
+	int rc = 0, timeo = CLC_WAIT_TIME;
 
 	lgr_list = ini->is_smcd ? &ini->ism_dev[ini->ism_selected]->lgr_list :
 				  &smc_lgr_list.list;
@@ -1862,12 +2130,29 @@  int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 				  &smc_lgr_list.lock;
 	ini->first_contact_local = 1;
 	role = smc->listen_smc ? SMC_SERV : SMC_CLNT;
-	if (role == SMC_CLNT && ini->first_contact_peer)
+
+	if (!ini->is_smcd) {
+		lnkc = smcr_link_get_cluster_by_ini(ini, role);
+		if (unlikely(!lnkc))
+			return SMC_CLC_DECL_INTERR;
+	}
+
+	if (role == SMC_CLNT && ini->first_contact_peer) {
+		if (!ini->is_smcd) {
+			/* first_contact */
+			spin_lock_bh(&lnkc->lock);
+			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
+			spin_unlock_bh(&lnkc->lock);
+		}
 		/* create new link group as well */
 		goto create;
+	}
 
 	/* determine if an existing link group can be reused */
 	spin_lock_bh(lgr_lock);
+	if (!ini->is_smcd)
+		spin_lock(&lnkc->lock);
+again:
 	list_for_each_entry(lgr, lgr_list, list) {
 		write_lock_bh(&lgr->conns_lock);
 		if ((ini->is_smcd ?
@@ -1894,9 +2179,41 @@  int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 		}
 		write_unlock_bh(&lgr->conns_lock);
 	}
+	if (!ini->is_smcd && ini->first_contact_local) {
+		if (lnkc->pending_capability > lnkc->conns_pending) {
+			lnkc->conns_pending++;
+			add_wait_queue(&lnkc->first_contact_waitqueue, &wait);
+			spin_unlock(&lnkc->lock);
+			spin_unlock_bh(lgr_lock);
+			set_current_state(TASK_INTERRUPTIBLE);
+			/* need to wait at least once first contact done */
+			timeo = schedule_timeout(timeo);
+			set_current_state(TASK_RUNNING);
+			remove_wait_queue(&lnkc->first_contact_waitqueue, &wait);
+			spin_lock_bh(lgr_lock);
+			spin_lock(&lnkc->lock);
+
+			lnkc->conns_pending--;
+			if (likely(timeo && !lnkc->lacking_first_contact))
+				goto again;
+
+			/* lnk create failed, only server side can raise
+			 * a new first contact. client side here will
+			 * fallback by SMC_CLC_DECL_SYNCERR.
+			 */
+			if (role == SMC_SERV && lnkc->lacking_first_contact)
+				lnkc->lacking_first_contact--;
+		}
+		if (role == SMC_SERV) {
+			/* first_contact */
+			lnkc->pending_capability += (SMC_RMBS_PER_LGR_MAX - 1);
+		}
+	}
+	if (!ini->is_smcd)
+		spin_unlock(&lnkc->lock);
 	spin_unlock_bh(lgr_lock);
 	if (rc)
-		return rc;
+		goto out;
 
 	if (role == SMC_CLNT && !ini->first_contact_peer &&
 	    ini->first_contact_local) {
@@ -1904,7 +2221,8 @@  int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 		 * a new one
 		 * send out_of_sync decline, reason synchr. error
 		 */
-		return SMC_CLC_DECL_SYNCERR;
+		rc = SMC_CLC_DECL_SYNCERR;
+		goto out;
 	}
 
 create:
@@ -1941,6 +2259,9 @@  int smc_conn_create(struct smc_sock *smc, struct smc_init_info *ini)
 #endif
 
 out:
+	/* smc_link_cluster_hold in smcr_link_get_or_create_cluster */
+	if (!ini->is_smcd)
+		smc_link_cluster_put(lnkc);
 	return rc;
 }
 
@@ -2500,19 +2821,24 @@  int smc_rtoken_add(struct smc_link *lnk, __be64 nw_vaddr, __be32 nw_rkey)
 	u32 rkey = ntohl(nw_rkey);
 	int i;
 
+	write_lock_bh(&lnk->rtokens_lock);
 	for (i = 0; i < SMC_RMBS_PER_LGR_MAX; i++) {
 		if (lgr->rtokens[i][lnk->link_idx].rkey == rkey &&
 		    lgr->rtokens[i][lnk->link_idx].dma_addr == dma_addr &&
 		    test_bit(i, lgr->rtokens_used_mask)) {
 			/* already in list */
+			write_unlock_bh(&lnk->rtokens_lock);
 			return i;
 		}
 	}
 	i = smc_rmb_reserve_rtoken_idx(lgr);
-	if (i < 0)
+	if (i < 0) {
+		write_unlock_bh(&lnk->rtokens_lock);
 		return i;
+	}
 	lgr->rtokens[i][lnk->link_idx].rkey = rkey;
 	lgr->rtokens[i][lnk->link_idx].dma_addr = dma_addr;
+	write_unlock_bh(&lnk->rtokens_lock);
 	return i;
 }
 
@@ -2523,6 +2849,7 @@  int smc_rtoken_delete(struct smc_link *lnk, __be32 nw_rkey)
 	u32 rkey = ntohl(nw_rkey);
 	int i, j;
 
+	write_lock_bh(&lnk->rtokens_lock);
 	for (i = 0; i < SMC_RMBS_PER_LGR_MAX; i++) {
 		if (lgr->rtokens[i][lnk->link_idx].rkey == rkey &&
 		    test_bit(i, lgr->rtokens_used_mask)) {
@@ -2531,9 +2858,11 @@  int smc_rtoken_delete(struct smc_link *lnk, __be32 nw_rkey)
 				lgr->rtokens[i][j].dma_addr = 0;
 			}
 			clear_bit(i, lgr->rtokens_used_mask);
+			write_unlock_bh(&lnk->rtokens_lock);
 			return 0;
 		}
 	}
+	write_unlock_bh(&lnk->rtokens_lock);
 	return -ENOENT;
 }
 
@@ -2599,12 +2928,23 @@  static int smc_core_reboot_event(struct notifier_block *this,
 
 int __init smc_core_init(void)
 {
+	/* init smc lnk cluster maps */
+	rhashtable_init(&smc_lgr_manager.link_cluster_maps, &smcr_link_cluster_rhl_params);
 	return register_reboot_notifier(&smc_reboot_notifier);
 }
 
+static void smc_link_cluster_free_cb(void *ptr, void *arg)
+{
+	pr_warn("smc: smc lnk cluster refcnt leak.\n");
+	kfree(ptr);
+}
+
 /* Called (from smc_exit) when module is removed */
 void smc_core_exit(void)
 {
 	unregister_reboot_notifier(&smc_reboot_notifier);
 	smc_lgrs_shutdown();
+	/* destroy smc lnk cluster maps */
+	rhashtable_free_and_destroy(&smc_lgr_manager.link_cluster_maps, smc_link_cluster_free_cb,
+				    NULL);
 }
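
The smc_link_cluster_put() helper in the smc_core.c changes above releases references in two steps: refcount_dec_not_one() covers every release except the last one without taking the table lock, and only the final reference takes smc_lgr_manager.lock to unlink the cluster from the hash table before freeing it. The following user-space sketch shows that pattern under stated assumptions: C11 atomics stand in for refcount_t, a counter stands in for the spinlock, and a flag stands in for rhashtable membership. The names struct obj, ref_dec_not_one(), and obj_put() are illustrative, not kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	atomic_int ref;
	bool in_table;		/* stands in for membership in the hash table */
};

/* Decrement the counter unless it is 1; return true if it was decremented. */
bool ref_dec_not_one(atomic_int *r)
{
	int old = atomic_load(r);

	while (old != 1) {
		/* on failure, old is reloaded and the loop re-checks it */
		if (atomic_compare_exchange_weak(r, &old, old - 1))
			return true;
	}
	return false;
}

/* Drop one reference. Returns true if the caller dropped the last one
 * and is now responsible for freeing the object.
 */
bool obj_put(struct obj *o, int *slow_path_taken)
{
	if (ref_dec_not_one(&o->ref))
		return false;		/* fast path: other holders remain */

	/* slow path: the kernel code takes the table lock around this part */
	(*slow_path_taken)++;
	if (atomic_fetch_sub(&o->ref, 1) == 1) {
		o->in_table = false;	/* unlink while "locked" */
		return true;
	}
	return false;
}
```

The point of the split is that a lookup which takes a new reference under the same lock cannot race with the final decrement, while ordinary releases never contend on the lock.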
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index fe8b524..3c3bc11 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -15,6 +15,7 @@ 
 #include <linux/atomic.h>
 #include <linux/smc.h>
 #include <linux/pci.h>
+#include <linux/rhashtable.h>
 #include <rdma/ib_verbs.h>
 #include <net/genetlink.h>
 
@@ -29,18 +30,66 @@  struct smc_lgr_list {			/* list of link group definition */
 	u32			num;	/* unique link group number */
 };
 
+struct smc_lgr_manager {		/* manager for link group */
+	struct rhashtable	link_cluster_maps;	/* maps of smc_link_cluster */
+	spinlock_t		lock;	/* lock for link_cluster_maps */
+};
+
+struct smc_link_cluster {
+	struct rhash_head	rnode;	/* node for rhashtable */
+	struct wait_queue_head	first_contact_waitqueue;
+					/* queue for non first contact to wait
+					 * first contact to be established.
+					 */
+	spinlock_t		lock;	/* protection for link group */
+	refcount_t		ref;	/* refcount for cluster */
+	unsigned long		pending_capability;
+					/* maximum pending number of connections that
+					 * need wait first contact complete.
+					 */
+	unsigned long		conns_pending;
+					/* connections that are waiting for first contact
+					 * complete
+					 */
+	u32			lacking_first_contact;
+					/* indicate that the connection
+					 * should perform first contact.
+					 */
+	u8		peer_systemid[SMC_SYSTEMID_LEN];
+	u8		peer_mac[ETH_ALEN];	/* = gid[8:10||13:15] */
+	u8		peer_gid[SMC_GID_SIZE];	/* gid of peer */
+	int		clcqpn;
+	int		role;
+};
+
 enum smc_lgr_role {		/* possible roles of a link group */
 	SMC_CLNT,	/* client */
 	SMC_SERV	/* server */
 };
 
+struct smc_link_cluster_compare_arg	/* key for smc_link_cluster */
+{
+	int	smcr_version;
+	enum smc_lgr_role role;
+	u8	*peer_systemid;
+	u8	*peer_gid;
+	u8	*peer_mac;
+	int clcqpn;
+};
+
 enum smc_link_state {			/* possible states of a link */
 	SMC_LNK_UNUSED,		/* link is unused */
 	SMC_LNK_INACTIVE,	/* link is inactive */
 	SMC_LNK_ACTIVATING,	/* link is being activated */
 	SMC_LNK_ACTIVE,		/* link is active */
+	SMC_LNK_TEAR_DOWN,	/* link is torn down */
 };
 
+#define SMC_LNK_STATE_BIT(state)	(1 << (state))
+
+#define	SMC_LNK_STATE_RECORD(lnk, state)	\
+	((lnk)->state_record |= SMC_LNK_STATE_BIT(state))
+
 #define SMC_WR_BUF_SIZE		48	/* size of work request buffer */
 #define SMC_WR_BUF_V2_SIZE	8192	/* size of v2 work request buffer */
 
@@ -107,6 +156,7 @@  struct smc_link {
 	u32			wr_tx_cnt;	/* number of WR send buffers */
 	wait_queue_head_t	wr_tx_wait;	/* wait for free WR send buf */
 	atomic_t		wr_tx_refcnt;	/* tx refs to link */
+	rwlock_t		rtokens_lock;
 
 	struct smc_wr_buf	*wr_rx_bufs;	/* WR recv payload buffers */
 	struct ib_recv_wr	*wr_rx_ibs;	/* WR recv meta data */
@@ -145,6 +195,7 @@  struct smc_link {
 	int			ndev_ifidx; /* network device ifindex */
 
 	enum smc_link_state	state;		/* state of link */
+	int			state_record;	/* bitmask of states the link has been in */
 	struct delayed_work	llc_testlink_wrk; /* testlink worker */
 	struct completion	llc_testlink_resp; /* wait for rx of testlink */
 	int			llc_testlink_time; /* testlink interval */
@@ -557,6 +608,8 @@  struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
 int smcr_nl_get_link(struct sk_buff *skb, struct netlink_callback *cb);
 int smcd_nl_get_lgr(struct sk_buff *skb, struct netlink_callback *cb);
 
+void smcr_link_cluster_on_link_state(struct smc_link *lnk);
+
 static inline struct smc_link_group *smc_get_lgr(struct smc_link *link)
 {
 	return link->lgr;
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index 175026a..c1ce80b 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -1099,6 +1099,7 @@  int smc_llc_cli_add_link(struct smc_link *link, struct smc_llc_qentry *qentry)
 		goto out;
 out_clear_lnk:
 	lnk_new->state = SMC_LNK_INACTIVE;
+	smcr_link_cluster_on_link_state(lnk_new);
 	smcr_link_clear(lnk_new, false);
 out_reject:
 	smc_llc_cli_add_link_reject(qentry);
@@ -1278,6 +1279,7 @@  static void smc_llc_delete_asym_link(struct smc_link_group *lgr)
 		return; /* no asymmetric link */
 	if (!smc_link_downing(&lnk_asym->state))
 		return;
+	smcr_link_cluster_on_link_state(lnk_asym);
 	lnk_new = smc_switch_conns(lgr, lnk_asym, false);
 	smc_wr_tx_wait_no_pending_sends(lnk_asym);
 	if (!lnk_new)
@@ -1492,6 +1494,7 @@  int smc_llc_srv_add_link(struct smc_link *link,
 out_err:
 	if (link_new) {
 		link_new->state = SMC_LNK_INACTIVE;
+		smcr_link_cluster_on_link_state(link_new);
 		smcr_link_clear(link_new, false);
 	}
 out:
@@ -1602,8 +1605,10 @@  static void smc_llc_process_cli_delete_link(struct smc_link_group *lgr)
 	del_llc->reason = 0;
 	smc_llc_send_message(lnk, &qentry->msg); /* response */
 
-	if (smc_link_downing(&lnk_del->state))
+	if (smc_link_downing(&lnk_del->state)) {
+		smcr_link_cluster_on_link_state(lnk_del);
 		smc_switch_conns(lgr, lnk_del, false);
+	}
 	smcr_link_clear(lnk_del, true);
 
 	active_links = smc_llc_active_link_count(lgr);
@@ -1676,6 +1681,7 @@  static void smc_llc_process_srv_delete_link(struct smc_link_group *lgr)
 		goto out; /* asymmetric link already deleted */
 
 	if (smc_link_downing(&lnk_del->state)) {
+		smcr_link_cluster_on_link_state(lnk_del);
 		if (smc_switch_conns(lgr, lnk_del, false))
 			smc_wr_tx_wait_no_pending_sends(lnk_del);
 	}
@@ -2167,6 +2173,7 @@  void smc_llc_link_active(struct smc_link *link)
 		schedule_delayed_work(&link->llc_testlink_wrk,
 				      link->llc_testlink_time);
 	}
+	smcr_link_cluster_on_link_state(link);
 }
 
 /* called in worker context */