diff mbox series

[net] net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error

Message ID 1646140644-121649-1-git-send-email-alibuda@linux.alibaba.com (mailing list archive)
State Not Applicable
Headers show
Series [net] net/smc: fix unexpected SMC_CLC_DECL_ERR_REGRMB error | expand

Commit Message

D. Wythe March 1, 2022, 1:17 p.m. UTC
From: "D. Wythe" <alibuda@linux.alibaba.com>

Remove connections from link group is not synchronous with handling
SMC_LLC_DELETE_RKEY, which means that even the number of connections is
less that SMC_RMBS_PER_LGR_MAX, it does not mean that the connection can
register rtoken successfully later, in other words, the rtoken entry may
have not been released. This will cause an unexpected
SMC_CLC_DECL_ERR_REGRMB to be reported, and then ths smc connection have
to fallback to TCP.

We found that the main reason for the problem dues to following execution
sequence:

Server Conn A:           Server Conn B:			Client Conn B:

smc_lgr_unregister_conn
                        smc_lgr_register_conn
                        smc_clc_send_accept     ->
                                                        smc_rtoken_add
smcr_buf_unuse
		->		Client Conn A:
				smc_rtoken_delete

smc_lgr_unregister_conn() makes current link available to assigned to new
incoming connection, while smcr_buf_unuse() has not executed yet, which
means that smc_rtoken_add may fail because of insufficient rtoken_entry,
reversing their execution order will avoid this problem.

Fixes: 3e034725c0d8 ("net/smc: common functions for RMBs and send buffers")
Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/smc_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Wen Gu March 2, 2022, 7:28 a.m. UTC | #1
在 2022/3/1 下午9:17, D. Wythe 写道:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> Remove connections from link group is not synchronous with handling
> SMC_LLC_DELETE_RKEY, which means that even the number of connections is
> less that SMC_RMBS_PER_LGR_MAX, it does not mean that the connection can
> register rtoken successfully later, in other words, the rtoken entry may
> have not been released. This will cause an unexpected
> SMC_CLC_DECL_ERR_REGRMB to be reported, and then ths smc connection have
> to fallback to TCP.
> 


IMHO, if there are SMC_RMBS_PER_LGR_MAX connections in the link group now,
one of them is being removed and here comes a new connection at this moment,
then:

(1) without this patch, the new connection will be added into the old link group
     but fallback if the removing connection has not finished unregistering its rmb.

(2) with this patch, a new link group will be created and the new connection
     will be added into the new link group.

I am wondering if (1) should be considered as a issue, or just a bydesign?
If it is a issue, I think this patch works, Thanks!

Reviewed-by: Wen Gu <guwen@linux.alibaba.com>

Best Regards,
Wen Gu
D. Wythe March 2, 2022, 7:51 a.m. UTC | #2
On Wed, Mar 02, 2022 at 03:28:32PM +0800, Wen Gu wrote:
> 
> 
> 在 2022/3/1 下午9:17, D. Wythe 写道:
> >From: "D. Wythe" <alibuda@linux.alibaba.com>
> >
> >Remove connections from link group is not synchronous with handling
> >SMC_LLC_DELETE_RKEY, which means that even the number of connections is
> >less that SMC_RMBS_PER_LGR_MAX, it does not mean that the connection can
> >register rtoken successfully later, in other words, the rtoken entry may
> >have not been released. This will cause an unexpected
> >SMC_CLC_DECL_ERR_REGRMB to be reported, and then ths smc connection have
> >to fallback to TCP.
> >
> 
> 
> IMHO, if there are SMC_RMBS_PER_LGR_MAX connections in the link group now,
> one of them is being removed and here comes a new connection at this moment,
> then:
> 
> (1) without this patch, the new connection will be added into the old link group
>     but fallback if the removing connection has not finished unregistering its rmb.
> 
> (2) with this patch, a new link group will be created and the new connection
>     will be added into the new link group.
> 
> I am wondering if (1) should be considered as a issue, or just a bydesign?
> If it is a issue, I think this patch works, Thanks!


We should always be willing to improve the success rate of the SMC 
connection, creating a new group is not a side effect of this patch, it 
actually dues to the state bewteen connections that can not achieve 
clock synchronization. In fact, it can happen in any times.

Thanks.
Wen Gu March 2, 2022, 9:17 a.m. UTC | #3
On 2022/3/2 3:51 pm, D. Wythe wrote:

> We should always be willing to improve the success rate of the SMC
> connection, creating a new group is not a side effect of this patch, it
> actually dues to the state bewteen connections that can not achieve
> clock synchronization. In fact, it can happen in any times.
> 
> Thanks.

OK, I understand. Thanks.
D. Wythe March 2, 2022, 11:44 a.m. UTC | #4
在 2022/3/1 下午9:17, D. Wythe 写道:
> From: "D. Wythe" <alibuda@linux.alibaba.com>
> 
> Remove connections from link group is not synchronous with handling
> SMC_LLC_DELETE_RKEY, which means that even the number of connections is
> less that SMC_RMBS_PER_LGR_MAX, it does not mean that the connection can
> register rtoken successfully later, in other words, the rtoken entry may
> have not been released. This will cause an unexpected
> SMC_CLC_DECL_ERR_REGRMB to be reported, and then ths smc connection have
> to fallback to TCP.
> 
> We found that the main reason for the problem dues to following execution
> sequence:
> 
> Server Conn A:           Server Conn B:			Client Conn B:
> 
> smc_lgr_unregister_conn
>                          smc_lgr_register_conn
>                          smc_clc_send_accept     ->
>                                                          smc_rtoken_add
> smcr_buf_unuse
> 		->		Client Conn A:
> 				smc_rtoken_delete
> 
> smc_lgr_unregister_conn() makes current link available to assigned to new
> incoming connection, while smcr_buf_unuse() has not executed yet, which
> means that smc_rtoken_add may fail because of insufficient rtoken_entry,
> reversing their execution order will avoid this problem.
> 
> Fixes: 3e034725c0d8 ("net/smc: common functions for RMBs and send buffers")
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>   net/smc/smc_core.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 2f321d2..c9c3a68 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -1161,8 +1161,8 @@ void smc_conn_free(struct smc_connection *conn)
>   			cancel_work_sync(&conn->abort_work);
>   	}
>   	if (!list_empty(&lgr->list)) {
> -		smc_lgr_unregister_conn(conn);
>   		smc_buf_unuse(conn, lgr); /* allow buffer reuse */
> +		smc_lgr_unregister_conn(conn);
>   	}
>   
>   	if (!lgr->conns_num)

I have two patch for this issue, and i missed one, I'll post it in v2 
series.
diff mbox series

Patch

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 2f321d2..c9c3a68 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1161,8 +1161,8 @@  void smc_conn_free(struct smc_connection *conn)
 			cancel_work_sync(&conn->abort_work);
 	}
 	if (!list_empty(&lgr->list)) {
-		smc_lgr_unregister_conn(conn);
 		smc_buf_unuse(conn, lgr); /* allow buffer reuse */
+		smc_lgr_unregister_conn(conn);
 	}
 
 	if (!lgr->conns_num)