[RFC,net-next,0/2] Optimize the parallelism of SMC-R connections

Message ID

1694008530-85087-1-git-send-email-alibuda@linux.alibaba.com (mailing list archive)

Headers

From: "D. Wythe" <alibuda@linux.alibaba.com>
To: kgraul@linux.ibm.com, wenjia@linux.ibm.com, jaka@linux.ibm.com
Cc: kuba@kernel.org, davem@davemloft.net, netdev@vger.kernel.org,
        linux-s390@vger.kernel.org, linux-rdma@vger.kernel.org,
        "D. Wythe" <alibuda@linux.alibaba.com>
Subject: [RFC net-next 0/2] Optimize the parallelism of SMC-R connections 
Date: Wed,  6 Sep 2023 21:55:28 +0800
Message-Id: <1694008530-85087-1-git-send-email-alibuda@linux.alibaba.com>
Precedence: bulk

Series

Optimize the parallelism of SMC-R connections

Message

D. Wythe Sept. 6, 2023, 1:55 p.m. UTC

From: "D. Wythe" <alibuda@linux.alibaba.com>

This patchset attempts to optimize the parallelism of SMC-R connections
in quite a simple way, by reducing unnecessary blocking on locks.

According to off-CPU profiling, the SMC worker spends its off-CPU time
as follows:

smc_listen_work 			(48.17%)
	__mutex_lock.isra.11 		(47.96%)

Ideally, SMC-R connection setup should block only on network I/O
events, but it is quite clear that it currently spends most of its
time queued on the lock.

Before creating a connection, we always first try to create it
without allowing the creation of a new lgr (link group). If that
succeeds, the connection does not rely on a new link group.
In other words, holding xxx_lgr_pending is no longer necessary
in that case.
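
A minimal user-space sketch of this idea, not the code from the patches. Every name below (conn_create, connect_one, lgr_table, and the two pthread mutexes standing in for the link group list lock and xxx_lgr_pending) is a hypothetical stand-in. The connection first tries to join an existing link group with creation disallowed, and takes the pending lock only when that attempt fails.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from net/smc. */
struct lgr {
	int peer_id;
	bool usable;	/* false while the first contact is still in flight */
	int conns;
};

#define MAX_LGRS 8
static struct lgr lgr_table[MAX_LGRS];
static size_t lgr_count;
/* Short lock around list lookups and updates. */
static pthread_mutex_t lgr_list_lock = PTHREAD_MUTEX_INITIALIZER;
/* Long lock, modelling xxx_lgr_pending: held across the whole handshake. */
static pthread_mutex_t lgr_pending = PTHREAD_MUTEX_INITIALIZER;

/* Join a usable link group for peer_id, or create one if allow_new is set. */
static struct lgr *conn_create(int peer_id, bool allow_new, bool *first_contact)
{
	struct lgr *found = NULL;

	*first_contact = false;
	pthread_mutex_lock(&lgr_list_lock);
	for (size_t i = 0; i < lgr_count; i++) {
		if (lgr_table[i].peer_id == peer_id && lgr_table[i].usable) {
			found = &lgr_table[i];
			break;
		}
	}
	if (!found && allow_new && lgr_count < MAX_LGRS) {
		found = &lgr_table[lgr_count++];
		*found = (struct lgr){ .peer_id = peer_id, .usable = false };
		*first_contact = true;
	}
	if (found)
		found->conns++;
	pthread_mutex_unlock(&lgr_list_lock);
	return found;
}

/* Returns true if this connection had to take the pending lock. */
static bool connect_one(int peer_id)
{
	bool first_contact;
	struct lgr *lgr = conn_create(peer_id, false, &first_contact);

	if (lgr)
		return false;	/* reused a group: handshake runs unserialized */

	pthread_mutex_lock(&lgr_pending);
	/* Retry: another first contact may have finished while we waited. */
	lgr = conn_create(peer_id, true, &first_contact);
	/* ... the CLC handshake would run here ... */
	if (lgr && first_contact) {
		pthread_mutex_lock(&lgr_list_lock);
		lgr->usable = true;	/* first contact done, allow reuse */
		pthread_mutex_unlock(&lgr_list_lock);
	}
	pthread_mutex_unlock(&lgr_pending);
	return true;
}
```

In this model only the first connection to a peer pays for the long lock; later calls to connect_one() for the same peer return without touching it.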

Note that removing this lock will not have an immediate effect
in the current version, as there are still some concurrency issues
in the SMC handshake phase. Regardless, removing this lock
is a prerequisite for other optimizations.

If you have any questions or suggestions, please let me know.

D. Wythe (2):
  net/smc: refactoring lgr pending lock
  net/smc: remove locks smc_client_lgr_pending and
    smc_server_lgr_pending

 net/smc/af_smc.c   | 24 ++++++++++++------------
 net/smc/smc_clc.h  |  1 +
 net/smc/smc_core.c | 28 ++++++++++++++++++++++++++--
 net/smc/smc_core.h | 21 +++++++++++++++++++++
 4 files changed, 60 insertions(+), 14 deletions(-)

Comments

Alexandra Winter Sept. 8, 2023, 9:07 a.m. UTC | #1

I have to admit that locking in SMC is quite confusing to me, so these are just my thoughts.

Your proposal seems to make things even more complex.

I understand the goal to optimize parallelism.
Today we have the global smc_server/client_lgr_pending AND smc_lgr_list.lock (and more).
There seems to be some overlap in scope.
Maybe there is some way to reduce the length of the locked paths?
Or use other mechanisms than the big fat smc_server/client_lgr_pending mutex?
e.g.
If you think you can unlock after __smc_conn_create in the reuse-existing-LGR case,
why is the lock needed until after smc_clc_send_confirm in the new-LGR case?

Your approach of storing the global lock per ini and then sometimes double-freeing it
seems a bit homebrewed, though.
E.g. I'm afraid the existing lock checking algorithms could not verify this pattern.

Alexandra Winter Sept. 21, 2023, 12:36 p.m. UTC | #2

On 18.09.23 05:58, D. Wythe wrote:
> Hi Alexandra,
> 
> Sorry for the late reply. I have been thinking about the question you mentioned for a while, and this is a great opportunity to discuss this issue.
> My point is that the purpose of the locks is to minimize the expansion of the number of link groups as much as possible.
> 
> As we all know, the SMC-R protocol has the following specifications:
> 
>  * A SMC-R connection MUST be mapped into one link group.
>  * A link group is usually created by a connection, which is also known
>    as "First Contact."
> 
> If we start from scratch, we can design the connection process as follows:
> 
> 1. Check if there are any available link groups. If so, map the
>    connection into it and go to step 3.
> 2. Mark this connection as "First Contact," create a link group, and
>    mark the new link group as unavailable.
> 3. Finish connection establishment.
> 4. If the connection is "First Contact," mark the new link group as
>    available and map the connection into it.
> 
> I think there is no logical problem with this process, but there is a practical issue where burst traffic can result in burst link groups.
> 
> For example, if there are 10,000 incoming connections, based on the above logic, the most extreme scenario would be to create 10,000 link groups.
> This can cause significant memory pressure and even be used for security attacks.
> 
> To address this issue, the simplest way is to make connection setup mutually exclusive, using the following process:
> 
> 1. Block other incoming connections.
> 2. Check if there are any available link groups. If so, map the
>    connection into it and go to step 4.
> 3. Mark this connection as "First Contact," create a link group, and
>    mark it as unavailable.
> 4. Finish connection establishment.
> 5. If the connection is "First Contact," mark the new link group as
>    available and map the connection into it.
> 6. Allow other connections to come in.
> 
> And this is our current process now!
> 
> Recall that the purpose of the locks is to minimize the expansion of the number of link groups. If we agree on this point, we can observe that
> when step 2 goes directly to step 4, the process never creates a new link group. Obviously, the lock is not needed here.

Well, you still have the issue of a link group going away. Thread 1 is deleting the last connection from a link group and shutting it down. Thread 2 is adding a 'second' connection (from its point of view) to the link group.
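
One common way to close this kind of race, sketched here in user-space C11 rather than kernel code, is a reference count that cannot be raised again once it has dropped to zero (the kernel offers helpers such as refcount_inc_not_zero() for this purpose). The names lgr_ref, lgr_tryget and lgr_put are made up for illustration and are not part of net/smc.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference count for a link group. */
struct lgr_ref {
	atomic_int refs;
};

/* Thread 2: take a reference only if the group is still alive. */
static bool lgr_tryget(struct lgr_ref *l)
{
	int old = atomic_load(&l->refs);

	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&l->refs, &old, old + 1))
			return true;
	}
	return false;	/* already at zero: the group is being torn down */
}

/* Thread 1: drop a reference; returns true if the caller must free. */
static bool lgr_put(struct lgr_ref *l)
{
	int old = atomic_fetch_sub(&l->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

With this pattern the thread adding a 'second' connection either gets a reference before the count reaches zero, or sees the failure and has to fall back to creating a new link group.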

> 
> Then the last question: why is the lock needed until after smc_clc_send_confirm in the new-LGR case? We can try to move step 6 earlier, as follows:
> 
> 1. Block other incoming connections.
> 2. Check if there are any available link groups. If so, map the
>    connection into it and go to step 4.
> 3. Mark this connection as "First Contact," create a link group, and
>    mark it as unavailable.
> 4. Allow other connections to come in.
> 5. Finish connection establishment.
> 6. If the connection is "First Contact," mark the new link group as
>    available and map the connection into it.
> 
> There is also no problem with this process! However, note that this logic does not address burst issues.
> Burst traffic will still result in burst link groups because a new link group can only be marked as available when the "First Contact" is completed,
> which is after sending the CLC Confirm.
> 
> Hope my point is helpful to you. If you have any questions, please let me know. Thanks.
> 
> Best wishes,
> D. Wythe
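
The worst case described in the quoted steps can be illustrated with a deliberately crude, single-threaded model. It assumes that, whenever the lock is dropped before the handshake finishes, every connection of a burst reaches the lookup before any earlier first contact has completed. The function count_link_groups and its parameter are invented for this sketch.

```c
#include <stdbool.h>

/*
 * Toy model, not SMC code: count the link groups created when n
 * connections to the same peer arrive in one burst.
 */
static int count_link_groups(int n, bool lock_held_through_handshake)
{
	bool usable = false;	/* is there a group later arrivals may reuse? */
	int groups = 0;

	for (int i = 0; i < n; i++) {
		if (usable)
			continue;	/* map the connection into the existing group */
		groups++;		/* "First Contact": create a new, unusable group */
		if (lock_held_through_handshake)
			usable = true;	/* handshake ends before the next one is admitted */
		/* else: the next arrival is admitted while the group is still unusable */
	}
	return groups;
}
```

With 10,000 arrivals the model yields one link group when the lock spans the handshake and 10,000 when it is released early, matching the extreme case described in the quoted text.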

You are asking exactly the right questions here. Creation of new connections is on the critical path,
and optimizing the design for parallelism will increase performance, while insufficient
locking will create nasty bugs.
Many programmers have dealt with these issues before us. I would recommend consulting existing proven
patterns, e.g. the ones listed in Paul McKenney's book
(https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/),
e.g. 'Chapter 10.3 Read-Mostly Data Structures', and of course the kernel documentation folder.
Improving an existing codebase like smc without breaking it is not trivial. Obviously, a step-by-step approach
works best. So try to identify actions that can be done under a smaller (as in more granular) lock
instead of under a global lock, or change a mutex into an R/W lock or RCU.
Smaller changes are easier to review (and bisect in case of regressions).
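
As one illustration of a smaller, more granular lock (a generic pattern, not a proposal for the actual smc data structures), the sketch below replaces a single global mutex with one mutex per hash bucket, so that operations on unrelated peers no longer contend. The names bucket, bucket_of and peer_add are hypothetical.

```c
#include <pthread.h>
#include <stdbool.h>

#define NBUCKETS 4
#define SLOTS 8

/* Each bucket carries its own lock instead of sharing a global one. */
struct bucket {
	pthread_mutex_t lock;
	int ids[SLOTS];
	int used;
};

static struct bucket buckets[NBUCKETS] = {
	{ .lock = PTHREAD_MUTEX_INITIALIZER },
	{ .lock = PTHREAD_MUTEX_INITIALIZER },
	{ .lock = PTHREAD_MUTEX_INITIALIZER },
	{ .lock = PTHREAD_MUTEX_INITIALIZER },
};

static unsigned int bucket_of(int id)
{
	return (unsigned int)id % NBUCKETS;
}

/* Add id if it is not present yet; only the owning bucket is locked. */
static bool peer_add(int id)
{
	struct bucket *b = &buckets[bucket_of(id)];
	bool added = false;

	pthread_mutex_lock(&b->lock);
	for (int i = 0; i < b->used; i++) {
		if (b->ids[i] == id)
			goto out;	/* already known, nothing to do */
	}
	if (b->used < SLOTS) {
		b->ids[b->used++] = id;
		added = true;
	}
out:
	pthread_mutex_unlock(&b->lock);
	return added;
}
```

Two callers only serialize when their ids fall into the same bucket, which is the kind of change that can be reviewed and bisected on its own.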

D. Wythe Sept. 25, 2023, 10:10 a.m. UTC | #3

On 9/21/23 8:36 PM, Alexandra Winter wrote:
> On 18.09.23 05:58, D. Wythe wrote:
>> Hi Alexandra,
>>
>> Sorry for the late reply. I have been thinking about the question you mentioned for a while, and this is a great opportunity to discuss this issue.
>> My point is that the purpose of the locks is to minimize the expansion of the number of link groups as much as possible.
>>
>> As we all know, the SMC-R protocol has the following specifications:
>>
>>   * A SMC-R connection MUST be mapped into one link group.
>>   * A link group is usually created by a connection, which is also known
>>     as "First Contact."
>>
>> If we start from scratch, we can design the connection process as follows:
>>
>> 1. Check if there are any available link groups. If so, map the
>>     connection into it and go to step 3.
>> 2. Mark this connection as "First Contact," create a link group, and
>>     mark the new link group as unavailable.
>> 3. Finish connection establishment.
>> 4. If the connection is "First Contact," mark the new link group as
>>     available and map the connection into it.
>>
>> I think there is no logical problem with this process, but there is a practical issue where burst traffic can result in burst link groups.
>>
>> For example, if there are 10,000 incoming connections, based on the above logic, the most extreme scenario would be to create 10,000 link groups.
>> This can cause significant memory pressure and even be used for security attacks.
>>
>> To address this issue, the simplest way is to make connection setup mutually exclusive, using the following process:
>>
>> 1. Block other incoming connections.
>> 2. Check if there are any available link groups. If so, map the
>>     connection into it and go to step 4.
>> 3. Mark this connection as "First Contact," create a link group, and
>>     mark it as unavailable.
>> 4. Finish connection establishment.
>> 5. If the connection is "First Contact," mark the new link group as
>>     available and map the connection into it.
>> 6. Allow other connections to come in.
>>
>> And this is our current process now!
>>
>> Recall that the purpose of the locks is to minimize the expansion of the number of link groups. If we agree on this point, we can observe that
>> when step 2 goes directly to step 4, the process never creates a new link group. Obviously, the lock is not needed here.
> Well, you still have the issue of a link group going away. Thread 1 is deleting the last connection from a link group and shutting it down. Thread 2 is adding a 'second' connection (from its point of view) to the link group.

Hi Alexandra,

That's right. But even if we do nothing, the current implementation
still has this problem.
And this problem can be solved by the spinlock inside smc_conn_create
rather than by the pending lock.

Also, deleting the last connection from a link group does not shut the
link group down right away; it usually waits for 10 minutes of idle time.

>> Then the last question: why is the lock needed until after smc_clc_send_confirm in the new-LGR case? We can try to move step 6 earlier, as follows:
>>
>> 1. Block other incoming connections.
>> 2. Check if there are any available link groups. If so, map the
>>     connection into it and go to step 4.
>> 3. Mark this connection as "First Contact," create a link group, and
>>     mark it as unavailable.
>> 4. Allow other connections to come in.
>> 5. Finish connection establishment.
>> 6. If the connection is "First Contact," mark the new link group as
>>     available and map the connection into it.
>>
>> There is also no problem with this process! However, note that this logic does not address burst issues.
>> Burst traffic will still result in burst link groups because a new link group can only be marked as available when the "First Contact" is completed,
>> which is after sending the CLC Confirm.
>>
>> Hope my point is helpful to you. If you have any questions, please let me know. Thanks.
>>
>> Best wishes,
>> D. Wythe
> You are asking exactly the right questions here. Creation of new connections is on the critical path,
> and optimizing the design for parallelism will increase performance, while insufficient
> locking will create nasty bugs.
> Many programmers have dealt with these issues before us. I would recommend consulting existing proven
> patterns, e.g. the ones listed in Paul McKenney's book
> (https://mirrors.edge.kernel.org/pub/linux/kernel/people/paulmck/perfbook/),
> e.g. 'Chapter 10.3 Read-Mostly Data Structures', and of course the kernel documentation folder.
> Improving an existing codebase like smc without breaking it is not trivial. Obviously, a step-by-step approach
> works best. So try to identify actions that can be done under a smaller (as in more granular) lock
> instead of under a global lock, or change a mutex into an R/W lock or RCU.
> Smaller changes are easier to review (and bisect in case of regressions).

I have to say it's quite hard to make the lock smaller. We did
consider the impact of the patch's complexity on review,
and this might be the simplest solution we can think of. If this
solution is not okay for you, perhaps we can discuss
whether there is a better solution?

Best wishes,
D. Wythe

Alexandra Winter Sept. 26, 2023, 7:37 a.m. UTC | #4

On 25.09.23 12:10, D. Wythe wrote:
> That's right. But even if we do nothing, the current implementation still has this problem.
> And this problem can be solved by the spinlock inside smc_conn_create rather than by the
> pending lock.
> 

May I kindly propose to fix this problem first and then do performance improvements after that?

> Also, deleting the last connection from a link group does not shut the link group down right away;
> it usually waits for 10 minutes of idle time.

Still, a new connection could come in at just the moment when the 10 minutes are over.
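
A sketch of how both sides of that race can be decided under the same lock, so that a connection arriving at the moment of expiry either keeps the link group alive or is cleanly refused and has to create a new one. This is a user-space model with made-up names (lgr_init, lgr_add_conn, lgr_del_conn, lgr_idle_expire), a logical clock instead of a timer, and a pthread mutex standing in for the spinlock mentioned above. It does not claim to reflect how net/smc implements its delayed free.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define IDLE_TIMEOUT 600	/* seconds; models the roughly 10 minutes of idle time */

/* Hypothetical link group state, all protected by lock. */
struct lgr {
	pthread_mutex_t lock;
	int conns;
	long idle_since;	/* logical time at which conns last dropped to 0 */
	bool dead;		/* set once by the expiry path, never cleared */
};

static void lgr_init(struct lgr *l, long now)
{
	pthread_mutex_init(&l->lock, NULL);
	l->conns = 0;
	l->idle_since = now;
	l->dead = false;
}

/* New connection: succeeds only if the group has not been expired. */
static bool lgr_add_conn(struct lgr *l)
{
	bool ok;

	pthread_mutex_lock(&l->lock);
	ok = !l->dead;
	if (ok)
		l->conns++;
	pthread_mutex_unlock(&l->lock);
	return ok;
}

/* Connection teardown: start the idle period when the last one leaves. */
static void lgr_del_conn(struct lgr *l, long now)
{
	pthread_mutex_lock(&l->lock);
	if (--l->conns == 0)
		l->idle_since = now;
	pthread_mutex_unlock(&l->lock);
}

/* Timer path: re-check under the lock before declaring the group dead. */
static bool lgr_idle_expire(struct lgr *l, long now)
{
	bool expired;

	pthread_mutex_lock(&l->lock);
	expired = !l->dead && l->conns == 0 &&
		  now - l->idle_since >= IDLE_TIMEOUT;
	if (expired)
		l->dead = true;
	pthread_mutex_unlock(&l->lock);
	return expired;
}
```

Because the expiry path re-checks the connection count under the lock, a connection that wins the lock first cancels the expiry, and one that loses is refused instead of being attached to a dying link group.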