[net-next,0/2] net/smc: Spread workload over multiple cores

Message ID 20220126130140.66316-1-tonylu@linux.alibaba.com (mailing list archive)

Message

Tony Lu Jan. 26, 2022, 1:01 p.m. UTC
Currently, SMC creates one CQ per IB device and shares this CQ among
all the QPs of its links. This CQ is always bound to the first
completion vector, and the IRQ affinity of that vector ties it to a
single CPU core.

┌────────┐    ┌──────────────┐   ┌──────────────┐
│ SMC IB │    ├────┐         │   │              │
│ DEVICE │ ┌─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│   ┌────┤ │  ├────┘         │   │              │
│   │ CQ ├─┘  └──────────────┘   └──────────────┘
│   │    ├─┐  ┌──────────────┐   ┌──────────────┐
│   └────┤ │  ├────┐         │   │              │
│        │ └─▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │    ├────┘         │   │              │
└────────┘    └──────────────┘   └──────────────┘
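
For reference, a simplified sketch of the existing setup, modeled on
smc_ib_setup_per_ibdev() in net/smc/smc_ib.c (error handling and CQE
sizing are trimmed, and the function is renamed to mark it as a sketch).
Both the send and recv CQs are created once per device with comp_vector
fixed to 0, so all completion work lands on one core:

/* Simplified sketch of the current per-device CQ setup: both CQs always
 * use completion vector 0, so every completion interrupt and the
 * corresponding tasklet run on the same core.
 */
static int smc_ib_setup_per_ibdev_sketch(struct smc_ib_device *smcibdev)
{
        struct ib_cq_init_attr cqattr = {
                .cqe = SMC_MAX_CQE,
                .comp_vector = 0,               /* always the first vector */
        };

        smcibdev->roce_cq_send = ib_create_cq(smcibdev->ibdev,
                                              smc_wr_tx_cq_handler, NULL,
                                              smcibdev, &cqattr);
        if (IS_ERR(smcibdev->roce_cq_send))
                return PTR_ERR(smcibdev->roce_cq_send);

        smcibdev->roce_cq_recv = ib_create_cq(smcibdev->ibdev,
                                              smc_wr_rx_cq_handler, NULL,
                                              smcibdev, &cqattr);
        if (IS_ERR(smcibdev->roce_cq_recv)) {
                ib_destroy_cq(smcibdev->roce_cq_send);
                return PTR_ERR(smcibdev->roce_cq_recv);
        }
        return 0;
}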

In this model, when the number of connections exceeds
SMC_RMBS_PER_LGR_MAX, SMC creates multiple link groups and corresponding
QPs. All connections share a limited number of QPs and a single CQ (for
both the send and receive sides). Since one completion vector is
generally bound to a fixed CPU core, performance is capped by that
single core, especially in large-scale scenarios with many threads and
many connections.

Running an nginx + wrk test with 8 threads and 800 connections on an
8-core host, the softirq load on CPU 0 limits scalability:

04:18:54 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:18:55 PM  all    5.81    0.00   19.42    0.00    2.94   10.21    0.00    0.00    0.00   61.63
04:18:55 PM    0    0.00    0.00    0.00    0.00   16.80   82.78    0.00    0.00    0.00    0.41
<snip>

Nowadays, RDMA devices provide more than one completion vector; for
example, mlx5 exposes 8 and eRDMA exposes 4 by default. This removes the
limitation of a single vector and a single CPU core.

To enhance scalability and take advantage of multi-core resources, we
can spread CQs over different CPU cores and introduce a more flexible
mapping. This patch set proposes a new model whose main difference is
that multiple CQs are created per IB device, with the maximum number of
CQs limited by the device's capability (num_comp_vectors). With multiple
link groups, each link group's QP can bind to the least used CQ, and the
CQs are bound to different completion vectors and CPU cores. This way
the softirq handlers (the wr tx/rx tasklets) are spread over different
cores; a condensed sketch follows the diagram below.

                        ┌──────────────┐   ┌──────────────┐
┌────────┐  ┌───────┐   ├────┐         │   │              │
│        ├─▶│ CQ 0  ├──▶│ QP │ SMC LINK├──▶│SMC Link Group│
│        │  └───────┘   ├────┘         │   │              │
│ SMC IB │  ┌───────┐   └──────────────┘   └──────────────┘
│ DEVICE ├─▶│ CQ 1  │─┐                                    
│        │  └───────┘ │ ┌──────────────┐   ┌──────────────┐
│        │  ┌───────┐ │ ├────┐         │   │              │
│        ├─▶│ CQ n  │ └▶│ QP │ SMC LINK├──▶│SMC Link Group│
└────────┘  └───────┘   ├────┘         │   │              │
                        └──────────────┘   └──────────────┘
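
A condensed sketch of the idea, not the exact patch code: the name
struct smc_ib_cq comes from patch 1, but its fields, the
smcibdev->cqs/num_cqs members, smc_wr_cq_handler and
smc_ib_get_least_used_cq are illustrative assumptions here. One CQ is
created per completion vector, and each new link takes the CQ with the
fewest users:

/* Illustrative only: the real struct smc_ib_cq from patch 1 may differ. */
struct smc_ib_cq {
        struct ib_cq    *ib_cq;         /* CQ bound to one completion vector */
        int             load;           /* links currently using this CQ */
};

/* Create one CQ per completion vector so completion interrupts and the
 * wr tx/rx tasklets can be spread over several cores.
 */
static int smc_ib_create_cqs_sketch(struct smc_ib_device *smcibdev)
{
        struct ib_cq_init_attr cqattr = { .cqe = SMC_MAX_CQE };
        int i, rc, num = smcibdev->ibdev->num_comp_vectors;

        smcibdev->cqs = kcalloc(num, sizeof(*smcibdev->cqs), GFP_KERNEL);
        if (!smcibdev->cqs)
                return -ENOMEM;

        for (i = 0; i < num; i++) {
                cqattr.comp_vector = i;         /* one vector (core) per CQ */
                smcibdev->cqs[i].ib_cq = ib_create_cq(smcibdev->ibdev,
                                                      smc_wr_cq_handler, NULL,
                                                      &smcibdev->cqs[i],
                                                      &cqattr);
                if (IS_ERR(smcibdev->cqs[i].ib_cq)) {
                        rc = PTR_ERR(smcibdev->cqs[i].ib_cq);
                        while (--i >= 0)
                                ib_destroy_cq(smcibdev->cqs[i].ib_cq);
                        kfree(smcibdev->cqs);
                        return rc;
                }
        }
        smcibdev->num_cqs = num;
        return 0;
}

/* Bind a new link to the least used CQ; its QP is then created with this
 * CQ as both send_cq and recv_cq, so its completion work stays on that
 * CQ's vector and core.
 */
static struct smc_ib_cq *smc_ib_get_least_used_cq(struct smc_ib_device *smcibdev)
{
        struct smc_ib_cq *cq = &smcibdev->cqs[0];
        int i;

        for (i = 1; i < smcibdev->num_cqs; i++)
                if (smcibdev->cqs[i].load < cq->load)
                        cq = &smcibdev->cqs[i];
        cq->load++;
        return cq;
}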

After spreading one CQ's work (4 link groups) over four CPU cores, the
softirq load is distributed across the cores:

04:26:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
04:26:26 PM  all   10.70    0.00   35.80    0.00    7.64   26.62    0.00    0.00    0.00   19.24
04:26:26 PM    0    0.00    0.00    0.00    0.00   16.33   50.00    0.00    0.00    0.00   33.67
04:26:26 PM    1    0.00    0.00    0.00    0.00   15.46   69.07    0.00    0.00    0.00   15.46
04:26:26 PM    2    0.00    0.00    0.00    0.00   13.13   39.39    0.00    0.00    0.00   47.47
04:26:26 PM    3    0.00    0.00    0.00    0.00   13.27   55.10    0.00    0.00    0.00   31.63
<snip>

Here is the benchmark with this patch set:

Test environment:
- CPU: Intel Xeon Platinum, 8 cores; memory: 32 GiB; NIC: Mellanox CX4.
- nginx + wrk HTTP benchmark.
- nginx: access_log disabled, keepalive_timeout and keepalive_requests
  increased, long-lived connections, returns 200 directly.
- wrk: 8 threads and 100, 200, 400 connections.

Benchmark result:

Conns/QPS         100        200        400
w/o patch   338502.49  359216.66  398167.16
w/  patch   677247.40  694193.70  812502.69
Ratio        +100.07%    +93.25%   +104.06%

This patch set shows a nearly 1x (about 100%) increase in QPS.

The benchmarks with 100, 200 and 400 connections use 1, 1 and 2 link
groups respectively. With a single link group, send/recv work is spread
over two cores; with more than one link group, it spreads over more
cores.

RFC Link: https://lore.kernel.org/netdev/YeRaSdg8TcNJsGBB@TonyMac-Alibaba/T/

These two patches are split out from the previous RFC; the
netlink-related patch is moved to the next patch set.

Tony Lu (2):
  net/smc: Introduce smc_ib_cq to bind link and cq
  net/smc: Multiple CQs per IB devices

 net/smc/smc_core.h |   2 +
 net/smc/smc_ib.c   | 132 ++++++++++++++++++++++++++++++++++++---------
 net/smc/smc_ib.h   |  15 ++++--
 net/smc/smc_wr.c   |  44 +++++++++------
 4 files changed, 148 insertions(+), 45 deletions(-)

Comments

Jason Gunthorpe Jan. 26, 2022, 3:29 p.m. UTC | #1
On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> Currently, SMC creates one CQ per IB device, and shares this cq among
> all the QPs of links. Meanwhile, this CQ is always binded to the first
> completion vector, the IRQ affinity of this vector binds to some CPU
> core.

As we said in the RFC discussion, this should be updated to use the
proper core APIs, not re-implement them in a driver like this.

Jason
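
For context: the core API referred to here is presumably the shared CQ
pool interface added to the RDMA core in 2020, ib_cq_pool_get() and
ib_cq_pool_put(), which already distributes CQs over completion vectors
and polls them on behalf of the consumer. A minimal, illustrative usage
sketch (not SMC code; the wrapper names are made up):

/* Illustrative wrappers only.  ib_cq_pool_get() lets the RDMA core pick
 * (or create) a shared CQ; with a negative comp_vector_hint the core
 * round-robins over the device's completion vectors, so consumers do
 * not manage vectors themselves.  nr_cqe is the number of CQEs the
 * caller will keep outstanding on this CQ.
 */
static struct ib_cq *get_pooled_cq(struct ib_device *ibdev, unsigned int nr_cqe)
{
        return ib_cq_pool_get(ibdev, nr_cqe, -1, IB_POLL_SOFTIRQ);
}

static void put_pooled_cq(struct ib_cq *cq, unsigned int nr_cqe)
{
        ib_cq_pool_put(cq, nr_cqe);
}

Note that pool CQs deliver completions through the ib_cqe/.done()
callback model, so moving SMC to them would also mean converting its
work-request completion handling.
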
Tony Lu Jan. 27, 2022, 3:19 a.m. UTC | #2
On Wed, Jan 26, 2022 at 11:29:16AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> > Currently, SMC creates one CQ per IB device, and shares this cq among
> > all the QPs of links. Meanwhile, this CQ is always binded to the first
> > completion vector, the IRQ affinity of this vector binds to some CPU
> > core.
> 
> As we said in the RFC discussion this should be updated to use the
> proper core APIS, not re-implement them in a driver like this.

Thanks for your advice. As I replied in the RFC, I will start working on
that once a clear plan is determined.

Glad to hear your advice. 

Tony Lu
Leon Romanovsky Jan. 27, 2022, 6:18 a.m. UTC | #3
On Thu, Jan 27, 2022 at 11:19:10AM +0800, Tony Lu wrote:
> On Wed, Jan 26, 2022 at 11:29:16AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> > > Currently, SMC creates one CQ per IB device, and shares this cq among
> > > all the QPs of links. Meanwhile, this CQ is always binded to the first
> > > completion vector, the IRQ affinity of this vector binds to some CPU
> > > core.
> > 
> > As we said in the RFC discussion this should be updated to use the
> > proper core APIS, not re-implement them in a driver like this.
> 
> Thanks for your advice. As I replied in the RFC, I will start to do that
> after a clear plan is determined.
> 
> Glad to hear your advice. 

Please do the right thing from the beginning.

You are improving code from 2017; it should be aligned with the core
code that has existed since 2020.

Thanks

> 
> Tony Lu
>
Tony Lu Jan. 27, 2022, 8:05 a.m. UTC | #4
On Thu, Jan 27, 2022 at 08:18:48AM +0200, Leon Romanovsky wrote:
> On Thu, Jan 27, 2022 at 11:19:10AM +0800, Tony Lu wrote:
> > On Wed, Jan 26, 2022 at 11:29:16AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Jan 26, 2022 at 09:01:39PM +0800, Tony Lu wrote:
> > > > Currently, SMC creates one CQ per IB device, and shares this cq among
> > > > all the QPs of links. Meanwhile, this CQ is always binded to the first
> > > > completion vector, the IRQ affinity of this vector binds to some CPU
> > > > core.
> > > 
> > > As we said in the RFC discussion this should be updated to use the
> > > proper core APIS, not re-implement them in a driver like this.
> > 
> > Thanks for your advice. As I replied in the RFC, I will start to do that
> > after a clear plan is determined.
> > 
> > Glad to hear your advice. 
> 
> Please do right thing from the beginning.
> 
> You are improving code from 2017 to be aligned with core code that
> exists from 2020.

Thanks for your reply. This patch set does not implement a brand-new
feature; it adjusts and recombines existing code and logic in order to
solve an existing real-world issue, which is why I am fixing it now.

The other task is to bring the code in line with the new API. I will do
that before a full discussion with Karsten.

Thank you,
Tony Lu
Karsten Graul Jan. 27, 2022, 2:59 p.m. UTC | #5
On 26/01/2022 14:01, Tony Lu wrote:
> Currently, SMC creates one CQ per IB device, and shares this cq among
> all the QPs of links. Meanwhile, this CQ is always binded to the first
> completion vector, the IRQ affinity of this vector binds to some CPU
> core. 

As discussed in the RFC thread, please come back with the complete fix.

Thanks for the work you are putting in here!

And thanks for the feedback from the RDMA side!