
[RFC Optimizing veth xsk performance 00/10]

Message ID 20230803140441.53596-1-huangjie.albert@bytedance.com (mailing list archive)
Series: [RFC Optimizing veth xsk performance 01/10] veth: Implement ethtool's get_ringparam() callback

Message

黄杰 Aug. 3, 2023, 2:04 p.m. UTC
AF_XDP is a kernel-bypass technology that can greatly improve performance.
However, for virtual devices like veth, even with AF_XDP sockets there are
still many additional software paths that consume CPU resources. This patch
series focuses on optimizing the performance of AF_XDP sockets for veth
virtual devices. Patches 1 to 4 are mainly preparatory work. Patch 5
introduces a TX queue and TX NAPI for packet transmission, patch 9
implements zero-copy, and patch 10 adds support for batch sending of IPv4
UDP packets. These optimizations significantly shorten the software path
and add checksum offload support.

I tested these features with the typical topology shown below:
veth<-->veth-peer                                    veth1-peer<--->veth1
	1       |                                                  |   7
	        |2                                                6|
	        |                                                  |
	      bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
                  3                    4                 5    
             (machine1)                              (machine2)    
AF_XDP sockets are attached to veth and veth1, and packets are sent out through the physical NIC (eth0).
veth:(172.17.0.2/24)
bridge:(172.17.0.1/24)
eth0:(192.168.156.66/24)

veth1:(172.17.0.2/24)
bridge1:(172.17.0.1/24)
eth1:(192.168.156.88/24)

After setting the default route, SNAT, and DNAT rules, we can run tests
to get the performance results.
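
For reference, a minimal sketch of the machine1 side of this setup, assuming the
veth endpoint lives in its own network namespace and standard iptables NAT is
used (names, namespace layout, and the exact rules are illustrative, not taken
from the patches; machine2 mirrors this with its own addresses):

ip netns add ns1
ip link add veth type veth peer name veth-peer
ip link set veth netns ns1
ip link add name bridge type bridge
ip link set veth-peer master bridge
ip link set veth-peer up
ip link set bridge up
ip addr add 172.17.0.1/24 dev bridge
ip netns exec ns1 ip addr add 172.17.0.2/24 dev veth
ip netns exec ns1 ip link set veth up
ip netns exec ns1 ip route add default via 172.17.0.1
# SNAT traffic leaving through the physical NIC, DNAT the UDP test port back to veth
iptables -t nat -A POSTROUTING -s 172.17.0.0/24 -o eth0 -j SNAT --to-source 192.168.156.66
iptables -t nat -A PREROUTING -i eth0 -p udp --dport 6002 -j DNAT --to-destination 172.17.0.2:6002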

Packets are sent from veth to veth1:
AF_XDP test tool:
link: https://github.com/cclinuxer/libxudp
send:(veth)
./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
recv:(veth1)
./objs/xudpperf recv --src 172.17.0.2:6002

UDP test tool: iperf3
send:(veth)
iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u
recv:(veth1)
iperf3 -s -p 6002

Performance (tested with veth and the libxdp lib):
UDP                              : 250 Kpps (with 100% cpu)
AF_XDP   no  zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with zerocopy + no batch : 540 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with  batch  +  zerocopy : 1.5 Mpps (with ksoftirqd 15% cpu)

With AF_XDP batching, the libxdp user-space program becomes the bottleneck,
so the softirq side does not reach its limit.

This is just an RFC patch series, and some code details still need 
further consideration. Please review this proposal.

thanks!

huangjie.albert (10):
  veth: Implement ethtool's get_ringparam() callback
  xsk: add dma_check_skip for  skipping dma check
  veth: add support for send queue
  xsk: add xsk_tx_completed_addr function
  veth: use send queue tx napi to xmit xsk tx desc
  veth: add ndo_xsk_wakeup callback for veth
  sk_buff: add destructor_arg_xsk_pool for zero copy
  xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX
  veth: support zero copy for af xdp
  veth: af_xdp tx batch support for ipv4 udp

 drivers/net/veth.c          | 729 +++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h      |   1 +
 include/net/xdp.h           |   1 +
 include/net/xdp_sock_drv.h  |   1 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk.c               |   6 +
 net/xdp/xsk_buff_pool.c     |   3 +-
 net/xdp/xsk_queue.h         |  11 +
 8 files changed, 751 insertions(+), 2 deletions(-)

Comments

Paolo Abeni Aug. 3, 2023, 2:20 p.m. UTC | #1
On Thu, 2023-08-03 at 22:04 +0800, huangjie.albert wrote:
> AF_XDP is a kernel bypass technology that can greatly improve performance.
> However, for virtual devices like veth, even with the use of AF_XDP sockets,
> there are still many additional software paths that consume CPU resources. 
> This patch series focuses on optimizing the performance of AF_XDP sockets 
> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work. 
> Patch 5 introduces tx queue and tx napi for packet transmission, while 
> patch 9 primarily implements zero-copy, and patch 10 adds support for 
> batch sending of IPv4 UDP packets. These optimizations significantly reduce 
> the software path and support checksum offload.
> 
> I tested those feature with
> A typical topology is shown below:
> veth<-->veth-peer                                    veth1-peer<--->veth1
> 	1       |                                                  |   7
> 	        |2                                                6|
> 	        |                                                  |
> 	      bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>                   3                    4                 5    
>              (machine1)                              (machine2)    
> AF_XDP socket is attach to veth and veth1. and send packets to physical NIC(eth0)
> veth:(172.17.0.2/24)
> bridge:(172.17.0.1/24)
> eth0:(192.168.156.66/24)
> 
> eth1(172.17.0.2/24)
> bridge1:(172.17.0.1/24)
> eth0:(192.168.156.88/24)
> 
> after set default route、snat、dnat. we can have a tests
> to get the performance results.
> 
> packets send from veth to veth1:
> af_xdp test tool:
> link:https://github.com/cclinuxer/libxudp
> send:(veth)
> ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
> recv:(veth1)
> ./objs/xudpperf recv --src 172.17.0.2:6002
> 
> udp test tool:iperf3
> send:(veth)
> iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u

Should be: '-b 0' otherwise you will experience additional overhead.

And you would likely want to pin processes and IRQs to ensure BH and
userspace run on different cores of the same NUMA node.
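
For example (core and IRQ numbers below are purely illustrative; check
/proc/interrupts for the actual eth0/mlx5 vectors on your system):

# run the userspace sender on core 2
taskset -c 2 ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
# steer the NIC queue interrupt (and thus the BH work) to core 3 on the same NUMA node
echo 8 > /proc/irq/<eth0-irq>/smp_affinity      # cpumask 0x8 = CPU 3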

Cheers,

Paolo
Jesper Dangaard Brouer Aug. 3, 2023, 3:01 p.m. UTC | #2
On 03/08/2023 16.04, huangjie.albert wrote:
> AF_XDP is a kernel bypass technology that can greatly improve performance.
> However, for virtual devices like veth, even with the use of AF_XDP sockets,
> there are still many additional software paths that consume CPU resources.
> This patch series focuses on optimizing the performance of AF_XDP sockets
> for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> Patch 5 introduces tx queue and tx napi for packet transmission, while
> patch 9 primarily implements zero-copy, and patch 10 adds support for
> batch sending of IPv4 UDP packets. These optimizations significantly reduce
> the software path and support checksum offload.
> 
> I tested those feature with
> A typical topology is shown below:
> veth<-->veth-peer                                    veth1-peer<--->veth1
> 	1       |                                                  |   7
> 	        |2                                                6|
> 	        |                                                  |
> 	      bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
>                    3                    4                 5
>               (machine1)                              (machine2)
> AF_XDP socket is attach to veth and veth1. and send packets to physical NIC(eth0)
> veth:(172.17.0.2/24)
> bridge:(172.17.0.1/24)
> eth0:(192.168.156.66/24)
> 
> eth1(172.17.0.2/24)
> bridge1:(172.17.0.1/24)
> eth0:(192.168.156.88/24)
> 
> after set default route、snat、dnat. we can have a tests
> to get the performance results.
> 
> packets send from veth to veth1:
> af_xdp test tool:
> link:https://github.com/cclinuxer/libxudp
> send:(veth)
> ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
> recv:(veth1)
> ./objs/xudpperf recv --src 172.17.0.2:6002
> 
> udp test tool:iperf3
> send:(veth)
> iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u
> recv:(veth1)
> iperf3 -s -p 6002
> 
> performance:
> performance:(test weth libxdp lib)
> UDP                              : 250 Kpps (with 100% cpu)
> AF_XDP   no  zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
> AF_XDP  with zerocopy + no batch : 540 Kpps (with ksoftirqd 100% cpu)
> AF_XDP  with  batch  +  zerocopy : 1.5 Mpps (with ksoftirqd 15% cpu)
> 
> With af_xdp batch, the libxdp user-space program reaches a bottleneck.

Do you mean libxdp [1] or libxudp ?

[1] https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp

> Therefore, the softirq did not reach the limit.
> 
> This is just an RFC patch series, and some code details still need
> further consideration. Please review this proposal.
>

I find this performance work interesting as we have customer requests
(via Maryam (cc)) to improve AF_XDP performance both native and on veth.

Our benchmark is stored at:
  https://github.com/maryamtahhan/veth-benchmark

Great to see other companies also interested in this area.

--Jesper

> thanks!
> 
> huangjie.albert (10):
>    veth: Implement ethtool's get_ringparam() callback
>    xsk: add dma_check_skip for  skipping dma check
>    veth: add support for send queue
>    xsk: add xsk_tx_completed_addr function
>    veth: use send queue tx napi to xmit xsk tx desc
>    veth: add ndo_xsk_wakeup callback for veth
>    sk_buff: add destructor_arg_xsk_pool for zero copy
>    xdp: add xdp_mem_type MEM_TYPE_XSK_BUFF_POOL_TX
>    veth: support zero copy for af xdp
>    veth: af_xdp tx batch support for ipv4 udp
> 
>   drivers/net/veth.c          | 729 +++++++++++++++++++++++++++++++++++-
>   include/linux/skbuff.h      |   1 +
>   include/net/xdp.h           |   1 +
>   include/net/xdp_sock_drv.h  |   1 +
>   include/net/xsk_buff_pool.h |   1 +
>   net/xdp/xsk.c               |   6 +
>   net/xdp/xsk_buff_pool.c     |   3 +-
>   net/xdp/xsk_queue.h         |  11 +
>   8 files changed, 751 insertions(+), 2 deletions(-)
>
黄杰 Aug. 4, 2023, 4:16 a.m. UTC | #3
Paolo Abeni <pabeni@redhat.com> wrote on Thu, Aug 3, 2023, at 22:20:
>
> On Thu, 2023-08-03 at 22:04 +0800, huangjie.albert wrote:
> > AF_XDP is a kernel bypass technology that can greatly improve performance.
> > However, for virtual devices like veth, even with the use of AF_XDP sockets,
> > there are still many additional software paths that consume CPU resources.
> > This patch series focuses on optimizing the performance of AF_XDP sockets
> > for veth virtual devices. Patches 1 to 4 mainly involve preparatory work.
> > Patch 5 introduces tx queue and tx napi for packet transmission, while
> > patch 9 primarily implements zero-copy, and patch 10 adds support for
> > batch sending of IPv4 UDP packets. These optimizations significantly reduce
> > the software path and support checksum offload.
> >
> > I tested those feature with
> > A typical topology is shown below:
> > veth<-->veth-peer                                    veth1-peer<--->veth1
> >       1       |                                                  |   7
> >               |2                                                6|
> >               |                                                  |
> >             bridge<------->eth0(mlnx5)- switch -eth1(mlnx5)<--->bridge1
> >                   3                    4                 5
> >              (machine1)                              (machine2)
> > AF_XDP socket is attach to veth and veth1. and send packets to physical NIC(eth0)
> > veth:(172.17.0.2/24)
> > bridge:(172.17.0.1/24)
> > eth0:(192.168.156.66/24)
> >
> > eth1(172.17.0.2/24)
> > bridge1:(172.17.0.1/24)
> > eth0:(192.168.156.88/24)
> >
> > after set default route、snat、dnat. we can have a tests
> > to get the performance results.
> >
> > packets send from veth to veth1:
> > af_xdp test tool:
> > link:https://github.com/cclinuxer/libxudp
> > send:(veth)
> > ./objs/xudpperf send --dst 192.168.156.88:6002 -l 1300
> > recv:(veth1)
> > ./objs/xudpperf recv --src 172.17.0.2:6002
> >
> > udp test tool:iperf3
> > send:(veth)
> > iperf3 -c 192.168.156.88 -p 6002 -l 1300 -b 60G -u
>
> Should be: '-b 0' otherwise you will experience additional overhead.
>

With -b 0:
Performance (tested with veth and the libxdp lib):
UDP                              : 320 Kpps (with 100% cpu)
AF_XDP   no  zerocopy + no batch : 480 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with zerocopy + no batch : 540 Kpps (with ksoftirqd 100% cpu)
AF_XDP  with  batch  +  zerocopy : 1.5 Mpps (with ksoftirqd 15% cpu)

thanks.

> And you would likely pin processes and irqs to ensure BH and US run on
> different cores of the same numa node.
>
> Cheers,
>
> Paolo
>