[net-next,v4,0/3] net: atlantic: Add XDP support

Message ID 20220408181714.15354-1-ap420073@gmail.com (mailing list archive)

Message

Taehee Yoo April 8, 2022, 6:17 p.m. UTC
This patchset adds multi-buffer XDP support to the atlantic driver.

The first patch implements the XDP control plane.
It adds aq_xdp(), the .ndo_bpf callback.

The second patch implements the XDP data plane.
XDP_TX, XDP_DROP, and XDP_PASS are supported.
__aq_ring_xdp_clean() is added to receive packets and run the XDP program.
aq_nic_xmit_xdpf() is added to send packets via XDP.

The third patch implements the .ndo_xdp_xmit callback.
aq_xdp_xmit() is added to send redirected packets, and it internally
calls aq_nic_xmit_xdpf().
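
For illustration only, here is a minimal sketch of how such a .ndo_bpf
dispatcher is typically shaped (aq_xdp_setup() is a hypothetical helper
name; the series may structure this differently):

static int aq_xdp(struct net_device *ndev, struct netdev_bpf *bpf)
{
        switch (bpf->command) {
        case XDP_SETUP_PROG:
                /* hypothetical helper: swap the program pointer and
                 * reconfigure the rings if the attach state changes
                 */
                return aq_xdp_setup(ndev, bpf->prog, bpf->extack);
        default:
                return -EINVAL;
        }
}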

The memory model is MEM_TYPE_PAGE_ORDER0, so rx pages are not reused for
XDP_TX, XDP_PASS, and XDP_REDIRECT.

By default, the maximum rx frame size is 2K.
If an XDP program is attached, the size is changed to about 3K.
Rx pages can be reused for XDP_DROP and XDP_ABORTED.

The atlantic driver has an AQ_CFG_RX_PAGEORDER option; it is always 0
while an XDP program is attached.
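
As a rough sketch of the rx-side registration this implies (the helper name
and the ring fields are assumptions for illustration):

/* Illustrative: register the rxq info with the MEM_TYPE_PAGE_ORDER0 model,
 * so rx pages come straight from the page allocator and are freed/replaced
 * instead of being recycled after XDP_TX, XDP_PASS or XDP_REDIRECT.
 */
static int aq_ring_reg_xdp_rxq(struct aq_ring_s *ring,
                               struct net_device *ndev, u32 idx)
{
        int err;

        err = xdp_rxq_info_reg(&ring->xdp_rxq, ndev, idx, 0);
        if (err < 0)
                return err;

        err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
                                         MEM_TYPE_PAGE_ORDER0, NULL);
        if (err < 0)
                xdp_rxq_info_unreg(&ring->xdp_rxq);

        return err;
}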

LRO will be disabled if the attached XDP program supports only a single buffer.
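
A sketch of that single-buffer check at attach time might look like this
(the helper name and the exact hook are assumptions, not taken from the
series):

/* Illustrative: a program without multi-buffer support cannot see frags,
 * so hardware LRO has to be turned off while it is attached.
 */
static void aq_xdp_check_lro(struct net_device *ndev, struct bpf_prog *prog)
{
        if (prog && !prog->aux->xdp_has_frags &&
            (ndev->features & NETIF_F_LRO)) {
                netdev_info(ndev,
                            "single-buffer XDP attached, disabling LRO\n");
                ndev->features &= ~NETIF_F_LRO;
        }
}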

The AQC chip supports 32 queues and 8 irq vectors.
There are two configuration options:
1. up to 8 cores, with a maximum of 4 tx queues per core.
2. up to 4 cores, with a maximum of 8 tx queues per core.

Like other drivers, the extra tx queues could be dedicated to XDP_TX and
XDP_REDIRECT traffic, in which case no tx_lock would be needed.
But this patchset doesn't use that strategy, because the cost of looking up
the hardware tx queue index is too high.
So, a tx_lock is used in aq_nic_xmit_xdpf().
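
A minimal sketch of that locking (only the aq_nic_xmit_xdpf() name comes
from the series; the lock field and the inner helper are assumptions):

/* Illustrative: the hw tx ring is shared with the regular stack path, so
 * descriptor writes from XDP are serialized with a per-ring lock.
 */
static int aq_nic_xmit_xdpf(struct aq_nic_s *self, struct aq_ring_s *ring,
                            struct xdp_frame *xdpf)
{
        int err;

        spin_lock(&ring->lock);                         /* assumed field */
        err = __aq_nic_xmit_xdpf(self, ring, xdpf);     /* hypothetical */
        spin_unlock(&ring->lock);

        return err;
}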

single-core, single queue, 80% cpu utilization.

  30.75%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
  10.35%  [kernel]                  [k] aq_hw_read_reg <---------- here
   4.38%  [kernel]                  [k] get_page_from_freelist

single-core, 8 queues, 100% cpu utilization, half PPS.

  45.56%  [kernel]                  [k] aq_hw_read_reg <---------- here
  17.58%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
   4.72%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive

Performance results (64-byte packets)
1. XDP_TX
  a. xdp_generic, single core
    - 2.5Mpps, 100% cpu
  b. xdp_driver, single core
    - 4.5Mpps, 80% cpu
  c. xdp_generic, 8 cores (hyper-threading)
    - 6.3Mpps, 5~10% cpu
  d. xdp_driver, 8 cores (hyper-threading)
    - 6.3Mpps, 5% cpu

2. XDP_REDIRECT
  a. xdp_generic, single core
    - 2.3Mpps
  b. xdp_driver, single core
    - 4.5Mpps

v4:
 - Fix compile warning

v3:
 - Change wrong PPS performance result 40% -> 80% in single
   core(Intel i3-12100)
 - Separate aq_nic_map_xdp() from aq_nic_map_skb()
 - Drop multi buffer packets if single buffer XDP is attached
 - Disable LRO when single buffer XDP is attached
 - Use xdp_get_{frame/buff}_len()

v2:
 - Do not use inline in C file

Taehee Yoo (3):
  net: atlantic: Implement xdp control plane
  net: atlantic: Implement xdp data plane
  net: atlantic: Implement .ndo_xdp_xmit handler

 .../net/ethernet/aquantia/atlantic/aq_cfg.h   |   1 +
 .../ethernet/aquantia/atlantic/aq_ethtool.c   |   8 +
 .../net/ethernet/aquantia/atlantic/aq_main.c  |  87 ++++
 .../net/ethernet/aquantia/atlantic/aq_main.h  |   2 +
 .../net/ethernet/aquantia/atlantic/aq_nic.c   | 137 ++++++
 .../net/ethernet/aquantia/atlantic/aq_nic.h   |   5 +
 .../net/ethernet/aquantia/atlantic/aq_ring.c  | 415 ++++++++++++++++--
 .../net/ethernet/aquantia/atlantic/aq_ring.h  |  17 +
 .../net/ethernet/aquantia/atlantic/aq_vec.c   |  23 +-
 .../net/ethernet/aquantia/atlantic/aq_vec.h   |   6 +
 .../aquantia/atlantic/hw_atl/hw_atl_a0.c      |   6 +-
 .../aquantia/atlantic/hw_atl/hw_atl_b0.c      |  10 +-
 12 files changed, 675 insertions(+), 42 deletions(-)

Comments

Paolo Abeni April 12, 2022, 7:17 a.m. UTC | #1
On Fri, 2022-04-08 at 18:17 +0000, Taehee Yoo wrote:
> This patchset adds multi-buffer XDP support to the atlantic driver.
> 
> The first patch implements the XDP control plane.
> It adds aq_xdp(), the .ndo_bpf callback.
> 
> The second patch implements the XDP data plane.
> XDP_TX, XDP_DROP, and XDP_PASS are supported.
> __aq_ring_xdp_clean() is added to receive packets and run the XDP program.
> aq_nic_xmit_xdpf() is added to send packets via XDP.
> 
> The third patch implements the .ndo_xdp_xmit callback.
> aq_xdp_xmit() is added to send redirected packets, and it internally
> calls aq_nic_xmit_xdpf().
> 
> The memory model is MEM_TYPE_PAGE_ORDER0, so rx pages are not reused for
> XDP_TX, XDP_PASS, and XDP_REDIRECT.
> 
> By default, the maximum rx frame size is 2K.
> If an XDP program is attached, the size is changed to about 3K.
> Rx pages can be reused for XDP_DROP and XDP_ABORTED.
> 
> The atlantic driver has an AQ_CFG_RX_PAGEORDER option; it is always 0
> while an XDP program is attached.
> 
> LRO will be disabled if the attached XDP program supports only a single buffer.
> 
> The AQC chip supports 32 queues and 8 irq vectors.
> There are two configuration options:
> 1. up to 8 cores, with a maximum of 4 tx queues per core.
> 2. up to 4 cores, with a maximum of 8 tx queues per core.
> 
> Like other drivers, the extra tx queues could be dedicated to XDP_TX and
> XDP_REDIRECT traffic, in which case no tx_lock would be needed.
> But this patchset doesn't use that strategy, because the cost of looking up
> the hardware tx queue index is too high.
> So, a tx_lock is used in aq_nic_xmit_xdpf().
> 
> single-core, single queue, 80% cpu utilization.
> 
>   30.75%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
>   10.35%  [kernel]                  [k] aq_hw_read_reg <---------- here
>    4.38%  [kernel]                  [k] get_page_from_freelist
> 
> single-core, 8 queues, 100% cpu utilization, half PPS.
> 
>   45.56%  [kernel]                  [k] aq_hw_read_reg <---------- here
>   17.58%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
>    4.72%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive
> 
> Performance results (64-byte packets)
> 1. XDP_TX
>   a. xdp_generic, single core
>     - 2.5Mpps, 100% cpu
>   b. xdp_driver, single core
>     - 4.5Mpps, 80% cpu
>   c. xdp_generic, 8 cores (hyper-threading)
>     - 6.3Mpps, 5~10% cpu
>   d. xdp_driver, 8 cores (hyper-threading)
>     - 6.3Mpps, 5% cpu
> 
> 2. XDP_REDIRECT
>   a. xdp_generic, single core
>     - 2.3Mpps
>   b. xdp_driver, single core
>     - 4.5Mpps
> 
> v4:
>  - Fix compile warning
> 
> v3:
>  - Change wrong PPS performance result 40% -> 80% in single
>    core(Intel i3-12100)
>  - Separate aq_nic_map_xdp() from aq_nic_map_skb()
>  - Drop multi buffer packets if single buffer XDP is attached
>  - Disable LRO when single buffer XDP is attached
>  - Use xdp_get_{frame/buff}_len()
> 
> v2:
>  - Do not use inline in C file
> 
> Taehee Yoo (3):
>   net: atlantic: Implement xdp control plane
>   net: atlantic: Implement xdp data plane
>   net: atlantic: Implement .ndo_xdp_xmit handler
> 
>  .../net/ethernet/aquantia/atlantic/aq_cfg.h   |   1 +
>  .../ethernet/aquantia/atlantic/aq_ethtool.c   |   8 +
>  .../net/ethernet/aquantia/atlantic/aq_main.c  |  87 ++++
>  .../net/ethernet/aquantia/atlantic/aq_main.h  |   2 +
>  .../net/ethernet/aquantia/atlantic/aq_nic.c   | 137 ++++++
>  .../net/ethernet/aquantia/atlantic/aq_nic.h   |   5 +
>  .../net/ethernet/aquantia/atlantic/aq_ring.c  | 415 ++++++++++++++++--
>  .../net/ethernet/aquantia/atlantic/aq_ring.h  |  17 +
>  .../net/ethernet/aquantia/atlantic/aq_vec.c   |  23 +-
>  .../net/ethernet/aquantia/atlantic/aq_vec.h   |   6 +
>  .../aquantia/atlantic/hw_atl/hw_atl_a0.c      |   6 +-
>  .../aquantia/atlantic/hw_atl/hw_atl_b0.c      |  10 +-
>  12 files changed, 675 insertions(+), 42 deletions(-)
> 
@Igor: this should address your concerns on v2, could you please have a
look?

Thanks!

Paolo
Igor Russkikh April 13, 2022, 7:52 a.m. UTC | #2
> v4:
>  - Fix compile warning
> 
> v3:
>  - Change wrong PPS performance result 40% -> 80% in single
>    core(Intel i3-12100)
>  - Separate aq_nic_map_xdp() from aq_nic_map_skb()
>  - Drop multi buffer packets if single buffer XDP is attached
>  - Disable LRO when single buffer XDP is attached
>  - Use xdp_get_{frame/buff}_len()

Hi Taehee, thanks for taking care of that!

Reviewed-by: Igor Russkikh <irusskikh@marvell.com>

A small note about the selection of the 3K packet size for XDP.
It's a kind of compromise, I think, because with the common 1.4K MTU we'll
waste at least 2K bytes per packet.

I was thinking it would be possible to reuse the existing page flipping
technique together with a higher page_order, to keep the default 2K
fragment size. E.g.
( 256 (xdp_head) + 2K (pkt frag) ) x 3 (flips) = ~7K

Meaning we can allocate 8K (page_order=1) pages and fit three xdp packets
into each, wasting only about 1K per three packets.
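
Spelling that budget out (assuming 2K fragments, 256 bytes of XDP headroom
and 8K order-1 pages):

  per-packet slot:  256 + 2048   = 2304 bytes
  three slots:      3 x 2304     = 6912 bytes (~7K)
  leftover in 8K:   8192 - 6912  = 1280 bytes (roughly the ~1K noted above)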

But it's just an idea for future optimization.

Regards,
  Igor
Taehee Yoo April 13, 2022, 9:59 a.m. UTC | #3
On 2022. 4. 13. 4:52 PM, Igor Russkikh wrote:

Hi Igor,

Thank you so much for your review!

 >
 >
 >> v4:
 >>   - Fix compile warning
 >>
 >> v3:
 >>   - Change wrong PPS performance result 40% -> 80% in single
 >>     core(Intel i3-12100)
 >>   - Separate aq_nic_map_xdp() from aq_nic_map_skb()
 >>   - Drop multi buffer packets if single buffer XDP is attached
 >>   - Disable LRO when single buffer XDP is attached
 >>   - Use xdp_get_{frame/buff}_len()
 >
 > Hi Taehee, thanks for taking care of that!
 >
 > Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
 >
 > A small note about the selection of the 3K packet size for XDP.
 > It's a kind of compromise, I think, because with the common 1.4K MTU
 > we'll waste at least 2K bytes per packet.
 >
 > I was thinking it would be possible to reuse the existing page flipping
 > technique together with a higher page_order, to keep the default 2K
 > fragment size. E.g.
 > ( 256 (xdp_head) + 2K (pkt frag) ) x 3 (flips) = ~7K
 >
 > Meaning we can allocate 8K (page_order=1) pages and fit three xdp
 > packets into each, wasting only about 1K per three packets.
 >
 > But it's just an idea for future optimization.
 >

Yes, I fully agree with your idea.
When I developed an initial version of this patchset, I tried that idea.
I expected it to reduce CPU utilization (not memory usage), but there was
no difference, because the page_ref_{inc/dec}() cost is too high.
So, if we switch from MEM_TYPE_PAGE_ORDER0 to MEM_TYPE_PAGE_SHARED, I think
we should use a slightly different flipping strategy, like ixgbe's.
If so, we would achieve both memory and CPU optimization.
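
For context, a rough sketch of the kind of ixgbe-style reuse I mean (every
name below is illustrative, taken neither from atlantic nor from ixgbe
verbatim):

/* Illustrative: flip the buffer offset inside the page and keep a local
 * bias counter, so the atomic page refcount is touched in bulk instead of
 * with page_ref_inc()/page_ref_dec() per packet.
 */
struct rx_buf {                         /* hypothetical buffer descriptor */
        struct page *page;
        unsigned int page_offset;
        unsigned short pagecnt_bias;
};

static bool rx_buf_try_reuse(struct rx_buf *buf, unsigned int truesize)
{
        /* Only reuse pages this ring still owns exclusively. */
        if (page_ref_count(buf->page) - buf->pagecnt_bias > 1)
                return false;

        buf->page_offset ^= truesize;   /* flip to the other half */

        /* Refill the bias in one bulk atomic op once it runs low. */
        if (unlikely(buf->pagecnt_bias == 1)) {
                page_ref_add(buf->page, USHRT_MAX - 1);
                buf->pagecnt_bias = USHRT_MAX;
        }

        return true;
}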

Thanks a lot,
Taehee Yoo

 > Regards,
 >    Igor