Message ID | 20220408181714.15354-1-ap420073@gmail.com (mailing list archive)
---|---
Series | net: atlantic: Add XDP support
On Fri, 2022-04-08 at 18:17 +0000, Taehee Yoo wrote:
> This patchset makes atlantic support multi-buffer XDP.
>
> The first patch implements the control plane of XDP.
> aq_xdp(), the callback of .ndo_bpf, is added.
>
> The second patch implements the data plane of XDP.
> XDP_TX, XDP_DROP, and XDP_PASS are supported.
> __aq_ring_xdp_clean() is added to receive and execute the XDP program.
> aq_nic_xmit_xdpf() is added to send packets via XDP.
>
> The third patch implements the .ndo_xdp_xmit callback.
> aq_xdp_xmit() is added to send redirected packets, and it internally
> calls aq_nic_xmit_xdpf().
>
> The memory model is MEM_TYPE_PAGE_ORDER0, so the rx page is not reused
> for XDP_TX, XDP_PASS, and XDP_REDIRECT.
>
> By default, the maximum rx frame size is 2K.
> If XDP is attached, the size is changed to about 3K.
> The page can be reused for XDP_DROP and XDP_ABORTED.
>
> The atlantic driver has the AQ_CFG_RX_PAGEORDER option, and it will
> always be 0 if XDP is attached.
>
> LRO will be disabled if the XDP program supports only a single buffer.
>
> The AQC chip supports 32 multi-queues and 8 vectors (irq).
> There are two options:
> 1. under 8 cores, a maximum of 4 tx queues per core.
> 2. under 4 cores, a maximum of 8 tx queues per core.
>
> Like in other drivers, these tx queues could be used only as XDP_TX /
> XDP_REDIRECT queues; in that case no tx_lock would be needed.
> But this patchset doesn't use that strategy because the cost of getting
> the hardware tx queue index is too high.
> So, tx_lock is used in aq_nic_xmit_xdpf().
>
> single-core, single queue, 80% cpu utilization.
>
>   30.75%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
>   10.35%  [kernel]                  [k] aq_hw_read_reg <---------- here
>    4.38%  [kernel]                  [k] get_page_from_freelist
>
> single-core, 8 queues, 100% cpu utilization, half PPS.
>
>   45.56%  [kernel]                  [k] aq_hw_read_reg <---------- here
>   17.58%  bpf_prog_xxx_xdp_prog_tx  [k] bpf_prog_xxx_xdp_prog_tx
>    4.72%  [kernel]                  [k] hw_atl_b0_hw_ring_rx_receive
>
> Performance result (64 byte)
> 1. XDP_TX
>   a. xdp_generic, single core
>     - 2.5Mpps, 100% cpu
>   b. xdp_driver, single core
>     - 4.5Mpps, 80% cpu
>   c. xdp_generic, 8 cores (hyper thread)
>     - 6.3Mpps, 5~10% cpu
>   d. xdp_driver, 8 cores (hyper thread)
>     - 6.3Mpps, 5% cpu
>
> 2. XDP_REDIRECT
>   a. xdp_generic, single core
>     - 2.3Mpps
>   b. xdp_driver, single core
>     - 4.5Mpps
>
> v4:
>  - Fix compile warning
>
> v3:
>  - Change wrong PPS performance result 40% -> 80% in single
>    core (Intel i3-12100)
>  - Separate aq_nic_map_xdp() from aq_nic_map_skb()
>  - Drop multi-buffer packets if single-buffer XDP is attached
>  - Disable LRO when single-buffer XDP is attached
>  - Use xdp_get_{frame/buff}_len()
>
> v2:
>  - Do not use inline in C file
>
> Taehee Yoo (3):
>   net: atlantic: Implement xdp control plane
>   net: atlantic: Implement xdp data plane
>   net: atlantic: Implement .ndo_xdp_xmit handler
>
>  .../net/ethernet/aquantia/atlantic/aq_cfg.h   |   1 +
>  .../ethernet/aquantia/atlantic/aq_ethtool.c   |   8 +
>  .../net/ethernet/aquantia/atlantic/aq_main.c  |  87 ++++
>  .../net/ethernet/aquantia/atlantic/aq_main.h  |   2 +
>  .../net/ethernet/aquantia/atlantic/aq_nic.c   | 137 ++++++
>  .../net/ethernet/aquantia/atlantic/aq_nic.h   |   5 +
>  .../net/ethernet/aquantia/atlantic/aq_ring.c  | 415 ++++++++++++++++--
>  .../net/ethernet/aquantia/atlantic/aq_ring.h  |  17 +
>  .../net/ethernet/aquantia/atlantic/aq_vec.c   |  23 +-
>  .../net/ethernet/aquantia/atlantic/aq_vec.h   |   6 +
>  .../aquantia/atlantic/hw_atl/hw_atl_a0.c      |   6 +-
>  .../aquantia/atlantic/hw_atl/hw_atl_b0.c      |  10 +-
>  12 files changed, 675 insertions(+), 42 deletions(-)

@Igor: this should address your concerns on v2, could you please have a look?

Thanks!

Paolo
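[Editor's note] To make the data-plane description in the cover letter more concrete, here is a minimal sketch of per-packet XDP verdict handling. It is an illustration under assumptions, not the patchset's code: run_xdp_sketch() and sketch_xmit_xdpf() are hypothetical names standing in for the __aq_ring_xdp_clean() / aq_nic_xmit_xdpf() pair described above.

```c
/*
 * Illustrative sketch only -- not the atlantic patchset's actual code.
 * run_xdp_sketch() and sketch_xmit_xdpf() are hypothetical names.
 */
#include <linux/bpf.h>
#include <linux/bpf_trace.h>
#include <linux/filter.h>
#include <linux/netdevice.h>
#include <net/xdp.h>

/* Hypothetical transmit helper; assumed to take the tx_lock internally. */
int sketch_xmit_xdpf(struct net_device *dev, struct xdp_frame *xdpf);

static u32 run_xdp_sketch(struct net_device *dev, struct bpf_prog *prog,
			  struct xdp_buff *xdp)
{
	u32 act = bpf_prog_run_xdp(prog, xdp);

	switch (act) {
	case XDP_PASS:		/* caller builds an skb and hands it to the stack */
		return act;
	case XDP_TX: {		/* bounce the frame back out of the same NIC */
		struct xdp_frame *xdpf = xdp_convert_buff_to_frame(xdp);

		if (!xdpf || sketch_xmit_xdpf(dev, xdpf) < 0)
			goto out_drop;
		return act;
	}
	case XDP_REDIRECT:	/* hand the buffer to another device or map */
		if (xdp_do_redirect(dev, xdp, prog) < 0)
			goto out_drop;
		return act;
	default:
	case XDP_ABORTED:
		trace_xdp_exception(dev, prog, act);
		fallthrough;
	case XDP_DROP:
out_drop:
		/* with MEM_TYPE_PAGE_ORDER0, only DROP/ABORTED let the
		 * driver recycle the rx page, as noted in the cover letter */
		return XDP_DROP;
	}
}
```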
> v4:
>  - Fix compile warning
>
> v3:
>  - Change wrong PPS performance result 40% -> 80% in single
>    core (Intel i3-12100)
>  - Separate aq_nic_map_xdp() from aq_nic_map_skb()
>  - Drop multi-buffer packets if single-buffer XDP is attached
>  - Disable LRO when single-buffer XDP is attached
>  - Use xdp_get_{frame/buff}_len()

Hi Taehee, thanks for taking care of that!

Reviewed-by: Igor Russkikh <irusskikh@marvell.com>

A small note about the selection of the 3K packet size for XDP.
It's a kind of compromise, I think, because with the common 1.4K MTU we'll
waste a minimum of 2K bytes per packet.

I was thinking it would be possible to reuse the existing page flipping
technique together with a higher page_order, to keep the default 2K
fragment size. E.g.

  ( 256(xdp_head) + 2K(pkt frag) ) x 3 (flips) = ~7K

Meaning we can allocate 8K (page_order=1) pages and fit three XDP packets
into each, wasting only ~1K per three packets.

But it's just an idea for future optimization.

Regards,
Igor
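[Editor's note] A quick, purely illustrative check of the arithmetic above; the SKETCH_* names are invented for this note and are not atlantic driver constants.

```c
/* Back-of-the-envelope check of the order-1 layout suggested above. */
#define SKETCH_PAGE_ORDER	1
#define SKETCH_PAGE_SIZE	(4096 << SKETCH_PAGE_ORDER)		  /* 8192 B  */
#define SKETCH_XDP_HEADROOM	256
#define SKETCH_FRAG_SIZE	2048
#define SKETCH_SLOT_SIZE	(SKETCH_XDP_HEADROOM + SKETCH_FRAG_SIZE)  /* 2304 B  */
#define SKETCH_SLOTS_PER_PAGE	(SKETCH_PAGE_SIZE / SKETCH_SLOT_SIZE)	  /* 3 slots */
#define SKETCH_WASTE_PER_PAGE	\
	(SKETCH_PAGE_SIZE - SKETCH_SLOTS_PER_PAGE * SKETCH_SLOT_SIZE)	  /* 1280 B  */
```

So an 8K page holds three ~2.25K slots with roughly 1.25K left over, in line with the "~1K per three packets" estimate above.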
On 2022-04-13 4:52 PM, Igor Russkikh wrote:

Hi Igor,
Thank you so much for your review!

>> v4:
>>  - Fix compile warning
>>
>> v3:
>>  - Change wrong PPS performance result 40% -> 80% in single
>>    core (Intel i3-12100)
>>  - Separate aq_nic_map_xdp() from aq_nic_map_skb()
>>  - Drop multi-buffer packets if single-buffer XDP is attached
>>  - Disable LRO when single-buffer XDP is attached
>>  - Use xdp_get_{frame/buff}_len()
>
> Hi Taehee, thanks for taking care of that!
>
> Reviewed-by: Igor Russkikh <irusskikh@marvell.com>
>
> A small note about the selection of the 3K packet size for XDP.
> It's a kind of compromise, I think, because with the common 1.4K MTU we'll
> waste a minimum of 2K bytes per packet.
>
> I was thinking it would be possible to reuse the existing page flipping
> technique together with a higher page_order, to keep the default 2K
> fragment size. E.g.
>
>   ( 256(xdp_head) + 2K(pkt frag) ) x 3 (flips) = ~7K
>
> Meaning we can allocate 8K (page_order=1) pages and fit three XDP packets
> into each, wasting only ~1K per three packets.
>
> But it's just an idea for future optimization.
>

Yes, I fully agree with your idea.
When I developed an initial version of this patchset, I simply tried that
idea. I expected it to reduce CPU utilization (not memory usage), but there
was no difference because the page_ref_{inc/dec}() cost is too high.
So, if we try to switch from MEM_TYPE_PAGE_ORDER0 to MEM_TYPE_PAGE_SHARED,
I think we should use a slightly different flipping strategy, like ixgbe.
If so, we would achieve both memory and CPU optimization.

Thanks a lot,
Taehee Yoo

> Regards,
> Igor
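[Editor's note] For context on the "ixgbe-like flipping strategy" Taehee mentions, here is a simplified, hypothetical sketch of how such page reuse is commonly structured: a driver-owned refcount bias plus an offset flip, so the per-packet cost is a compare rather than an atomic page_ref_inc()/page_ref_dec(). It is loosely modeled on that approach, not the ixgbe code itself, and the struct/helper names are invented for this note.

```c
#include <linux/limits.h>
#include <linux/mm.h>
#include <linux/page_ref.h>
#include <linux/types.h>

/* Hypothetical rx buffer bookkeeping; names are invented for this sketch. */
struct sketch_rx_buffer {
	struct page	*page;
	unsigned int	page_offset;
	u16		pagecnt_bias;	/* references the driver still "owns" */
};

static bool sketch_try_flip_and_reuse(struct sketch_rx_buffer *buf,
				      unsigned int truesize)
{
	/* The stack still holds a reference to the other half: don't reuse. */
	if (page_ref_count(buf->page) - buf->pagecnt_bias > 1)
		return false;

	/* Flip between the two halves of the page for the next packet. */
	buf->page_offset ^= truesize;

	/* Replenish references in bulk instead of inc/dec per packet. */
	if (unlikely(buf->pagecnt_bias == 1)) {
		page_ref_add(buf->page, USHRT_MAX - 1);
		buf->pagecnt_bias = USHRT_MAX;
	}

	return true;
}
```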