
[net-next,v2,0/8] bpf: cpumap: enable GRO for XDP_PASS frames

Message ID: 20250107152940.26530-1-aleksander.lobakin@intel.com
Series: bpf: cpumap: enable GRO for XDP_PASS frames

Message

Alexander Lobakin Jan. 7, 2025, 3:29 p.m. UTC
Several months ago, I was looking through my old XDP hints tree[0] to
check whether some patches not directly related to hints could be sent
standalone. Roughly at the same time, Daniel appeared and asked[1] about
GRO for cpumap from that tree.

Currently, cpumap uses its own kthread, which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens.
Even though we can't currently pass checksum status from the drivers,
tests confirm that in many cases GRO still performs better than listified
Rx without aggregation.

In order to enable GRO in cpumap, we need to do the following:

* patches 1-2: decouple the GRO struct from the NAPI struct and allow
  using it outside of a NAPI entity within the kernel core code;
* patch 3: switch cpumap from netif_receive_skb_list() to
  gro_receive_skb() (see the sketch after this list).
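
To make the intended flow of patches 1-3 more concrete, here is a minimal
sketch of how a non-NAPI entity could drive a standalone GRO instance.
Only gro_receive_skb() and the init/cleanup split come from the patch
descriptions above; struct gro_node, gro_init()/gro_cleanup()/gro_flush()
and their exact signatures are assumptions for illustration, not the
final API.

/* Illustrative sketch only: names marked "assumed" are guesses based on
 * the patch titles above, not the final in-tree API.
 */
#include <linux/skbuff.h>
#include <net/gro.h>

struct my_rx_ctx {
	struct gro_node gro;	/* assumed: standalone GRO state, no NAPI */
};

static void my_rx_ctx_init(struct my_rx_ctx *ctx)
{
	gro_init(&ctx->gro);	/* assumed: init exposed by patch 2 */
}

/* Feed a batch of skbs to GRO instead of netif_receive_skb_list() */
static void my_rx_ctx_receive(struct my_rx_ctx *ctx, struct sk_buff **skbs,
			      u32 n)
{
	u32 i;

	for (i = 0; i < n; i++)
		gro_receive_skb(&ctx->gro, skbs[i]);

	gro_flush(&ctx->gro, false);	/* assumed: push aggregated skbs up */
}

static void my_rx_ctx_destroy(struct my_rx_ctx *ctx)
{
	gro_cleanup(&ctx->gro);	/* assumed: cleanup exposed by patch 2 */
}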

Additional improvements:

* patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
  lists;
* patches 5-6: introduce and use a function to get skbs from the NAPI
  percpu caches in bulk rather than one at a time (see the sketch after
  this list);
* patches 7-8: use that function in veth as well and remove the
  now-superseded xdp_alloc_skb_bulk().
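
As a rough illustration of the bulk path in patches 5-6, here is a sketch
of a consumer grabbing a whole batch of skbs from the NAPI percpu cache
with one call and then attaching xdp_frames to them. The
napi_skb_cache_get_bulk() prototype and its partial-return semantics are
assumptions derived only from the patch title; __xdp_build_skb_from_frame()
is the existing helper that cpumap already uses to turn an xdp_frame into
an skb.

/* Sketch only: the napi_skb_cache_get_bulk() prototype and return-value
 * semantics are assumptions; error/leftover handling is simplified.
 */
#include <linux/skbuff.h>
#include <net/xdp.h>

static u32 my_frames_to_skbs(struct xdp_frame **frames, u32 n,
			     struct sk_buff **skbs)
{
	u32 i, got;

	/* Assumed: returns how many skbs were obtained (<= n), taking them
	 * from the NAPI percpu cache and refilling it in bulk when empty.
	 */
	got = napi_skb_cache_get_bulk((void **)skbs, n);

	for (i = 0; i < got; i++)
		/* Existing helper: attach the frame's data to the skb */
		skbs[i] = __xdp_build_skb_from_frame(frames[i], skbs[i],
						     frames[i]->dev_rx);

	/* Frames from index 'got' onwards were not given skbs and must be
	 * handled (dropped or retried) by the caller.
	 */
	return got;
}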

My trafficgen UDP GRO tests, small frame sizes:

                GRO off    GRO on
baseline        2.7        N/A       Mpps
patch 3         2.3        4         Mpps
patch 8         2.4        4.7       Mpps

patches 1-3     -17        +48       %   (vs baseline)
patches 1-8     -11        +74       %   (vs baseline)

Daniel reported throughput improvements from +14%[2] to +18%[3] in
neper's TCP RR tests. On my system, however, the same test gave me up to
+100%.

Note that there's a series from Lorenzo[4] which achieves the same, but
in a different way. During the discussions, the approach using a
standalone GRO instance was preferred over the threaded NAPI.

[0] https://github.com/alobakin/linux/tree/xdp_hints
[1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
[2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh
[3] https://lore.kernel.org/bpf/yzda66wro5twmzpmjoxvy4si5zvkehlmgtpi6brheek3sj73tj@o7kd6nurr3o6
[4] https://lore.kernel.org/bpf/20241130-cpumap-gro-v1-0-c1180b1b5758@kernel.org

Alexander Lobakin (8):
  net: gro: decouple GRO from the NAPI layer
  net: gro: expose GRO init/cleanup to use outside of NAPI
  bpf: cpumap: switch to GRO from netif_receive_skb_list()
  bpf: cpumap: reuse skb array instead of a linked list to chain skbs
  net: skbuff: introduce napi_skb_cache_get_bulk()
  bpf: cpumap: switch to napi_skb_cache_get_bulk()
  veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
  xdp: remove xdp_alloc_skb_bulk()

 include/linux/netdevice.h                  |  35 ++++--
 include/linux/skbuff.h                     |   1 +
 include/net/busy_poll.h                    |  11 +-
 include/net/gro.h                          |  38 ++++--
 include/net/xdp.h                          |   1 -
 drivers/net/ethernet/brocade/bna/bnad.c    |   1 +
 drivers/net/ethernet/cortina/gemini.c      |   1 +
 drivers/net/veth.c                         |   3 +-
 drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c |   1 +
 kernel/bpf/cpumap.c                        | 131 ++++++++++++++-------
 net/core/dev.c                             |  79 ++++---------
 net/core/gro.c                             | 103 ++++++++++------
 net/core/skbuff.c                          |  62 ++++++++++
 net/core/xdp.c                             |  10 --
 14 files changed, 306 insertions(+), 171 deletions(-)

---
From v1[5]:
* use a standalone GRO instance instead of the threaded NAPI (Jakub);
* rebase and send to net-next as it's now more networking than BPF.

[5] https://lore.kernel.org/bpf/20240830162508.1009458-1-aleksander.lobakin@intel.com

Comments

Jesper Dangaard Brouer Jan. 7, 2025, 5:17 p.m. UTC | #1
Awesome work! - some questions below

On 07/01/2025 16.29, Alexander Lobakin wrote:
> Several months ago, I had been looking through my old XDP hints tree[0]
> to check whether some patches not directly related to hints can be sent
> standalone. Roughly at the same time, Daniel appeared and asked[1] about
> GRO for cpumap from that tree.
> 
> Currently, cpumap uses its own kthread which processes cpumap-redirected
> frames by batches of 8, without any weighting (but with rescheduling
> points). The resulting skbs get passed to the stack via
> netif_receive_skb_list(), which means no GRO happens.
> Even though we can't currently pass checksum status from the drivers,
> in many cases GRO performs better than the listified Rx without the
> aggregation, confirmed by tests.
> 
> In order to enable GRO in cpumap, we need to do the following:
> 
> * patches 1-2: decouple the GRO struct from the NAPI struct and allow
>    using it out of a NAPI entity within the kernel core code;
> * patch 3: switch cpumap from netif_receive_skb_list() to
>    gro_receive_skb().
> 
> Additional improvements:
> 
> * patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
>    lists;
> * patches 5-6: introduce and use a function to get skbs from the NAPI
>    percpu caches in bulk rather than one at a time;
> * patch 7-8: use that function in veth as well and remove the one that
>    was now superseded by it.
> 
> My trafficgen UDP GRO tests, small frame sizes:
> 

How does your trafficgen UDP test manage to get UDP GRO working?
(Perhaps you can share the test?)

What is the "small frame" size being used?

Is the UDP benchmark avoiding (re)calculating the RX checksum?
(via setting UDP csum to zero)

>                  GRO off    GRO on
> baseline        2.7        N/A       Mpps
> patch 3         2.3        4         Mpps
> patch 8         2.4        4.7       Mpps
> 
> 1...3 diff      -17        +48       %
> 1...8 diff      -11        +74       %
> 
> Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
> tests. On my system however, the same test gave me up to +100%.
> 

I can imagine that the TCP throughput tests will yield a huge
performance boost.

> Note that there's a series from Lorenzo[4] which achieves the same, but
> in a different way. During the discussions, the approach using a
> standalone GRO instance was preferred over the threaded NAPI.
> 

It looks like you are keeping the "remote" CPUMAP kthread process design
intact in this series, right?

I think this design works for our use-case, where we want to give the
"remote" CPU-thread higher scheduling priority. It doesn't matter whether
this is a kthread or a threaded-NAPI thread, as long as we can see it as
a PID from userspace (by which we adjust the sched priority).

Great to see this work progressing again :-)))
--Jesper

> [0] https://github.com/alobakin/linux/tree/xdp_hints
> [1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
> [2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh
> [3] https://lore.kernel.org/bpf/yzda66wro5twmzpmjoxvy4si5zvkehlmgtpi6brheek3sj73tj@o7kd6nurr3o6
> [4] https://lore.kernel.org/bpf/20241130-cpumap-gro-v1-0-c1180b1b5758@kernel.org
> 
> Alexander Lobakin (8):
>    net: gro: decouple GRO from the NAPI layer
>    net: gro: expose GRO init/cleanup to use outside of NAPI
>    bpf: cpumap: switch to GRO from netif_receive_skb_list()
>    bpf: cpumap: reuse skb array instead of a linked list to chain skbs
>    net: skbuff: introduce napi_skb_cache_get_bulk()
>    bpf: cpumap: switch to napi_skb_cache_get_bulk()
>    veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
>    xdp: remove xdp_alloc_skb_bulk()
> 
>   include/linux/netdevice.h                  |  35 ++++--
>   include/linux/skbuff.h                     |   1 +
>   include/net/busy_poll.h                    |  11 +-
>   include/net/gro.h                          |  38 ++++--
>   include/net/xdp.h                          |   1 -
>   drivers/net/ethernet/brocade/bna/bnad.c    |   1 +
>   drivers/net/ethernet/cortina/gemini.c      |   1 +
>   drivers/net/veth.c                         |   3 +-
>   drivers/net/wwan/t7xx/t7xx_hif_dpmaif_rx.c |   1 +
>   kernel/bpf/cpumap.c                        | 131 ++++++++++++++-------
>   net/core/dev.c                             |  79 ++++---------
>   net/core/gro.c                             | 103 ++++++++++------
>   net/core/skbuff.c                          |  62 ++++++++++
>   net/core/xdp.c                             |  10 --
>   14 files changed, 306 insertions(+), 171 deletions(-)
> 
> ---
>  From v1[5]:
> * use a standalone GRO instance instead of the threaded NAPI (Jakub);
> * rebase and send to net-next as it's now more networking than BPF.
> 
> [5] https://lore.kernel.org/bpf/20240830162508.1009458-1-aleksander.lobakin@intel.com
Alexander Lobakin Jan. 8, 2025, 1:39 p.m. UTC | #2
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 7 Jan 2025 18:17:06 +0100

> Awesome work! - some questions below
> 
> On 07/01/2025 16.29, Alexander Lobakin wrote:
>> Several months ago, I had been looking through my old XDP hints tree[0]
>> to check whether some patches not directly related to hints can be sent
>> standalone. Roughly at the same time, Daniel appeared and asked[1] about
>> GRO for cpumap from that tree.
>>
>> Currently, cpumap uses its own kthread which processes cpumap-redirected
>> frames by batches of 8, without any weighting (but with rescheduling
>> points). The resulting skbs get passed to the stack via
>> netif_receive_skb_list(), which means no GRO happens.
>> Even though we can't currently pass checksum status from the drivers,
>> in many cases GRO performs better than the listified Rx without the
>> aggregation, confirmed by tests.
>>
>> In order to enable GRO in cpumap, we need to do the following:
>>
>> * patches 1-2: decouple the GRO struct from the NAPI struct and allow
>>    using it out of a NAPI entity within the kernel core code;
>> * patch 3: switch cpumap from netif_receive_skb_list() to
>>    gro_receive_skb().
>>
>> Additional improvements:
>>
>> * patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
>>    lists;
>> * patches 5-6: introduce and use a function to get skbs from the NAPI
>>    percpu caches in bulk rather than one at a time;
>> * patch 7-8: use that function in veth as well and remove the one that
>>    was now superseded by it.
>>
>> My trafficgen UDP GRO tests, small frame sizes:
>>
> 
> How does your trafficgen UDP test manage to get UDP GRO working?
> (Perhaps you can share test?)

I usually test as follows:

xdp-trafficgen from xdp-tools on the sender

then, on the receiver:

ethtool -K <iface> rx-udp-gro-forwarding on

There's no socket on the receiver, but this option enables GRO not only
when forwarding, but also for LOCAL_IN traffic when there's simply no
socket. The UDP core then drops the frame during the socket lookup.
IOW, I have the following:

* GRO gets performed
* Stack overhead is there, up to UDP lookup
* The final frame is dropped, so no userspace copy overhead.

> 
> What is the "small frame" size being used?

xdp-trafficgen currently hardcodes the frame size to 64 bytes. I was
planning to add an option to configure the frame size and send it
upstream, but unfortunately haven't finished it yet.

I realize that with bigger frames the boosts won't be as big, since the
CPU will have to calculate checksums over larger buffers. OTOH, TCP
benchmarks usually send MTU-sized buffers (+ TSO), yet the performance is
still better.

> 
> Is the UDP benchmark avoiding (re)calculating the RX checksum?
> (via setting UDP csum to zero)

Oh, I completely forgot about this one. I can imagine even bigger boosts,
since the CPU checksumming would disappear.

> 
>>                  GRO off    GRO on
>> baseline        2.7        N/A       Mpps
>> patch 3         2.3        4         Mpps
>> patch 8         2.4        4.7       Mpps
>>
>> 1...3 diff      -17        +48       %
>> 1...8 diff      -11        +74       %
>>
>> Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
>> tests. On my system however, the same test gave me up to +100%.
>>
> 
> I can imagine that the TCP throughput tests will yield a huge
> performance boost.
> 
>> Note that there's a series from Lorenzo[4] which achieves the same, but
>> in a different way. During the discussions, the approach using a
>> standalone GRO instance was preferred over the threaded NAPI.
>>
> 
> It looks like you are keeping the "remote" CPUMAP kthread process design
> intact in this series, right?

Right, the kthread logic remains the same as before.

> 
> I think this design works for our use-case. For our use-case, we want to
> give "remote" CPU-thread higher scheduling priority.  It doesn't matter
> if this is a kthread or threaded-NAPI thread, as long as we can see this
> as a PID from userspace (by which we adjust the sched priority).
> 
> Great to see this work progressing again :-)))
> --Jesper

Thanks,
Olek
Daniel Xu Jan. 9, 2025, 1:26 a.m. UTC | #3
On Tue, Jan 07, 2025 at 06:17:06PM +0100, Jesper Dangaard Brouer wrote:
> Awesome work! - some questions below
> 
> On 07/01/2025 16.29, Alexander Lobakin wrote:
> > Several months ago, I had been looking through my old XDP hints tree[0]
> > to check whether some patches not directly related to hints can be sent
> > standalone. Roughly at the same time, Daniel appeared and asked[1] about
> > GRO for cpumap from that tree.
> > 
> > Currently, cpumap uses its own kthread which processes cpumap-redirected
> > frames by batches of 8, without any weighting (but with rescheduling
> > points). The resulting skbs get passed to the stack via
> > netif_receive_skb_list(), which means no GRO happens.
> > Even though we can't currently pass checksum status from the drivers,
> > in many cases GRO performs better than the listified Rx without the
> > aggregation, confirmed by tests.
> > 
> > In order to enable GRO in cpumap, we need to do the following:
> > 
> > * patches 1-2: decouple the GRO struct from the NAPI struct and allow
> >    using it out of a NAPI entity within the kernel core code;
> > * patch 3: switch cpumap from netif_receive_skb_list() to
> >    gro_receive_skb().
> > 
> > Additional improvements:
> > 
> > * patch 4: optimize XDP_PASS in cpumap by using arrays instead of linked
> >    lists;
> > * patches 5-6: introduce and use a function to get skbs from the NAPI
> >    percpu caches in bulk rather than one at a time;
> > * patch 7-8: use that function in veth as well and remove the one that
> >    was now superseded by it.
> > 
> > My trafficgen UDP GRO tests, small frame sizes:
> > 
> 
> How does your trafficgen UDP test manage to get UDP GRO working?
> (Perhaps you can share test?)
> 
> What is the "small frame" size being used?
> 
> Is the UDP benchmark avoiding (re)calculating the RX checksum?
> (via setting UDP csum to zero)
> 
> >                  GRO off    GRO on
> > baseline        2.7        N/A       Mpps
> > patch 3         2.3        4         Mpps
> > patch 8         2.4        4.7       Mpps
> > 
> > 1...3 diff      -17        +48       %
> > 1...8 diff      -11        +74       %
> > 
> > Daniel reported from +14%[2] to +18%[3] of throughput in neper's TCP RR
> > tests. On my system however, the same test gave me up to +100%.
> > 
> 
> I can imagine that the TCP throughput tests will yield a huge
> performance boost.
> 
> > Note that there's a series from Lorenzo[4] which achieves the same, but
> > in a different way. During the discussions, the approach using a
> > standalone GRO instance was preferred over the threaded NAPI.
> > 
> 
> It looks like you are keeping the "remote" CPUMAP kthread process design
> intact in this series, right?
> 
> I think this design works for our use-case. For our use-case, we want to
> give "remote" CPU-thread higher scheduling priority.  It doesn't matter
> if this is a kthread or threaded-NAPI thread, as long as we can see this
> as a PID from userspace (by which we adjust the sched priority).
> 

Similar for us as well - having a schedulable entity helps. I might
have mentioned it on an earlier thread, but with sched-ext, I think
things could get interesting for dynamically tuning the system. We've
got some vague ideas. Probably not at this upcoming one, but if any of
the ideas work out, maybe we'll share them at netdev or something.

> Great to see this work progressing again :-)))

Agreed, thanks for continuing!

Daniel