
[bpf-next,0/9] bpf: cpumap: enable GRO for XDP_PASS frames

Message ID 20240830162508.1009458-1-aleksander.lobakin@intel.com

Message

Alexander Lobakin Aug. 30, 2024, 4:24 p.m. UTC
Recently, I've been looking through my old XDP hints tree[0] to check
whether some patches not directly related to hints can be sent
standalone. Roughly at the same time, Daniel appeared and asked[1] about
GRO for cpumap from that tree.

Currently, cpumap uses its own kthread which processes cpumap-redirected
frames in batches of 8, without any weighting (but with rescheduling
points). The resulting skbs get passed to the stack via
netif_receive_skb_list(), which means no GRO happens.
Even though we can't currently pass checksum status from the drivers,
tests confirm that in many cases GRO performs better than listified Rx
without aggregation.

In order to enable GRO in cpumap, we need to do the following:

* patches 1-3: allow creating CPU-pinned threaded NAPIs;
* patch 4: switch cpumap from a custom kthread to a CPU-pinned
  threaded NAPI;

Additional improvements:

* patch 5: optimize XDP_PASS in cpumap by using arrays instead of linked
  lists;
* patches 6-7: introduce and use a function to get skbs from the NAPI
  percpu caches in bulk, not one at a time;
* patches 8-9: use that function in veth and remove the one it
  supersedes.

My trafficgen UDP GRO tests, small frame sizes:

                GRO off    GRO on
baseline        2.7        N/A       Mpps
thread GRO      2.3        4.0       Mpps
thr bulk GRO    2.4        4.7       Mpps

baseline -> thread GRO       -17       +48      %
baseline -> thr bulk GRO     -14       +75      %

Daniel reported +14% of throughput in neper's TCP RR tests[2].

[0] https://github.com/alobakin/linux/tree/xdp_hints
[1] https://lore.kernel.org/bpf/cadda351-6e93-4568-ba26-21a760bf9a57@app.fastmail.com
[2] https://lore.kernel.org/bpf/merfatcdvwpx2lj4j2pahhwp4vihstpidws3jwljwazhh76xkd@t5vsh4gvk4mh

Alexander Lobakin (7):
  firmware/psci: fix missing '%u' format literal in
    kthread_create_on_cpu()
  kthread: allow vararg kthread_{create,run}_on_cpu()
  bpf: cpumap: reuse skb array instead of a linked list to chain skbs
  net: skbuff: introduce napi_skb_cache_get_bulk()
  bpf: cpumap: switch to napi_skb_cache_get_bulk()
  veth: use napi_skb_cache_get_bulk() instead of xdp_alloc_skb_bulk()
  xdp: remove xdp_alloc_skb_bulk()

Lorenzo Bianconi (2):
  net: napi: add ability to create CPU-pinned threaded NAPI
  bpf: cpumap: use CPU-pinned threaded NAPI w/GRO instead of kthread

 include/linux/kthread.h              |  51 ++++---
 include/linux/netdevice.h            |  35 ++++-
 include/linux/skbuff.h               |   1 +
 include/net/xdp.h                    |   1 -
 drivers/firmware/psci/psci_checker.c |   2 +-
 drivers/net/veth.c                   |   3 +-
 kernel/bpf/cpumap.c                  | 210 ++++++++++++---------------
 kernel/kthread.c                     |  22 +--
 net/core/dev.c                       |  18 ++-
 net/core/skbuff.c                    |  62 ++++++++
 net/core/xdp.c                       |  10 --
 11 files changed, 251 insertions(+), 164 deletions(-)

Comments

Jakub Kicinski Sept. 3, 2024, 8:51 p.m. UTC | #1
On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>   threaded NAPI;

Could you try to use the backlog NAPI? Allocating a fake netdev and
using NAPI as a threading abstraction feels like an abuse. Maybe try
to factor out the necessary bits? What we want is using the per-cpu 
caches, and feeding GRO. None of the IRQ related NAPI functionality
fits in here.
Lorenzo Bianconi Sept. 3, 2024, 9:33 p.m. UTC | #2
> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> > * patch 4: switch cpumap from a custom kthread to a CPU-pinned
> >   threaded NAPI;
> 
> Could you try to use the backlog NAPI? Allocating a fake netdev and
> using NAPI as a threading abstraction feels like an abuse. Maybe try
> to factor out the necessary bits? What we want is using the per-cpu 
> caches, and feeding GRO. None of the IRQ related NAPI functionality
> fits in here.

I was thinking allocating a fake netdev to use the NAPI APIs is quite a
common approach, but sure, I will look into it.

Regards,
Lorenzo
Alexander Lobakin Sept. 4, 2024, 1:13 p.m. UTC | #3
From: Jakub Kicinski <kuba@kernel.org>
Date: Tue, 3 Sep 2024 13:51:58 -0700

> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
>> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>>   threaded NAPI;
> 
> Could you try to use the backlog NAPI? Allocating a fake netdev and
> using NAPI as a threading abstraction feels like an abuse. Maybe try
> to factor out the necessary bits? What we want is using the per-cpu 
> caches, and feeding GRO. None of the IRQ related NAPI functionality
> fits in here.

Lorenzo will try it, as he wrote. I can only add that in my old tree, I
factored out the GRO bits and used them here just as you suggest. The perf
was the same, but the diffstat was several hundred lines just to factor
things out, while here the actual switch to NAPI removes more lines than
it adds, and the custom kthread logic is gone, etc. It just looks way more
elegant and simple.
I could say that gro_cells also "abuses" NAPI the same way, don't you
think? But nobody ever objected :>

Thanks,
Olek
Jakub Kicinski Sept. 4, 2024, 2:50 p.m. UTC | #4
On Wed, 4 Sep 2024 15:13:54 +0200 Alexander Lobakin wrote:
> > Could you try to use the backlog NAPI? Allocating a fake netdev and
> > using NAPI as a threading abstraction feels like an abuse. Maybe try
> > to factor out the necessary bits? What we want is using the per-cpu 
> > caches, and feeding GRO. None of the IRQ related NAPI functionality
> > fits in here.  
> 
> Lorenzo will try as he wrote. I can only add that in my old tree, I
> factored out GRO bits and used them here just as you wrote. The perf was
> the same, but the diffstat was several hundred lines only to factor out
> stuff, while here the actual switch to NAPI removes more lines than
> adds, also custom kthread logic is gone etc. It just looks way more
> elegant and simple.

Once again we seem to be arguing whether lower LoC is equivalent to
better code? :) If we can use backlog NAPI it hopefully won't be as
long. Maybe other, better approaches are within reach, too.

> I could say that gro_cells also "abuses" NAPI the same way, don't you
> think?

"same way"? :] Does it allocate a fake netdev, use NAPI as a threading
abstraction or add extra fields to napi_struct ? 
If other maintainers disagree I won't be upset, but I'm worried
that letting NAPI grow into some generic SW abstraction with broad 
use cases will hinder the ongoing queue config efforts.

> But nobody ever objected :>
Alexander Lobakin Sept. 4, 2024, 3:13 p.m. UTC | #5
From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 4 Sep 2024 07:50:41 -0700

> On Wed, 4 Sep 2024 15:13:54 +0200 Alexander Lobakin wrote:
>>> Could you try to use the backlog NAPI? Allocating a fake netdev and
>>> using NAPI as a threading abstraction feels like an abuse. Maybe try
>>> to factor out the necessary bits? What we want is using the per-cpu 
>>> caches, and feeding GRO. None of the IRQ related NAPI functionality
>>> fits in here.  
>>
>> Lorenzo will try as he wrote. I can only add that in my old tree, I
>> factored out GRO bits and used them here just as you wrote. The perf was
>> the same, but the diffstat was several hundred lines only to factor out
>> stuff, while here the actual switch to NAPI removes more lines than
>> adds, also custom kthread logic is gone etc. It just looks way more
>> elegant and simple.
> 
> Once again we seem to be arguing whether lower LoC is equivalent to
> better code? :) If we can use backlog NAPI it hopefully won't be as

And once again I didn't say that explicitly :D When 2 patches work the
same way, but one has far shorter diffstat, we often prefer this one if
it's correct. This one for cpumap looked correct to me and Lorenzo and
we didn't have any other ideas, so I picked it.
I didn't say "it's better than backlog NAPI because it's shorter", the
only thing I said re backlog NAPI is that we'll try it. I didn't think
of this previously at all, I'm no backlog expert in general.

> long. Maybe other, better approaches are within reach, too.
> 
>> I could say that gro_cells also "abuses" NAPI the same way, don't you
>> think?
> 
> "same way"? :] Does it allocate a fake netdev, use NAPI as a threading
> abstraction or add extra fields to napi_struct ? 

Wait wait wait, you said "NAPI IRQ related logic doesn't fit here". I
could say the same for gro_cells -- IRQ-related NAPI logic doesn't fit
there either. gro_cells is an SW abstraction.
A fake netdev is used by multiple drivers to use GRO, you know that
(popular for wireless drivers). They also conflict with the queue config
effort.

> If other maintainers disagree I won't be upset, but I'm worried
> that letting NAPI grow into some generic SW abstraction with broad 
> use cases will hinder the ongoing queue config efforts.
> 
>> But nobody ever objected :>

Thanks,
Olek
Jakub Kicinski Sept. 4, 2024, 6:29 p.m. UTC | #6
On Wed, 4 Sep 2024 17:13:59 +0200 Alexander Lobakin wrote:
> >> I could say that gro_cells also "abuses" NAPI the same way, don't you
> >> think?  
> > 
> > "same way"? :] Does it allocate a fake netdev, use NAPI as a threading
> > abstraction or add extra fields to napi_struct ?   
> 
> Wait wait wait, you said "NAPI IRQ related logic doesn't fit here". I
> could say the same for gro_cells -- IRQ-related NAPI logic doesn't fit
> there either. gro_cells is an SW abstraction.

Yes, that 1/4th of my complaint does indeed apply :)

> A fake netdev is used by multiple drivers to use GRO, you know that
> (popular for wireless drivers). They also conflict with the queue config
> effort.

And it did cause us some issues when changing netdev_priv() already.
Jesper Dangaard Brouer Sept. 5, 2024, 11:53 a.m. UTC | #7
On 03/09/2024 23.33, Lorenzo Bianconi wrote:
>> On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
>>> * patch 4: switch cpumap from a custom kthread to a CPU-pinned
>>>    threaded NAPI;
>>
>> Could you try to use the backlog NAPI? Allocating a fake netdev and
>> using NAPI as a threading abstraction feels like an abuse. Maybe try
>> to factor out the necessary bits? What we want is using the per-cpu
>> caches, and feeding GRO. None of the IRQ related NAPI functionality
>> fits in here.
> 
> I was thinking allocating a fake netdev to use the NAPI APIs is quite a common
> approach, but sure, I will look into it.
> 

I have a use-case for cpumap where I adjust (increase) kthread priority.

Using backlog NAPI, will I still be able to change scheduling priority?

--Jesper
Lorenzo Bianconi Sept. 5, 2024, 5:01 p.m. UTC | #8
> > On Fri, 30 Aug 2024 18:24:59 +0200 Alexander Lobakin wrote:
> > > * patch 4: switch cpumap from a custom kthread to a CPU-pinned
> > >   threaded NAPI;
> > 
> > Could you try to use the backlog NAPI? Allocating a fake netdev and
> > using NAPI as a threading abstraction feels like an abuse. Maybe try
> > to factor out the necessary bits? What we want is using the per-cpu 
> > caches, and feeding GRO. None of the IRQ related NAPI functionality
> > fits in here.
> 
> I was thinking allocating a fake netdev to use the NAPI APIs is quite a common
> approach, but sure, I will look into it.

From a first glance I think we could use the backlog NAPI APIs here in
order to avoid allocating a dummy netdev. We could implement an approach
similar to the one I used for cpumap + gro_cells here [0].
In particular, the cpumap kthread pinned on CPU 'n' can schedule the
backlog NAPI associated with CPU 'n'. However, according to my
understanding, the backlog NAPI path (process_backlog()) does not support
GRO, right? Am I missing something?

Regards,
Lorenzo

[0] https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e

> 
> Regards,
> Lorenzo
Jakub Kicinski Sept. 6, 2024, 12:20 a.m. UTC | #9
On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> In particular, the cpumap kthread pinned on cpu 'n' can schedule the
> backlog NAPI associated to cpu 'n'. However according to my understanding
> it seems the backlog NAPI APIs (in process_backlog()) do not support GRO,
> right? Am I missing something?

I meant to use the struct directly, not to schedule it. All you need
is GRO - feed it packets, flush it. 
But maybe you can avoid the netdev allocation and patch 3 in other ways.
Using backlog NAPI was just the first thing that came to mind.
Lorenzo Bianconi Sept. 6, 2024, 8:15 a.m. UTC | #10
> On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> > In particular, the cpumap kthread pinned on cpu 'n' can schedule the
> > backlog NAPI associated to cpu 'n'. However according to my understanding
> > it seems the backlog NAPI APIs (in process_backlog()) do not support GRO,
> > right? Am I missing something?
> 
> I meant to use the struct directly, not to schedule it. All you need
> is GRO - feed it packets, flush it. 

ack, thx for pointing this out.

> But maybe you can avoid the netdev allocation and patch 3 in other ways.
> Using backlog NAPI was just the first thing that came to mind.

ack, I will look into it.

Regards,
Lorenzo
Lorenzo Bianconi Sept. 7, 2024, 1:22 p.m. UTC | #11
> > On Thu, 5 Sep 2024 19:01:42 +0200 Lorenzo Bianconi wrote:
> > > In particular, the cpumap kthread pinned on cpu 'n' can schedule the
> > > backlog NAPI associated to cpu 'n'. However according to my understanding
> > > it seems the backlog NAPI APIs (in process_backlog()) do not support GRO,
> > > right? Am I missing something?
> > 
> > I meant to use the struct directly, not to schedule it. All you need
> > is GRO - feed it packets, flush it. 
> 
> ack, thx for pointing this out.
> 
> > But maybe you can avoid the netdev allocation and patch 3 in other ways.
> > Using backlog NAPI was just the first thing that came to mind.
> 
> ack, I will look into it.
> 
> Regards,
> Lorenzo

Hi all,

I reworked my previous implementation to add GRO support to the cpumap
codebase, removing the dummy netdev dependency and keeping most of the
other logic. You can find the codebase here:
- https://github.com/LorenzoBianconi/bpf-next/commit/e152cf8c212196fccece0b516190827430c0f5f8
I added the two patches below in order to reuse some generic NAPI code:
- https://github.com/LorenzoBianconi/bpf-next/commit/3c73e9c2f07486590749e9b3bfb8a4b3df4cb5e0
- https://github.com/LorenzoBianconi/bpf-next/commit/d435ce2e1b6a991a6264a5aad4a0374a3ca86a51
I have not run any performance tests yet, only functional ones.

Regards,
Lorenzo