
[RFC,bpf-next,32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()

Message ID 20220628194812.1453059-33-alexandr.lobakin@intel.com (mailing list archive)
State RFC
Delegated to: BPF
Series bpf, xdp: introduce and use Generic Hints/metadata

Checks

Context Check Description
bpf/vmtest-bpf-next-PR fail PR summary
bpf/vmtest-bpf-next-VM_Test-1 fail Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2 fail Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-VM_Test-3 fail Logs for Kernel LATEST on z15 with gcc
netdev/tree_selection success Clearly marked for bpf-next, async
netdev/apply fail Patch does not apply to bpf-next

Commit Message

Alexander Lobakin June 28, 2022, 7:47 p.m. UTC
cpumap has its own BH context based on a kthread. It has a sane batch
size of 8 frames per cycle.
GRO can be used on its own; adjust the cpumap calls to the
upper stack to use the GRO API instead of netif_receive_skb_list(),
which processes skbs in batches but doesn't involve the GRO layer at
all.
It is most beneficial when the NIC the frames come from is XDP
generic-metadata-enabled, but in plenty of tests GRO performs better
than list-ified receive even though it has to calculate full frame
checksums on the CPU.
As GRO passes the skbs to the upper stack in batches of
@gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
device where the frame comes from, it is enough to disable the GRO
netdev feature on it to completely restore the original behaviour:
untouched frames will still be bulked and passed to the upper stack
by 8, as they were with netif_receive_skb_list().

Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
 kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 38 insertions(+), 5 deletions(-)
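As an editorial illustration of the batching behaviour the commit message describes, here is a minimal userspace C sketch (not the kernel code; it only mirrors @gro_normal_batch's default of 8) of how untouched frames get bulked and flushed to the upper stack in groups:

```c
#include <assert.h>

/* Userspace sketch, not kernel code: gro_normal_batch-style flushing.
 * Frames queue on a "normal" list and are delivered upstack in bulks
 * of 8, which is also what netif_receive_skb_list() effectively gave
 * cpumap for its 8-frame batches. */
#define GRO_NORMAL_BATCH 8

struct sim {
	int pending;   /* frames queued on the GRO "normal" list */
	int delivered; /* frames already flushed to the upper stack */
	int flushes;   /* number of upstack calls */
};

/* Queue one non-coalesced frame; flush when the threshold is hit. */
static void gro_normal_one(struct sim *s)
{
	if (++s->pending < GRO_NORMAL_BATCH)
		return;
	s->delivered += s->pending;
	s->pending = 0;
	s->flushes++;
}

/* End of a polling cycle: flush whatever is left over. */
static void gro_normal_flush(struct sim *s)
{
	if (!s->pending)
		return;
	s->delivered += s->pending;
	s->pending = 0;
	s->flushes++;
}
```

With 10 frames this makes two upstack calls (8 + 2), matching the "bulked and passed by 8" behaviour described above.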

Comments

Daniel Xu Aug. 7, 2024, 8:38 p.m. UTC | #1
Hi Alexander,

On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> [...]

AFAICT the cpumap + GRO is a good standalone improvement. I think
cpumap is still missing this.

I have a production use case for this now. We want to do some intelligent
RX steering and I think GRO would help over list-ified receive in some cases.
We would prefer steer in HW (and thus get existing GRO support) but not all
our NICs support it. So we need a software fallback.

Are you still interested in merging the cpumap + GRO patches?

Thanks,
Daniel
Lorenzo Bianconi Aug. 8, 2024, 4:54 a.m. UTC | #2
> Hi Alexander,
> 
> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> > [...]
> 
> AFAICT the cpumap + GRO is a good standalone improvement. I think
> cpumap is still missing this.
> 
> I have a production use case for this now. We want to do some intelligent
> RX steering and I think GRO would help over list-ified receive in some cases.
> We would prefer steer in HW (and thus get existing GRO support) but not all
> our NICs support it. So we need a software fallback.
> 
> Are you still interested in merging the cpumap + GRO patches?

Hi Daniel and Alex,

Recently I worked on a PoC to add GRO support to the cpumap codebase:
- https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
  Here I added GRO support to cpumap through gro-cells.
- https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
  Here I added GRO support to cpumap through NAPI-threaded APIs (with some
  changes to them).

Please note I have not run any performance tests so far, just verified it does
not crash (I was planning to resume this work soon). Please let me know if it
works for you.

Regards,
Lorenzo

> 
> Thanks,
> Daniel
>
Alexander Lobakin Aug. 8, 2024, 11:57 a.m. UTC | #3
From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Thu, 8 Aug 2024 06:54:06 +0200

>> Hi Alexander,
>>
>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>> [...]
>>
>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>> cpumap is still missing this.

The only concern for having GRO in cpumap without metadata from the NIC
descriptor was that when the checksum status is missing, GRO calculates
the checksum on the CPU, which is not really fast.
But I remember sometimes GRO was faster despite that.
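To make the cost concrete, below is a userspace sketch of the RFC 1071 Internet checksum that the stack has to fold over every payload byte when no hardware checksum status is available; the kernel's csum_partial() is a heavily optimized version of the same computation:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Userspace illustration, not the kernel's csum_partial(): the
 * RFC 1071 ones'-complement checksum GRO must compute in software
 * when the driver reports no checksum status (CHECKSUM_NONE). */
static uint16_t inet_csum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	while (len > 1) {		/* sum 16-bit big-endian words */
		sum += (uint32_t)p[0] << 8 | p[1];
		p += 2;
		len -= 2;
	}
	if (len)			/* trailing odd byte */
		sum += (uint32_t)p[0] << 8;
	while (sum >> 16)		/* fold carries back in */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;		/* ones' complement of the sum */
}
```

Touching every byte like this on the CPU is exactly the overhead being traded against GRO's coalescing gains.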

>>
>> I have a production use case for this now. We want to do some intelligent
>> RX steering and I think GRO would help over list-ified receive in some cases.
>> We would prefer steer in HW (and thus get existing GRO support) but not all
>> our NICs support it. So we need a software fallback.
>>
>> Are you still interested in merging the cpumap + GRO patches?

For sure I can revive this part. I was planning to get back to this
branch and pick patches which were not related to XDP hints and send
them separately.

> 
> Hi Daniel and Alex,
> 
> Recently I worked on a PoC to add GRO support to cpumap codebase:
> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>   Here I added GRO support to cpumap through gro-cells.
> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>   Here I added GRO support to cpumap trough napi-threaded APIs (with a some
>   changes to them).

Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
overkill, that's why I separated GRO structure from &napi_struct.

Let me maybe find some free time, I would then test all 3 solutions
(mine, gro_cells, threaded NAPI) and pick/send the best?

> 
> Please note I have not run any performance tests so far, just verified it does
> not crash (I was planning to resume this work soon). Please let me know if it
> works for you.
> 
> Regards,
> Lorenzo
> 
>>
>> Thanks,
>> Daniel

Thanks,
Olek
Lorenzo Bianconi Aug. 8, 2024, 5:22 p.m. UTC | #4
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
> 
> >> Hi Alexander,
> >>
> >> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> >>> [...]
> >>
> >> AFAICT the cpumap + GRO is a good standalone improvement. I think
> >> cpumap is still missing this.
> 
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.
> 
> >>
> >> I have a production use case for this now. We want to do some intelligent
> >> RX steering and I think GRO would help over list-ified receive in some cases.
> >> We would prefer steer in HW (and thus get existing GRO support) but not all
> >> our NICs support it. So we need a software fallback.
> >>
> >> Are you still interested in merging the cpumap + GRO patches?
> 
> For sure I can revive this part. I was planning to get back to this
> branch and pick patches which were not related to XDP hints and send
> them separately.
> 
> > 
> > Hi Daniel and Alex,
> > 
> > Recently I worked on a PoC to add GRO support to cpumap codebase:
> > - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> >   Here I added GRO support to cpumap through gro-cells.
> > - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> >   Here I added GRO support to cpumap through NAPI-threaded APIs (with some
> >   changes to them).
> 
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated GRO structure from &napi_struct.

if we consider the NAPI-threaded implementation, we have the same architecture
we have in the current cpumap codebase, a thread for each cpumap entry; the
only difference is we can rely on the GRO APIs.

Regards,
Lorenzo

> 
> Let me maybe find some free time, I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?
> 
> > 
> > Please note I have not run any performance tests so far, just verified it does
> > not crash (I was planning to resume this work soon). Please let me know if it
> > works for you.
> > 
> > Regards,
> > Lorenzo
> > 
> >>
> >> Thanks,
> >> Daniel
> 
> Thanks,
> Olek
>
Daniel Xu Aug. 8, 2024, 8:44 p.m. UTC | #5
Hi Lorenzo,

On Thu, Aug 8, 2024, at 12:54 AM, Lorenzo Bianconi wrote:
>> Hi Alexander,
>> 
>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>> > [...]
>> 
>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>> cpumap is still missing this.
>> 
>> I have a production use case for this now. We want to do some intelligent
>> RX steering and I think GRO would help over list-ified receive in some cases.
>> We would prefer steer in HW (and thus get existing GRO support) but not all
>> our NICs support it. So we need a software fallback.
>> 
>> Are you still interested in merging the cpumap + GRO patches?
>
> Hi Daniel and Alex,
>
> Recently I worked on a PoC to add GRO support to the cpumap codebase:
> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>   Here I added GRO support to cpumap through gro-cells.
> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>   Here I added GRO support to cpumap through NAPI-threaded APIs (with some
>   changes to them).

Cool! 

>
> Please note I have not run any performance tests so far, just verified it does
> not crash (I was planning to resume this work soon). Please let me know if it
> works for you.

I’ll try to run an A/B test on your two approaches as well as Alex’s. I’ve still
got some testbeds with production traffic going through them.

Thanks,
Daniel
Daniel Xu Aug. 8, 2024, 8:52 p.m. UTC | #6
Hi,

On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
>
>>> Hi Alexander,
>>>
>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>> [...]
>>>
>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>> cpumap is still missing this.
>
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.

Good to know, thanks. IIUC some kind of XDP hint support landed already?

My use case could also use the HW RSS hash to avoid a rehash in the XDP
prog. And the HW RX timestamp to not break SO_TIMESTAMPING. These two
are on one of my TODO lists, but I can’t get to them for at least
a few weeks. So feel free to take it if you’d like.

>
>>>
>>> I have a production use case for this now. We want to do some intelligent
>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>> our NICs support it. So we need a software fallback.
>>>
>>> Are you still interested in merging the cpumap + GRO patches?
>
> For sure I can revive this part. I was planning to get back to this
> branch and pick patches which were not related to XDP hints and send
> them separately.
>
>> 
>> Hi Daniel and Alex,
>> 
>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>   Here I added GRO support to cpumap through gro-cells.
>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>   Here I added GRO support to cpumap through NAPI-threaded APIs (with some
>>   changes to them).
>
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated GRO structure from &napi_struct.
>
> Let me maybe find some free time, I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?

Sounds good. Would be good to compare results.

[…]

Thanks,
Daniel
Jesper Dangaard Brouer Aug. 9, 2024, 9:32 a.m. UTC | #7
On 08/08/2024 22.44, Daniel Xu wrote:
> Hi Lorenzo,
> 
> On Thu, Aug 8, 2024, at 12:54 AM, Lorenzo Bianconi wrote:
>>> Hi Alexander,
>>>
>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
[...]
>>>
>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>> cpumap is still missing this.
>>>
>>> I have a production use case for this now. We want to do some intelligent
>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>> our NICs support it. So we need a software fallback.
>>>
I want to state that Cloudflare is also planning to use cpumap in
production, and one blocker is that CPUMAP doesn't support GRO.


>>> Are you still interested in merging the cpumap + GRO patches?
>>
>> Hi Daniel and Alex,
>>
>> Recently I worked on a PoC to add GRO support to the cpumap codebase:
>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>   Here I added GRO support to cpumap through gro-cells.
>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>   Here I added GRO support to cpumap through NAPI-threaded APIs (with some
>>   changes to them).
> 
> Cool!
> 
>>
>> Please note I have not run any performance tests so far, just verified it does
>> not crash (I was planning to resume this work soon). Please let me know if it
>> works for you.
> 
> I’ll try to run an A/B test on your two approaches as well as Alex’s. I’ve still
> got some testbeds with production traffic going thru them.
> 

It is awesome that both Olek and you are stepping up to test this.
(I'm currently too busy with cgroup rstat lock related work, but more
people will be joining my team this month and I hope they will be
interested in contributing to this effort.)

--Jesper
Jesper Dangaard Brouer Aug. 9, 2024, 10:02 a.m. UTC | #8
On 08/08/2024 22.52, Daniel Xu wrote:
> 
> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>>
[...]
>> The only concern for having GRO in cpumap without metadata from the NIC
>> descriptor was that when the checksum status is missing, GRO calculates
>> the checksum on CPU, which is not really fast.
>> But I remember sometimes GRO was faster despite that.
 >
> Good to know, thanks. IIUC some kind of XDP hint support landed already?
> 

The XDP-hints work ended up being called 'XDP RX metadata' in the kernel
docs [1], which makes it difficult to talk about without talking past
each other. The TX side was only implemented for AF_XDP [2].

  [1] https://www.kernel.org/doc/html/latest/networking/xdp-rx-metadata.html
  [2] https://www.kernel.org/doc/html/latest/networking/xsk-tx-metadata.html

What landed as 'XDP RX metadata' [1] is that we (via kfunc calls) get
access to reading hardware RX offloads/hints directly from the
RX descriptor. This implies a limitation: we only have access to this
data in the running XDP program, as the RX descriptor is short-lived.

Thus, we need to store the RX-descriptor information somewhere to make
it available to 'cpumap' on the remote CPU. After failing to standardize
the formatting of the XDP metadata area, my "new" opinion is that we
should simply extend struct xdp_frame with the fields needed for SKB
creation. Then we can create some kfunc helpers that allow an XDP prog
to store this info.


> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
> And HW RX timestamp to not break SO_TIMESTAMPING. These two
> are on one of my TODO lists. But I can’t get to them for at least
> a few weeks. So free to take it if you’d like.

The kfuncs you need should be available:

  HW RSS hash = bpf_xdp_metadata_rx_hash()
  HW RX timestamp = bpf_xdp_metadata_rx_timestamp()

We just need to implement storing the information such that it is
available to CPUMAP, and make it generic so that it also works for
veth when it gets an XDP-redirected xdp_frame.

Hoping someone can work on this soon,
--Jesper
Alexander Lobakin Aug. 9, 2024, 12:20 p.m. UTC | #9
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Thu, 08 Aug 2024 16:52:51 -0400

> Hi,
> 
> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>
>>>> Hi Alexander,
>>>>
>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>> [...]
>>>>
>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>> cpumap is still missing this.
>>
>> The only concern for having GRO in cpumap without metadata from the NIC
>> descriptor was that when the checksum status is missing, GRO calculates
>> the checksum on CPU, which is not really fast.
>> But I remember sometimes GRO was faster despite that.
> 
> Good to know, thanks. IIUC some kind of XDP hint support landed already?
> 
> My use case could also use HW RSS hash to avoid a rehash in XDP prog.

Unfortunately, for now it's impossible to get HW metadata such as the RSS
hash and checksum status in cpumap. They're implemented via kfuncs
specific to a particular netdevice, and this info is available only while
the XDP prog is running.

But I think one solution could be:

1. We create some generic structure for cpumap, like

struct cpumap_meta {
	u32 magic;
	u32 hash;
};

2. We add such check in the cpumap code

	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
	    <here we check magic>)
		skb->hash = meta->hash;

3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
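A userspace C sketch of steps 1-3 above; note that struct cpumap_meta, the magic value, and the helper names are all hypothetical (taken from or invented for this illustration), not an existing kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scheme from the discussion above, simulated in
 * userspace. The magic value is arbitrary for this sketch. */
#define CPUMAP_META_MAGIC 0x43504d45u

struct cpumap_meta {
	uint32_t magic;
	uint32_t hash;
};

struct sim_xdp_frame {
	uint32_t metalen;
	struct cpumap_meta meta;	/* stands in for xdpf's metadata area */
};

struct sim_skb {
	uint32_t hash;
	int hash_set;
};

/* Step 2 of the proposal: only trust the metadata if both the
 * length and the magic match what the XDP prog wrote in step 3. */
static void cpumap_apply_meta(const struct sim_xdp_frame *xdpf,
			      struct sim_skb *skb)
{
	if (xdpf->metalen != sizeof(struct cpumap_meta) ||
	    xdpf->meta.magic != CPUMAP_META_MAGIC)
		return;
	skb->hash = xdpf->meta.hash;
	skb->hash_set = 1;
}
```

The magic + length check is what keeps the scheme safe against frames whose metadata area holds unrelated program data.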

> And HW RX timestamp to not break SO_TIMESTAMPING. These two
> are on one of my TODO lists. But I can’t get to them for at least
> a few weeks. So free to take it if you’d like.
> 
>>
>>>>
>>>> I have a production use case for this now. We want to do some intelligent
>>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>>> our NICs support it. So we need a software fallback.
>>>>
>>>> Are you still interested in merging the cpumap + GRO patches?
>>
>> For sure I can revive this part. I was planning to get back to this
>> branch and pick patches which were not related to XDP hints and send
>> them separately.
>>
>>>
>>> Hi Daniel and Alex,
>>>
>>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>>   Here I added GRO support to cpumap through gro-cells.
>>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>>   Here I added GRO support to cpumap through NAPI-threaded APIs (with some
>>>   changes to them).
>>
>> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
>> overkill, that's why I separated GRO structure from &napi_struct.
>>
>> Let me maybe find some free time, I would then test all 3 solutions
>> (mine, gro_cells, threaded NAPI) and pick/send the best?
> 
> Sounds good. Would be good to compare results.
> 
> […]
> 
> Thanks,
> Daniel

Thanks,
Olek
Toke Høiland-Jørgensen Aug. 9, 2024, 12:45 p.m. UTC | #10
Alexander Lobakin <aleksander.lobakin@intel.com> writes:

> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Thu, 08 Aug 2024 16:52:51 -0400
>
>> Hi,
>> 
>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>
>>>>> Hi Alexander,
>>>>>
>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>>> [...]
>>>>>
>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>>> cpumap is still missing this.
>>>
>>> The only concern for having GRO in cpumap without metadata from the NIC
>>> descriptor was that when the checksum status is missing, GRO calculates
>>> the checksum on CPU, which is not really fast.
>>> But I remember sometimes GRO was faster despite that.
>> 
>> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>> 
>> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
>
> Unfortunately, for now it's impossible to get HW metadata such as RSS
> hash and checksum status in cpumap. They're implemented via kfuncs
> specific to a particular netdevice and this info is available only when
> running XDP prog.
>
> But I think one solution could be:
>
> 1. We create some generic structure for cpumap, like
>
> struct cpumap_meta {
> 	u32 magic;
> 	u32 hash;
> }
>
> 2. We add such check in the cpumap code
>
> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
> 	    <here we check magic>)
> 		skb->hash = meta->hash;
>
> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.

Yes, except don't make this cpumap-specific, make it generic for kernel
consumption of the metadata. That way it doesn't even have to be stored
in the xdp metadata area, it can be anywhere we want (and hence not
subject to ABI issues), and we can use it for skb creation after
redirect in other places than cpumap as well (say, on veth devices).

So it'll be:

struct kernel_meta {
	u32 hash;
	u32 timestamp;
	/* ...etc */
};

and a kfunc:

void store_xdp_kernel_meta(struct kernel_meta *meta);

which the XDP program can call to populate the metadata area.

-Toke
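The generic variant outlined in the message above could look roughly like the following userspace simulation; struct kernel_meta and store_xdp_kernel_meta() are names from the message, while the out-of-band storage slot and the validity flag are assumptions of this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace simulation of generic kernel-consumed metadata. The
 * metadata lives in a kernel-owned slot alongside the frame (not in
 * the XDP metadata area), so its layout is not subject to ABI issues. */
struct kernel_meta {
	uint32_t hash;
	uint32_t timestamp;
};

struct sim_frame {
	struct kernel_meta meta;	/* hypothetical out-of-band slot */
	int meta_valid;
};

/* Stand-in for the proposed kfunc: the XDP program populates the
 * kernel-owned metadata slot for later skb creation. */
static void store_xdp_kernel_meta(struct sim_frame *frame,
				  const struct kernel_meta *meta)
{
	frame->meta = *meta;
	frame->meta_valid = 1;
}

/* Consumer side (cpumap, veth after redirect, ...): use the stored
 * hash when building the skb, fall back to 0 (recompute) otherwise. */
static uint32_t frame_skb_hash(const struct sim_frame *frame)
{
	return frame->meta_valid ? frame->meta.hash : 0;
}
```

Keeping the consumer generic is what lets the same stored metadata serve skb creation in cpumap and on veth devices alike.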
Alexander Lobakin Aug. 9, 2024, 12:56 p.m. UTC | #11
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Fri, 09 Aug 2024 14:45:33 +0200

> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> 
>> From: Daniel Xu <dxu@dxuuu.xyz>
>> Date: Thu, 08 Aug 2024 16:52:51 -0400
>>
>>> Hi,
>>>
>>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>
>>>>>> Hi Alexander,
>>>>>>
>>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>>>> [...]
>>>>>>
>>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>>>> cpumap is still missing this.
>>>>
>>>> The only concern for having GRO in cpumap without metadata from the NIC
>>>> descriptor was that when the checksum status is missing, GRO calculates
>>>> the checksum on CPU, which is not really fast.
>>>> But I remember sometimes GRO was faster despite that.
>>>
>>> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>>>
>>> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
>>
>> Unfortunately, for now it's impossible to get HW metadata such as RSS
>> hash and checksum status in cpumap. They're implemented via kfuncs
>> specific to a particular netdevice and this info is available only when
>> running XDP prog.
>>
>> But I think one solution could be:
>>
>> 1. We create some generic structure for cpumap, like
>>
>> struct cpumap_meta {
>> 	u32 magic;
>> 	u32 hash;
>> }
>>
>> 2. We add such check in the cpumap code
>>
>> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
>> 	    <here we check magic>)
>> 		skb->hash = meta->hash;
>>
>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
> 
> Yes, except don't make this cpumap-specific, make it generic for kernel
> consumption of the metadata. That way it doesn't even have to be stored
> in the xdp metadata area, it can be anywhere we want (and hence not
> subject to ABI issues), and we can use it for skb creation after
> redirect in other places than cpumap as well (say, on veth devices).
> 
> So it'll be:
> 
> struct kernel_meta {
> 	u32 hash;
> 	u32 timestamp;
>         ...etc
> }
> 
> and a kfunc:
> 
> void store_xdp_kernel_meta(struct kernel meta *meta);
> 
> which the XDP program can call to populate the metadata area.

Hmm, nice!

But where to store this info in the cpumap case, if not in xdp->data_meta?
When you convert XDP frames to skbs in the cpumap code, you only have
the &xdp_frame and that's it. The XDP prog was already run earlier, from
the driver code, at that point.

But yes, in general we still need some generic structure, so that
generic consumers like cpumap (and not only cpumap) could make use of
it, not just XDP programs.

> 
> -Toke

Thanks,
Olek
Toke Høiland-Jørgensen Aug. 9, 2024, 1:42 p.m. UTC | #12
Alexander Lobakin <aleksander.lobakin@intel.com> writes:

> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Date: Fri, 09 Aug 2024 14:45:33 +0200
>
>> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>> 
>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>> Date: Thu, 08 Aug 2024 16:52:51 -0400
>>>
>>>> Hi,
>>>>
>>>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>>
>>>>>>> Hi Alexander,
>>>>>>>
>>>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>>>>> cpumap has its own BH context based on kthread. It has a sane batch
>>>>>>>> size of 8 frames per one cycle.
>>>>>>>> GRO can be used on its own, adjust cpumap calls to the
>>>>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which
>>>>>>>> processes skbs by batches, but doesn't involve GRO layer at all.
>>>>>>>> It is most beneficial when a NIC which frame come from is XDP
>>>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>>>>>> than listed receiving even given that it has to calculate full frame
>>>>>>>> checksums on CPU.
>>>>>>>> As GRO passes the skbs to the upper stack in the batches of
>>>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
>>>>>>>> device where the frame comes from, it is enough to disable GRO
>>>>>>>> netdev feature on it to completely restore the original behaviour:
>>>>>>>> untouched frames will be being bulked and passed to the upper stack
>>>>>>>> by 8, as it was with netif_receive_skb_list().
>>>>>>>>
>>>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>>>> ---
>>>>>>>>  kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>>>>>  1 file changed, 38 insertions(+), 5 deletions(-)
>>>>>>>>
>>>>>>>
>>>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>>>>> cpumap is still missing this.
>>>>>
>>>>> The only concern for having GRO in cpumap without metadata from the NIC
>>>>> descriptor was that when the checksum status is missing, GRO calculates
>>>>> the checksum on CPU, which is not really fast.
>>>>> But I remember sometimes GRO was faster despite that.
>>>>
>>>> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>>>>
>>>> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
>>>
>>> Unfortunately, for now it's impossible to get HW metadata such as RSS
>>> hash and checksum status in cpumap. They're implemented via kfuncs
>>> specific to a particular netdevice and this info is available only when
>>> running XDP prog.
>>>
>>> But I think one solution could be:
>>>
>>> 1. We create some generic structure for cpumap, like
>>>
>>> struct cpumap_meta {
>>> 	u32 magic;
>>> 	u32 hash;
>>> }
>>>
>>> 2. We add such check in the cpumap code
>>>
>>> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
>>> 	    <here we check magic>)
>>> 		skb->hash = meta->hash;
>>>
>>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
>>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
>> 
>> Yes, except don't make this cpumap-specific, make it generic for kernel
>> consumption of the metadata. That way it doesn't even have to be stored
>> in the xdp metadata area, it can be anywhere we want (and hence not
>> subject to ABI issues), and we can use it for skb creation after
>> redirect in other places than cpumap as well (say, on veth devices).
>> 
>> So it'll be:
>> 
>> struct kernel_meta {
>> 	u32 hash;
>> 	u32 timestamp;
>>         ...etc
>> }
>> 
>> and a kfunc:
>> 
>> void store_xdp_kernel_meta(struct kernel meta *meta);
>> 
>> which the XDP program can call to populate the metadata area.
>
> Hmm, nice!
>
> But where to store this info in case of cpumap if not in xdp->data_meta?
> When you convert XDP frames to skbs in the cpumap code, you only have
> &xdp_frame and that's it. XDP prog was already run earlier from the
> driver code at that point.

Well, we could put it in skb_shared_info? IIRC, some of the metadata
(timestamps?) end up there when building an skb anyway, so we won't even
have to copy it around...

-Toke
Martin KaFai Lau Aug. 10, 2024, 12:54 a.m. UTC | #13
On 8/9/24 6:42 AM, Toke Høiland-Jørgensen wrote:
> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> 
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> Date: Fri, 09 Aug 2024 14:45:33 +0200
>>
>>> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>>>
>>>> From: Daniel Xu <dxu@dxuuu.xyz>
>>>> Date: Thu, 08 Aug 2024 16:52:51 -0400
>>>>
>>>>> Hi,
>>>>>
>>>>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>>>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>>>
>>>>>>>> Hi Alexander,
>>>>>>>>
>>>>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>>>>>> cpumap has its own BH context based on kthread. It has a sane batch
>>>>>>>>> size of 8 frames per one cycle.
>>>>>>>>> GRO can be used on its own, adjust cpumap calls to the
>>>>>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which
>>>>>>>>> processes skbs by batches, but doesn't involve GRO layer at all.
>>>>>>>>> It is most beneficial when a NIC which frame come from is XDP
>>>>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>>>>>>> than listed receiving even given that it has to calculate full frame
>>>>>>>>> checksums on CPU.
>>>>>>>>> As GRO passes the skbs to the upper stack in the batches of
>>>>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
>>>>>>>>> device where the frame comes from, it is enough to disable GRO
>>>>>>>>> netdev feature on it to completely restore the original behaviour:
>>>>>>>>> untouched frames will be being bulked and passed to the upper stack
>>>>>>>>> by 8, as it was with netif_receive_skb_list().
>>>>>>>>>
>>>>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>>>>> ---
>>>>>>>>>   kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>>>>>>   1 file changed, 38 insertions(+), 5 deletions(-)
>>>>>>>>>
>>>>>>>>
>>>>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>>>>>> cpumap is still missing this.
>>>>>>
>>>>>> The only concern for having GRO in cpumap without metadata from the NIC
>>>>>> descriptor was that when the checksum status is missing, GRO calculates
>>>>>> the checksum on CPU, which is not really fast.
>>>>>> But I remember sometimes GRO was faster despite that.
>>>>>
>>>>> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>>>>>
>>>>> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
>>>>
>>>> Unfortunately, for now it's impossible to get HW metadata such as RSS
>>>> hash and checksum status in cpumap. They're implemented via kfuncs
>>>> specific to a particular netdevice and this info is available only when
>>>> running XDP prog.
>>>>
>>>> But I think one solution could be:
>>>>
>>>> 1. We create some generic structure for cpumap, like
>>>>
>>>> struct cpumap_meta {
>>>> 	u32 magic;
>>>> 	u32 hash;
>>>> }
>>>>
>>>> 2. We add such check in the cpumap code
>>>>
>>>> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
>>>> 	    <here we check magic>)
>>>> 		skb->hash = meta->hash;
>>>>
>>>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
>>>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
>>>
>>> Yes, except don't make this cpumap-specific, make it generic for kernel
>>> consumption of the metadata. That way it doesn't even have to be stored
>>> in the xdp metadata area, it can be anywhere we want (and hence not
>>> subject to ABI issues), and we can use it for skb creation after
>>> redirect in other places than cpumap as well (say, on veth devices).
>>>
>>> So it'll be:
>>>
>>> struct kernel_meta {
>>> 	u32 hash;
>>> 	u32 timestamp;
>>>          ...etc
>>> }
>>>
>>> and a kfunc:
>>>
>>> void store_xdp_kernel_meta(struct kernel meta *meta);
>>>
>>> which the XDP program can call to populate the metadata area.
>>
>> Hmm, nice!
>>
>> But where to store this info in case of cpumap if not in xdp->data_meta?

The cpumap has an xdp program. Instead of the kernel's cpumap code building the
skb, the cpumap's xdp prog could build the skb itself and directly use the
xdp->data_meta to init the skb.

I recall there was a discussion about doing gro in a bpf prog (I may be
mis-remembering though). If possible, could the cpumap's xdp prog also do the
gro?

>> When you convert XDP frames to skbs in the cpumap code, you only have
>> &xdp_frame and that's it. XDP prog was already run earlier from the
>> driver code at that point.
> 
> Well, we could put it in skb_shared_info? IIRC, some of the metadata
> (timestamps?) end up there when building an skb anyway, so we won't even
> have to copy it around...
Lorenzo Bianconi Aug. 10, 2024, 8 a.m. UTC | #14
On Aug 08, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
> 
> >> Hi Alexander,
> >>
> >> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> >>> cpumap has its own BH context based on kthread. It has a sane batch
> >>> size of 8 frames per one cycle.
> >>> GRO can be used on its own, adjust cpumap calls to the
> >>> upper stack to use GRO API instead of netif_receive_skb_list() which
> >>> processes skbs by batches, but doesn't involve GRO layer at all.
> >>> It is most beneficial when a NIC which frame come from is XDP
> >>> generic metadata-enabled, but in plenty of tests GRO performs better
> >>> than listed receiving even given that it has to calculate full frame
> >>> checksums on CPU.
> >>> As GRO passes the skbs to the upper stack in the batches of
> >>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
> >>> device where the frame comes from, it is enough to disable GRO
> >>> netdev feature on it to completely restore the original behaviour:
> >>> untouched frames will be being bulked and passed to the upper stack
> >>> by 8, as it was with netif_receive_skb_list().
> >>>
> >>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>> ---
> >>>  kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> >>>  1 file changed, 38 insertions(+), 5 deletions(-)
> >>>
> >>
> >> AFAICT the cpumap + GRO is a good standalone improvement. I think
> >> cpumap is still missing this.
> 
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.

For the moment we could test it with UDP traffic with checksum disabled.

Regards,
Lorenzo

> 
> >>
> >> I have a production use case for this now. We want to do some intelligent
> >> RX steering and I think GRO would help over list-ified receive in some cases.
> >> We would prefer steer in HW (and thus get existing GRO support) but not all
> >> our NICs support it. So we need a software fallback.
> >>
> >> Are you still interested in merging the cpumap + GRO patches?
> 
> For sure I can revive this part. I was planning to get back to this
> branch and pick patches which were not related to XDP hints and send
> them separately.
> 
> > 
> > Hi Daniel and Alex,
> > 
> > Recently I worked on a PoC to add GRO support to cpumap codebase:
> > - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> >   Here I added GRO support to cpumap through gro-cells.
> > - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> >   Here I added GRO support to cpumap trough napi-threaded APIs (with a some
> >   changes to them).
> 
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated GRO structure from &napi_struct.
> 
> Let me maybe find some free time, I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?
> 
> > 
> > Please note I have not run any performance tests so far, just verified it does
> > not crash (I was planning to resume this work soon). Please let me know if it
> > works for you.
> > 
> > Regards,
> > Lorenzo
> > 
> >>
> >> Thanks,
> >> Daniel
> 
> Thanks,
> Olek
>
Lorenzo Bianconi Aug. 10, 2024, 8:02 a.m. UTC | #15
On Aug 09, Toke wrote:
> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> 
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> > Date: Fri, 09 Aug 2024 14:45:33 +0200
> >
> >> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> >> 
> >>> From: Daniel Xu <dxu@dxuuu.xyz>
> >>> Date: Thu, 08 Aug 2024 16:52:51 -0400
> >>>
> >>>> Hi,
> >>>>
> >>>> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
> >>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> >>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
> >>>>>
> >>>>>>> Hi Alexander,
> >>>>>>>
> >>>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> >>>>>>>> cpumap has its own BH context based on kthread. It has a sane batch
> >>>>>>>> size of 8 frames per one cycle.
> >>>>>>>> GRO can be used on its own, adjust cpumap calls to the
> >>>>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which
> >>>>>>>> processes skbs by batches, but doesn't involve GRO layer at all.
> >>>>>>>> It is most beneficial when a NIC which frame come from is XDP
> >>>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better
> >>>>>>>> than listed receiving even given that it has to calculate full frame
> >>>>>>>> checksums on CPU.
> >>>>>>>> As GRO passes the skbs to the upper stack in the batches of
> >>>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
> >>>>>>>> device where the frame comes from, it is enough to disable GRO
> >>>>>>>> netdev feature on it to completely restore the original behaviour:
> >>>>>>>> untouched frames will be being bulked and passed to the upper stack
> >>>>>>>> by 8, as it was with netif_receive_skb_list().
> >>>>>>>>
> >>>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>>>>>>> ---
> >>>>>>>>  kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> >>>>>>>>  1 file changed, 38 insertions(+), 5 deletions(-)
> >>>>>>>>
> >>>>>>>
> >>>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
> >>>>>>> cpumap is still missing this.
> >>>>>
> >>>>> The only concern for having GRO in cpumap without metadata from the NIC
> >>>>> descriptor was that when the checksum status is missing, GRO calculates
> >>>>> the checksum on CPU, which is not really fast.
> >>>>> But I remember sometimes GRO was faster despite that.
> >>>>
> >>>> Good to know, thanks. IIUC some kind of XDP hint support landed already?
> >>>>
> >>>> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
> >>>
> >>> Unfortunately, for now it's impossible to get HW metadata such as RSS
> >>> hash and checksum status in cpumap. They're implemented via kfuncs
> >>> specific to a particular netdevice and this info is available only when
> >>> running XDP prog.
> >>>
> >>> But I think one solution could be:
> >>>
> >>> 1. We create some generic structure for cpumap, like
> >>>
> >>> struct cpumap_meta {
> >>> 	u32 magic;
> >>> 	u32 hash;
> >>> }
> >>>
> >>> 2. We add such check in the cpumap code
> >>>
> >>> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
> >>> 	    <here we check magic>)
> >>> 		skb->hash = meta->hash;
> >>>
> >>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
> >>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
> >> 
> >> Yes, except don't make this cpumap-specific, make it generic for kernel
> >> consumption of the metadata. That way it doesn't even have to be stored
> >> in the xdp metadata area, it can be anywhere we want (and hence not
> >> subject to ABI issues), and we can use it for skb creation after
> >> redirect in other places than cpumap as well (say, on veth devices).
> >> 
> >> So it'll be:
> >> 
> >> struct kernel_meta {
> >> 	u32 hash;
> >> 	u32 timestamp;
> >>         ...etc
> >> }
> >> 
> >> and a kfunc:
> >> 
> >> void store_xdp_kernel_meta(struct kernel meta *meta);
> >> 
> >> which the XDP program can call to populate the metadata area.
> >
> > Hmm, nice!
> >
> > But where to store this info in case of cpumap if not in xdp->data_meta?
> > When you convert XDP frames to skbs in the cpumap code, you only have
> > &xdp_frame and that's it. XDP prog was already run earlier from the
> > driver code at that point.
> 
> Well, we could put it in skb_shared_info? IIRC, some of the metadata
> (timestamps?) end up there when building an skb anyway, so we won't even
> have to copy it around...

Before vacation I started looking into it a bit; I will resume this work
in a week or so.

Regards,
Lorenzo

> 
> -Toke
>
Jakub Kicinski Aug. 13, 2024, 1:33 a.m. UTC | #16
On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote:
> But I think one solution could be:
> 
> 1. We create some generic structure for cpumap, like
> 
> struct cpumap_meta {
> 	u32 magic;
> 	u32 hash;
> }
> 
> 2. We add such check in the cpumap code
> 
> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
> 	    <here we check magic>)
> 		skb->hash = meta->hash;
> 
> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.

I wonder what the overhead of skb metadata allocation is in practice.
With Eric's "return skb to the CPU of origin" we can feed the lockless
skb cache on the right CPU, and also feed the lockless page pool
cache. I wonder if batched RFS wouldn't be faster than the XDP thing
that requires all the groundwork.
Jesper Dangaard Brouer Aug. 13, 2024, 9:51 a.m. UTC | #17
On 13/08/2024 03.33, Jakub Kicinski wrote:
> On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote:
>> But I think one solution could be:
>>
>> 1. We create some generic structure for cpumap, like
>>
>> struct cpumap_meta {
>> 	u32 magic;
>> 	u32 hash;
>> }
>>
>> 2. We add such check in the cpumap code
>>
>> 	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
>> 	    <here we check magic>)
>> 		skb->hash = meta->hash;
>>
>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
> 
> I wonder what the overhead of skb metadata allocation is in practice.
> With Eric's "return skb to the CPU of origin" we can feed the lockless
> skb cache one the right CPU, and also feed the lockless page pool
> cache. I wonder if batched RFS wouldn't be faster than the XDP thing
> that requires all the groundwork.

I explicitly developed CPUMAP because I was benchmarking Receive Flow
Steering (RFS) and Receive Packet Steering (RPS), which I observed to be
the bottleneck.  The overhead was too large on the RX-CPU, which became
the bottleneck due to RFS and RPS maintaining data structures to avoid
out-of-order packets.  The Flow Dissector step was also a limiting factor.

By bottleneck I mean it didn't scale, as the RX-CPU packets-per-second
processing speed was too low compared to the remote-CPU pps.
Digging in my old notes, I can see that RPS was limited to around 4.8
Mpps (and a weird case with part of it disabled showing 7.5 Mpps).  In
[1] the remote-CPU could process 2.7 Mpps (the starting point) when
dropping UDP packets due to UdpNoPorts being configured (baseline 3.3
Mpps if not remote), thus it only scales up to 1.78 remote-CPUs.  [1]
shows how optimizations bring the remote-CPU to handle 3.2 Mpps (close
to the 3.3 Mpps non-remote baseline). In [2] those optimizations bring
the remote-CPU to 4 Mpps (for the UdpNoPorts case).  XDP RX-redirect in
[1]+[2] was around 19 Mpps (which might be lower today due to perf
paper cuts).

  [1] 
https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org
  [2] 
https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org

Eric's "return skb to the CPU of origin" work should help improve the
case for the remote-CPU, as I was seeing some bottlenecks in how we
returned the memory.

--Jesper
Alexander Lobakin Aug. 13, 2024, 2:09 p.m. UTC | #18
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu, 8 Aug 2024 13:57:00 +0200

> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
> 
>>> Hi Alexander,
>>>
>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>> cpumap has its own BH context based on kthread. It has a sane batch
>>>> size of 8 frames per one cycle.
>>>> GRO can be used on its own, adjust cpumap calls to the
>>>> upper stack to use GRO API instead of netif_receive_skb_list() which
>>>> processes skbs by batches, but doesn't involve GRO layer at all.
>>>> It is most beneficial when a NIC which frame come from is XDP
>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>> than listed receiving even given that it has to calculate full frame
>>>> checksums on CPU.
>>>> As GRO passes the skbs to the upper stack in the batches of
>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
>>>> device where the frame comes from, it is enough to disable GRO
>>>> netdev feature on it to completely restore the original behaviour:
>>>> untouched frames will be being bulked and passed to the upper stack
>>>> by 8, as it was with netif_receive_skb_list().
>>>>
>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>> ---
>>>>  kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>  1 file changed, 38 insertions(+), 5 deletions(-)
>>>>
>>>
>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>> cpumap is still missing this.
> 
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.
> 
>>>
>>> I have a production use case for this now. We want to do some intelligent
>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>> our NICs support it. So we need a software fallback.
>>>
>>> Are you still interested in merging the cpumap + GRO patches?
> 
> For sure I can revive this part. I was planning to get back to this
> branch and pick patches which were not related to XDP hints and send
> them separately.
> 
>>
>> Hi Daniel and Alex,
>>
>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>   Here I added GRO support to cpumap through gro-cells.
>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>   Here I added GRO support to cpumap trough napi-threaded APIs (with a some
>>   changes to them).
> 
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated GRO structure from &napi_struct.
> 
> Let me maybe find some free time, I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?
> 
>>
>> Please note I have not run any performance tests so far, just verified it does
>> not crash (I was planning to resume this work soon). Please let me know if it
>> works for you.

I did tests on both the threaded NAPI for cpumap and my old
implementation with a traffic generator and got the following numbers
(in Kpps):

            direct Rx    direct GRO    cpumap    cpumap GRO
baseline    2900         5800          2700      2700 (N/A)
threaded                               2300      4000
old GRO                                2300      4000

IOW,

1. There are no differences in perf between Lorenzo's threaded NAPI
   GRO implementation and my old implementation, but Lorenzo's is also
   a very nice cleanup as it switches cpumap to threaded NAPI completely
   and the final diffstat even removes more lines than adds, while mine
   adds a bunch of lines and refactors a couple hundred, so I'd go with
   his variant.

2. After switching to NAPI, the performance without GRO decreases (2.3
   Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
   (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
   manually.

Note that the code is not polished to the top and I also have a good
improvement for allocating skb heads from the percpu NAPI cache in my
old tree which I'm planning to add to the series, so the final
improvement will be even bigger.

+ after we find how to pass checksum hint to cpumap, it will be yet
another big improvement for GRO (current code won't benefit from
this at all)

To Lorenzo:

Would it be fine if I prepare a series containing your patch for
threaded NAPI for cpumap (I'd polish it and break it into 2 or 3) +
the skb allocation optimization and send it, or would you rather send
it on your own? I'm fine with either; in the first case, everything
would land within one series with the respective credits; in the
latter case, I'd need to send a followup :)

>>
>> Regards,
>> Lorenzo
>>
>>>
>>> Thanks,
>>> Daniel

Thanks,
Olek
Toke Høiland-Jørgensen Aug. 13, 2024, 2:54 p.m. UTC | #19
Alexander Lobakin <aleksander.lobakin@intel.com> writes:

> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
>
>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>> 
>>>> Hi Alexander,
>>>>
>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>> cpumap has its own BH context based on kthread. It has a sane batch
>>>>> size of 8 frames per one cycle.
>>>>> GRO can be used on its own, adjust cpumap calls to the
>>>>> upper stack to use GRO API instead of netif_receive_skb_list() which
>>>>> processes skbs by batches, but doesn't involve GRO layer at all.
>>>>> It is most beneficial when a NIC which frame come from is XDP
>>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>>> than listed receiving even given that it has to calculate full frame
>>>>> checksums on CPU.
>>>>> As GRO passes the skbs to the upper stack in the batches of
>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
>>>>> device where the frame comes from, it is enough to disable GRO
>>>>> netdev feature on it to completely restore the original behaviour:
>>>>> untouched frames will be being bulked and passed to the upper stack
>>>>> by 8, as it was with netif_receive_skb_list().
>>>>>
>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>> ---
>>>>>  kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>>  1 file changed, 38 insertions(+), 5 deletions(-)
>>>>>
>>>>
>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>> cpumap is still missing this.
>> 
>> The only concern for having GRO in cpumap without metadata from the NIC
>> descriptor was that when the checksum status is missing, GRO calculates
>> the checksum on CPU, which is not really fast.
>> But I remember sometimes GRO was faster despite that.
>> 
>>>>
>>>> I have a production use case for this now. We want to do some intelligent
>>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>>> our NICs support it. So we need a software fallback.
>>>>
>>>> Are you still interested in merging the cpumap + GRO patches?
>> 
>> For sure I can revive this part. I was planning to get back to this
>> branch and pick patches which were not related to XDP hints and send
>> them separately.
>> 
>>>
>>> Hi Daniel and Alex,
>>>
>>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>>   Here I added GRO support to cpumap through gro-cells.
>>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>>   Here I added GRO support to cpumap trough napi-threaded APIs (with a some
>>>   changes to them).
>> 
>> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
>> overkill, that's why I separated GRO structure from &napi_struct.
>> 
>> Let me maybe find some free time, I would then test all 3 solutions
>> (mine, gro_cells, threaded NAPI) and pick/send the best?
>> 
>>>
>>> Please note I have not run any performance tests so far, just verified it does
>>> not crash (I was planning to resume this work soon). Please let me know if it
>>> works for you.
>
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
>
>             direct Rx    direct GRO    cpumap    cpumap GRO
> baseline    2900         5800          2700      2700 (N/A)
> threaded                               2300      4000
> old GRO                                2300      4000
>
> IOW,
>
> 1. There are no differences in perf between Lorenzo's threaded NAPI
>    GRO implementation and my old implementation, but Lorenzo's is also
>    a very nice cleanup as it switches cpumap to threaded NAPI completely
>    and the final diffstat even removes more lines than adds, while mine
>    adds a bunch of lines and refactors a couple hundred, so I'd go with
>    his variant.
>
> 2. After switching to NAPI, the performance without GRO decreases (2.3
>    Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>    (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>    manually.

One question for this: IIUC, the benefit of GRO varies with the traffic
mix, depending on how much the GRO logic can actually aggregate. So did
you test the pathological case as well (spraying packets over so many
flows that there is basically no aggregation taking place)? Just to make
sure we don't accidentally screw up performance in that case while
optimising for the aggregating case :)

-Toke
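
Toke's concern can be made concrete with a toy model (hypothetical numbers; it assumes each flow's packets within one poll merge into a single super-skb, ignoring GRO's size and count limits):

```python
def aggregation_factor(pkts_per_poll, n_flows):
    # Toy model: each active flow's packets within one poll merge into
    # one super-skb, so the stack sees roughly one skb per active flow.
    merged_skbs = min(n_flows, pkts_per_poll)
    return pkts_per_poll / merged_skbs

# One flow, 64-packet poll: every packet merges -> 64x fewer skbs
print(aggregation_factor(64, 1))     # 64.0
# Sprayed over 1000 flows: effectively no aggregation
print(aggregation_factor(64, 1000))  # 1.0
```

In the spraying case GRO does per-packet flow matching and flushing for nothing, which is exactly the overhead worth benchmarking.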
Jesper Dangaard Brouer Aug. 13, 2024, 3:57 p.m. UTC | #20
On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote:
> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> 
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>
>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>
>>>>> Hi Alexander,
>>>>>
>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>>> cpumap has its own BH context based on a kthread. It has a sane batch
>>>>>> size of 8 frames per cycle.
>>>>>> GRO can be used on its own; adjust cpumap's calls to the
>>>>>> upper stack to use the GRO API instead of netif_receive_skb_list(),
>>>>>> which processes skbs in batches but doesn't involve the GRO layer at all.
>>>>>> It is most beneficial when the NIC the frames come from is XDP
>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>>>> than listified receive even though it has to calculate full frame
>>>>>> checksums on the CPU.
>>>>>> As GRO passes the skbs to the upper stack in batches of
>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
>>>>>> device where the frame comes from, it is enough to disable the GRO
>>>>>> netdev feature on it to completely restore the original behaviour:
>>>>>> untouched frames will be bulked and passed to the upper stack
>>>>>> by 8, as with netif_receive_skb_list().
>>>>>>
>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>> ---
>>>>>>   kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>>>   1 file changed, 38 insertions(+), 5 deletions(-)
>>>>>>
>>>>>
>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>>> cpumap is still missing this.
>>>
>>> The only concern for having GRO in cpumap without metadata from the NIC
>>> descriptor was that when the checksum status is missing, GRO calculates
>>> the checksum on CPU, which is not really fast.
>>> But I remember sometimes GRO was faster despite that.
>>>
>>>>>
>>>>> I have a production use case for this now. We want to do some intelligent
>>>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>>>> our NICs support it. So we need a software fallback.
>>>>>
>>>>> Are you still interested in merging the cpumap + GRO patches?
>>>
>>> For sure I can revive this part. I was planning to get back to this
>>> branch and pick patches which were not related to XDP hints and send
>>> them separately.
>>>
>>>>
>>>> Hi Daniel and Alex,
>>>>
>>>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>>>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>>>    Here I added GRO support to cpumap through gro-cells.
>>>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>>>    Here I added GRO support to cpumap through napi-threaded APIs (with some
>>>>    changes to them).
>>>
>>> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
>>> overkill, that's why I separated GRO structure from &napi_struct.
>>>
>>> Let me maybe find some free time, I would then test all 3 solutions
>>> (mine, gro_cells, threaded NAPI) and pick/send the best?
>>>
>>>>
>>>> Please note I have not run any performance tests so far, just verified it does
>>>> not crash (I was planning to resume this work soon). Please let me know if it
>>>> works for you.
>>
>> I did tests on both threaded NAPI for cpumap and my old implementation
>> with a traffic generator and I have the following (in Kpps):
>>

What kind of traffic is the traffic generator sending?

E.g. is this a type of traffic that gets GRO aggregated?

>>              direct Rx    direct GRO    cpumap    cpumap GRO
>> baseline    2900         5800          2700      2700 (N/A)
>> threaded                               2300      4000
>> old GRO                                2300      4000
>>

Nice results. Just to confirm, the units are in Kpps.


>> IOW,
>>
>> 1. There are no differences in perf between Lorenzo's threaded NAPI
>>     GRO implementation and my old implementation, but Lorenzo's is also
>>     a very nice cleanup as it switches cpumap to threaded NAPI completely
>>     and the final diffstat even removes more lines than adds, while mine
>>     adds a bunch of lines and refactors a couple hundred, so I'd go with
>>     his variant.
>>
>> 2. After switching to NAPI, the performance without GRO decreases (2.3
>>     Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>>     (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>>     manually.
> 
> One question for this: IIUC, the benefit of GRO varies with the traffic
> mix, depending on how much the GRO logic can actually aggregate. So did
> you test the pathological case as well (spraying packets over so many
> flows that there is basically no aggregation taking place)? Just to make
> sure we don't accidentally screw up performance in that case while
> optimising for the aggregating case :)
> 

For the GRO use-case, I think a basic TCP stream throughput test (like
netperf) should show a benefit once cpumap enables GRO. Can you confirm this?
Or does the missing hardware RX-hash and RX-checksum cause TCP GRO not
to fully work, yet?

Thanks A LOT for doing this benchmarking!
--Jesper
Lorenzo Bianconi Aug. 13, 2024, 4:14 p.m. UTC | #21
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
> 
> [...]
> 
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
> 
>             direct Rx    direct GRO    cpumap    cpumap GRO
> baseline    2900         5800          2700      2700 (N/A)
> threaded                               2300      4000
> old GRO                                2300      4000

cool, very nice improvement

> 
> [...]
> 
> Note that the code is not polished to the top and I also have a good
> improvement for allocating skb heads from the percpu NAPI cache in my
> old tree which I'm planning to add to the series, so the final
> improvement will be even bigger.
> 
> + after we find how to pass checksum hint to cpumap, it will be yet
> another big improvement for GRO (current code won't benefit from
> this at all)
> 
> To Lorenzo:
> 
> Would it be fine if I prepare a series containing your patch for
> threaded NAPI for cpumap (I'd polish it and break into 2 or 3) +
> skb allocation optimization and send it OR you wanted to send this
> on your own? I'm fine with either, in the first case, everything
> would land within one series with the respective credits; in case
> of the latter, I'd need to send a followup :)

Sure, I am fine to send my codebase into a bigger series.
Thanks a lot for testing :)

Regards,
Lorenzo

Lorenzo Bianconi Aug. 13, 2024, 4:27 p.m. UTC | #22
On Aug 13, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
> 
> [...]
> 
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
> 
>             direct Rx    direct GRO    cpumap    cpumap GRO
> baseline    2900         5800          2700      2700 (N/A)
> threaded                               2300      4000
> old GRO                                2300      4000

out of curiosity, have you tested the gro_cells one as well?

Lorenzo

Alexander Lobakin Aug. 13, 2024, 4:31 p.m. UTC | #23
From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Tue, 13 Aug 2024 18:27:41 +0200

> On Aug 13, Alexander Lobakin wrote:
>> [...]
>>
>> I did tests on both threaded NAPI for cpumap and my old implementation
>> with a traffic generator and I have the following (in Kpps):
>>
>>             direct Rx    direct GRO    cpumap    cpumap GRO
>> baseline    2900         5800          2700      2700 (N/A)
>> threaded                               2300      4000
>> old GRO                                2300      4000
> 
> out of curiosity, have you tested the gro_cells one as well?

I haven't. I could, but I doubt that cpumap's kthread + a separate NAPI
would give better results than the merged NAPI + kthread.

> 
> Lorenzo

Thanks,
Olek
Alexander Lobakin Aug. 19, 2024, 2:50 p.m. UTC | #24
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 13 Aug 2024 17:57:44 +0200

> 
> 
> On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote:
>> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>>
>>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>>
>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>
>>>>>> Hi Alexander,

[...]

>>> I did tests on both threaded NAPI for cpumap and my old implementation
>>> with a traffic generator and I have the following (in Kpps):
>>>
> 
> What kind of traffic is the traffic generator sending?
> 
> E.g. is this a type of traffic that gets GRO aggregated?

Yes. It's UDP, with UDP GRO enabled on the receiver.

> 
>>>              direct Rx    direct GRO    cpumap    cpumap GRO
>>> baseline    2900         5800          2700      2700 (N/A)
>>> threaded                               2300      4000
>>> old GRO                                2300      4000
>>>
> 
> Nice results. Just to confirm, the units are in Kpps.

Yes. I.e. cpumap was giving 2.7 Mpps without GRO, then 4.0 Mpps with it.
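
In other words, on these (test-setup-specific) numbers:

```python
cpumap_mpps = 2.7      # cpumap with netif_receive_skb_list()
cpumap_gro_mpps = 4.0  # cpumap with GRO (threaded NAPI)

# Relative improvement from enabling GRO in cpumap
speedup = cpumap_gro_mpps / cpumap_mpps
print(round(speedup, 2))  # 1.48, i.e. ~48% more packets per second
```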

> 
> 
>>> IOW,
>>>
>>> 1. There are no differences in perf between Lorenzo's threaded NAPI
>>>     GRO implementation and my old implementation, but Lorenzo's is also
>>>     a very nice cleanup as it switches cpumap to threaded NAPI
>>> completely
>>>     and the final diffstat even removes more lines than adds, while mine
>>>     adds a bunch of lines and refactors a couple hundred, so I'd go with
>>>     his variant.
>>>
>>> 2. After switching to NAPI, the performance without GRO decreases (2.3
>>>     Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>>>     (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>>>     manually.
>>
>> One question for this: IIUC, the benefit of GRO varies with the traffic
>> mix, depending on how much the GRO logic can actually aggregate. So did
>> you test the pathological case as well (spraying packets over so many
>> flows that there is basically no aggregation taking place)? Just to make
>> sure we don't accidentally screw up performance in that case while
>> optimising for the aggregating case :)
>>
> 
> For the GRO use-case, I think a basic TCP stream throughput test (like
> netperf) should show a benefit once cpumap enables GRO. Can you confirm
> this?

Yes, TCP benefits as well.

> Or does the missing hardware RX-hash and RX-checksum cause TCP GRO not
> to fully work, yet?

GRO works well for both TCP and UDP. The main bottleneck is that GRO
calculates the checksum manually on the CPU now, since there's no
checksum status from the NIC.
Also, missing Rx hash means GRO will place packets from every flow into
the same bucket, but it's not a big deal (they get compared layer by
layer anyway).
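
A sketch of the bucket selection being described, assuming the kernel's 8 GRO hash buckets and that a missing hardware hash leaves skb->hash at 0:

```python
GRO_HASH_BUCKETS = 8  # assumption: the kernel's per-NAPI bucket count

def gro_bucket(skb_hash):
    # GRO indexes its per-NAPI flow lists by the low bits of skb->hash
    return skb_hash & (GRO_HASH_BUCKETS - 1)

# With a hardware RX hash, different flows spread over the buckets:
print(sorted({gro_bucket(h) for h in (0x1a2b, 0x3c4d, 0x5e6f)}))  # [3, 5, 7]

# With no hardware hash (skb->hash stays 0 for every packet), all flows
# collapse into bucket 0 and are told apart by header comparison instead:
print(gro_bucket(0))  # 0
```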

> 
> Thanks A LOT for doing this benchmarking!

I optimized the code a bit and picked my old patches for bulk NAPI skb
cache allocation and today I got 4.7 Mpps 
Daniel Xu Aug. 21, 2024, 12:29 a.m. UTC | #25
Hi Olek,

On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
[..]
> > Thanks A LOT for doing this benchmarking!
> 
> I optimized the code a bit and picked my old patches for bulk NAPI skb
> cache allocation and today I got 4.7 Mpps 
Alexander Lobakin Aug. 21, 2024, 1:16 p.m. UTC | #26
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Tue, 20 Aug 2024 17:29:45 -0700

> Hi Olek,
> 
> On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> [..]
>>> Thanks A LOT for doing this benchmarking!
>>
>> I optimized the code a bit and picked my old patches for bulk NAPI skb
>> cache allocation and today I got 4.7 Mpps 
Daniel Xu Aug. 21, 2024, 4:36 p.m. UTC | #27
On Wed, Aug 21, 2024 at 03:16:51PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Tue, 20 Aug 2024 17:29:45 -0700
> 
> > Hi Olek,
> > 
> > On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> > [..]
> >>> Thanks A LOT for doing this benchmarking!
> >>
> >> I optimized the code a bit and picked my old patches for bulk NAPI skb
> >> cache allocation and today I got 4.7 Mpps 
diff mbox series

Patch

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index f4860ac756cd..2d0edf8f6a05 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -29,8 +29,8 @@ 
 #include <trace/events/xdp.h>
 #include <linux/btf_ids.h>
 
-#include <linux/netdevice.h>   /* netif_receive_skb_list */
-#include <linux/etherdevice.h> /* eth_type_trans */
+#include <linux/netdevice.h>
+#include <net/gro.h>
 
 /* General idea: XDP packets getting XDP redirected to another CPU,
  * will maximum be stored/queued for one driver ->poll() call.  It is
@@ -67,6 +67,8 @@  struct bpf_cpu_map_entry {
 	struct bpf_cpumap_val value;
 	struct bpf_prog *prog;
 
+	struct gro_node gro;
+
 	atomic_t refcnt; /* Control when this struct can be free'ed */
 	struct rcu_head rcu;
 
@@ -162,6 +164,7 @@  static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 	if (atomic_dec_and_test(&rcpu->refcnt)) {
 		if (rcpu->prog)
 			bpf_prog_put(rcpu->prog);
+		gro_cleanup(&rcpu->gro);
 		/* The queue should be empty at this point */
 		__cpu_map_ring_cleanup(rcpu->queue);
 		ptr_ring_cleanup(rcpu->queue, NULL);
@@ -295,6 +298,33 @@  static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
 	return nframes;
 }
 
+static void cpu_map_gro_flush(struct bpf_cpu_map_entry *rcpu,
+			      struct list_head *list)
+{
+	bool new = !list_empty(list);
+
+	if (likely(new))
+		gro_receive_skb_list(&rcpu->gro, list);
+
+	if (rcpu->gro.bitmask) {
+		bool flush_old = HZ >= 1000;
+
+		/* If the ring is not empty, there'll be a new iteration
+		 * soon, and we only need to do a full flush if a tick is
+		 * long (> 1 ms).
+		 * If the ring is empty, to not hold GRO packets in the
+		 * stack for too long, do a full flush.
+		 * This is equivalent to how NAPI decides whether to perform
+	 * a full flush (in batches of up to 64 frames, though).
+		 */
+		if (__ptr_ring_empty(rcpu->queue))
+			flush_old = false;
+
+		__gro_flush(&rcpu->gro, flush_old);
+	}
+
+	gro_normal_list(&rcpu->gro);
+}
 
 static int cpu_map_kthread_run(void *data)
 {
@@ -384,7 +414,7 @@  static int cpu_map_kthread_run(void *data)
 
 			list_add_tail(&skb->list, &list);
 		}
-		netif_receive_skb_list(&list);
+		cpu_map_gro_flush(rcpu, &list);
 
 		/* Feedback loop via tracepoint */
 		trace_xdp_cpumap_kthread(rcpu->map_id, n, kmem_alloc_drops,
@@ -460,8 +490,10 @@  __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 	rcpu->map_id = map->id;
 	rcpu->value.qsize  = value->qsize;
 
+	gro_init(&rcpu->gro, NULL);
+
 	if (fd > 0 && __cpu_map_load_bpf_program(rcpu, map, fd))
-		goto free_ptr_ring;
+		goto free_gro;
 
 	/* Setup kthread */
 	rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
@@ -482,7 +514,8 @@  __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
 free_prog:
 	if (rcpu->prog)
 		bpf_prog_put(rcpu->prog);
-free_ptr_ring:
+free_gro:
+	gro_cleanup(&rcpu->gro);
 	ptr_ring_cleanup(rcpu->queue, NULL);
 free_queue:
 	kfree(rcpu->queue);
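
For reference, the flush decision encoded in cpu_map_gro_flush() above can be restated as a standalone sketch (a Python stand-in for the C logic, with HZ and the ring state as plain parameters):

```python
def flush_old_only(hz, ring_empty):
    # Mirrors cpu_map_gro_flush(): True means flush only "old" GRO
    # packets; False means do a full flush.
    if ring_empty:
        # No new batch coming soon: flush everything so packets don't
        # linger in the GRO layer.
        return False
    # More frames are queued: a full flush is only needed when a tick
    # is long (> 1 ms, i.e. HZ < 1000).
    return hz >= 1000

print(flush_old_only(1000, ring_empty=False))  # True: hold young flows for merging
print(flush_old_only(1000, ring_empty=True))   # False: ring drained, flush fully
print(flush_old_only(250, ring_empty=False))   # False: 4 ms tick, bound latency
```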