| Message ID | 20240301193740.3436871-3-edumazet@google.com (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | bd56a29c7a4ebcd3ca69505a2e676449e60965f3 |
| Delegated to: | Netdev Maintainers |
| Series | net: gro: cleanups and fast path refinement |
On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> Change skb_gro_network_header() to accept a const sk_buff
> and to no longer check if frag0 is NULL or not.
>
> This allows to remove skb_gro_frag0_invalidate()
> which is seen in profiles when header-split is enabled.

I have a few questions to help me understand this patchset better:

skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
callers?) because it's called multiple times for each aggregated packet,
right? I would have guessed that writing the same cacheline multiple
times should not be too expensive per se.

perf did not allow me to easily observe the mentioned cost here, because
the function is inlined in many different places; I'm wondering how you
noticed it?

Thanks!

Paolo
On Mon, Mar 4, 2024 at 9:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> > Change skb_gro_network_header() to accept a const sk_buff
> > and to no longer check if frag0 is NULL or not.
> >
> > This allows to remove skb_gro_frag0_invalidate()
> > which is seen in profiles when header-split is enabled.
>
> I have a few questions to help me understand this patchset better:
>
> skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
> callers?) because it's called multiple times for each aggregated packet,
> right? I would have guessed that writing the same cacheline multiple
> times should not be too expensive per se.

Apparently some (not very recent) Intel CPUs have issues (at least with
clang-generated code) with immediate reloads after a write.

I also saw some strange artifacts on ARM64 CPUs, but it is hard to say;
I found perf to be not very precise on them.

> perf did not allow me to easily observe the mentioned cost here, because
> the function is inlined in many different places; I'm wondering how you
> noticed it?
It is more about the whole patchset, really. This gave me about a 4%
improvement on a saturated CPU (RFS enabled, Intel(R) Xeon(R) Gold 6268L
CPU @ 2.80GHz).

One TCP flow (1500 MTU):

New profile (6,233,000 pkts per second)
19.76% [kernel] [k] gq_rx_napi_handler
11.19% [kernel] [k] dev_gro_receive
 8.05% [kernel] [k] ipv6_gro_receive
 7.98% [kernel] [k] tcp_gro_receive
 7.25% [kernel] [k] skb_gro_receive
 5.47% [kernel] [k] gq_rx_prep_buffers
 4.39% [kernel] [k] skb_release_data
 3.91% [kernel] [k] tcp6_gro_receive
 3.55% [kernel] [k] csum_ipv6_magic
 3.06% [kernel] [k] napi_gro_frags
 2.76% [kernel] [k] napi_reuse_skb

Old profile (5,950,000 pkts per second)
17.92% [kernel] [k] gq_rx_napi_handler
10.22% [kernel] [k] dev_gro_receive
 8.60% [kernel] [k] tcp_gro_receive
 8.09% [kernel] [k] ipv6_gro_receive
 8.06% [kernel] [k] skb_gro_receive
 6.74% [kernel] [k] gq_rx_prep_buffers
 4.82% [kernel] [k] skb_release_data
 3.82% [kernel] [k] tcp6_gro_receive
 3.76% [kernel] [k] csum_ipv6_magic
 2.97% [kernel] [k] napi_gro_frags
 2.57% [kernel] [k] napi_reuse_skb
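[Editor's note] For readers outside the thread, here is a minimal user-space sketch of the store-then-immediate-reload pattern being discussed. The struct and helper names (`gro_cb`, `frag0_invalidate`, `network_header`) are simplified stand-ins, not the real kernel layout: before this patch, skb_gro_header_slow() zeroed frag0/frag0_len in the GRO control block, and skb_gro_network_header() then reloaded frag0 from the same cacheline to test it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for NAPI_GRO_CB(skb); not the real kernel layout. */
struct gro_cb {
	void *frag0;
	unsigned int frag0_len;
};

/* Mirrors the removed skb_gro_frag0_invalidate(): stores into the CB cacheline. */
static void frag0_invalidate(struct gro_cb *cb)
{
	cb->frag0 = NULL;
	cb->frag0_len = 0;
}

/* Mirrors the old skb_gro_network_header(): a load of frag0 that, on the
 * header-split path, closely follows the stores above -- the
 * store-to-load sequence that showed up in profiles on some CPUs. */
static void *network_header(const struct gro_cb *cb, unsigned char *data,
			    unsigned int offset)
{
	return (cb->frag0 ? (unsigned char *)cb->frag0 : data) + offset;
}
```

After the patch the NULL test is gone: skb_gro_network_header() decides via skb_gro_may_pull() (a frag0_len comparison) instead, so the invalidation stores and the dependent reload disappear from the fast path.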
On Mon, 2024-03-04 at 10:06 +0100, Eric Dumazet wrote:
> On Mon, Mar 4, 2024 at 9:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> > > Change skb_gro_network_header() to accept a const sk_buff
> > > and to no longer check if frag0 is NULL or not.
> > >
> > > This allows to remove skb_gro_frag0_invalidate()
> > > which is seen in profiles when header-split is enabled.
> >
> > I have a few questions to help me understand this patchset better:
> >
> > skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
> > callers?) because it's called multiple times for each aggregated packet,
> > right? I would have guessed that writing the same cacheline multiple
> > times should not be too expensive per se.
>
> Apparently some (not very recent) Intel CPUs have issues (at least with
> clang-generated code) with immediate reloads after a write.

Ah! I *think* I have observed the same even with gcc: accessing some CB
fields just after the initial zeroing popped up in perf report.

> I also saw some strange artifacts on ARM64 CPUs, but it is hard to say;
> I found perf to be not very precise on them.
>
> > perf did not allow me to easily observe the mentioned cost here, because
> > the function is inlined in many different places; I'm wondering how you
> > noticed it?
>
> It is more about the whole patchset, really. This gave me about a 4%
> improvement on a saturated CPU (RFS enabled, Intel(R) Xeon(R) Gold 6268L
> CPU @ 2.80GHz).
>
> One TCP flow (1500 MTU):
>
> New profile (6,233,000 pkts per second)
> 19.76% [kernel] [k] gq_rx_napi_handler
> 11.19% [kernel] [k] dev_gro_receive
> 8.05% [kernel] [k] ipv6_gro_receive
> 7.98% [kernel] [k] tcp_gro_receive
> 7.25% [kernel] [k] skb_gro_receive
> 5.47% [kernel] [k] gq_rx_prep_buffers
> 4.39% [kernel] [k] skb_release_data
> 3.91% [kernel] [k] tcp6_gro_receive
> 3.55% [kernel] [k] csum_ipv6_magic
> 3.06% [kernel] [k] napi_gro_frags
> 2.76% [kernel] [k] napi_reuse_skb
>
> Old profile (5,950,000 pkts per second)
> 17.92% [kernel] [k] gq_rx_napi_handler
> 10.22% [kernel] [k] dev_gro_receive
> 8.60% [kernel] [k] tcp_gro_receive
> 8.09% [kernel] [k] ipv6_gro_receive
> 8.06% [kernel] [k] skb_gro_receive
> 6.74% [kernel] [k] gq_rx_prep_buffers
> 4.82% [kernel] [k] skb_release_data
> 3.82% [kernel] [k] tcp6_gro_receive
> 3.76% [kernel] [k] csum_ipv6_magic
> 2.97% [kernel] [k] napi_gro_frags
> 2.57% [kernel] [k] napi_reuse_skb

Thanks for the detailed info! I'll try to benchmark this on a driver that
does not use napi_gro_frags, but don't hold your breath meanwhile!

Cheers,

Paolo
Eric Dumazet wrote:
> New profile (6,233,000 pkts per second)
> 19.76% [kernel] [k] gq_rx_napi_handler
> 11.19% [kernel] [k] dev_gro_receive
> 8.05% [kernel] [k] ipv6_gro_receive
> 7.98% [kernel] [k] tcp_gro_receive
> 7.25% [kernel] [k] skb_gro_receive
> 5.47% [kernel] [k] gq_rx_prep_buffers
> 4.39% [kernel] [k] skb_release_data
> 3.91% [kernel] [k] tcp6_gro_receive
> 3.55% [kernel] [k] csum_ipv6_magic
> 3.06% [kernel] [k] napi_gro_frags
> 2.76% [kernel] [k] napi_reuse_skb
>
> Old profile (5,950,000 pkts per second)
> 17.92% [kernel] [k] gq_rx_napi_handler
> 10.22% [kernel] [k] dev_gro_receive
> 8.60% [kernel] [k] tcp_gro_receive
> 8.09% [kernel] [k] ipv6_gro_receive
> 8.06% [kernel] [k] skb_gro_receive
> 6.74% [kernel] [k] gq_rx_prep_buffers
> 4.82% [kernel] [k] skb_release_data
> 3.82% [kernel] [k] tcp6_gro_receive
> 3.76% [kernel] [k] csum_ipv6_magic
> 2.97% [kernel] [k] napi_gro_frags
> 2.57% [kernel] [k] napi_reuse_skb

Overall looks like a great gain for GRO, less code for handling frag0 :)

Could you please share how to measure a <10% gain in pps in a stable
manner? While perf top is stable for me when testing CPU-bound tasks,
netperf pps measurements between 2 physical machines generate ~5-7%
noise when I try to measure.

Thanks
On Mon, Mar 4, 2024 at 2:04 PM Richard Gobert <richardbgobert@gmail.com> wrote:
> Overall looks like a great gain for GRO, less code for handling frag0 :)
>
> Could you please share how to measure a <10% gain in pps in a stable
> manner? While perf top is stable for me when testing CPU-bound tasks,
> netperf pps measurements between 2 physical machines generate ~5-7%
> noise when I try to measure.

The pps are measured without "perf top" running: use "sar -n DEV 5 5" or
another non-intrusive monitoring tool.

Quite often, noise comes from NUMA hosts, because the NIC is directly
attached to only one of the nodes.
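[Editor's note] A sketch of that measurement recipe as a script. The interface name, the interval, and the /proc/net/dev fallback for hosts without sysstat are illustrative additions, not from the thread:

```shell
#!/bin/sh
# Non-intrusive pps measurement, per the advice above:
#   - no "perf top" running while sampling
#   - read NIC counters from outside the hot path, e.g.:
#         sar -n DEV 5 5
# Fallback when sar is unavailable: derive the rate from /proc/net/dev
# deltas. IFACE and INTERVAL defaults are examples only.
IFACE=${IFACE:-lo}
INTERVAL=${INTERVAL:-2}

# Print the cumulative rx packet counter for $IFACE (3rd field of its row).
rx_packets() {
	awk -v ifc="$IFACE:" '$1 == ifc { print $3 }' /proc/net/dev
}

rx1=$(rx_packets)
sleep "$INTERVAL"
rx2=$(rx_packets)
echo "rx pps on $IFACE: $(( (rx2 - rx1) / INTERVAL ))"
```

For NUMA noise, the usual countermeasure implied above is pinning the IRQs and the netperf/receiver processes to the node the NIC is attached to before sampling.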
diff --git a/include/net/gro.h b/include/net/gro.h
index ffc2c96d263b0399a81465d903a6181271b4a3f7..3c3666e46b3055caa619f2da0b6b8b20192a03b4 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -139,7 +139,7 @@ static inline void skb_gro_pull(struct sk_buff *skb, unsigned int len)
 	NAPI_GRO_CB(skb)->data_offset += len;
 }
 
-static inline void *skb_gro_header_fast(struct sk_buff *skb,
+static inline void *skb_gro_header_fast(const struct sk_buff *skb,
 					unsigned int offset)
 {
 	return NAPI_GRO_CB(skb)->frag0 + offset;
@@ -151,24 +151,17 @@ static inline bool skb_gro_may_pull(const struct sk_buff *skb,
 	return hlen <= NAPI_GRO_CB(skb)->frag0_len;
 }
 
-static inline void skb_gro_frag0_invalidate(struct sk_buff *skb)
-{
-	NAPI_GRO_CB(skb)->frag0 = NULL;
-	NAPI_GRO_CB(skb)->frag0_len = 0;
-}
-
 static inline void *skb_gro_header_slow(struct sk_buff *skb, unsigned int hlen,
 					unsigned int offset)
 {
 	if (!pskb_may_pull(skb, hlen))
 		return NULL;
-	skb_gro_frag0_invalidate(skb);
 	return skb->data + offset;
 }
 
-static inline void *skb_gro_header(struct sk_buff *skb,
-				   unsigned int hlen, unsigned int offset)
+static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
+				   unsigned int offset)
 {
 	void *ptr;
 
@@ -178,13 +171,16 @@ static inline void *skb_gro_header(struct sk_buff *skb,
 	return ptr;
 }
 
-static inline void *skb_gro_network_header(struct sk_buff *skb)
+static inline void *skb_gro_network_header(const struct sk_buff *skb)
 {
-	return (NAPI_GRO_CB(skb)->frag0 ?: skb->data) +
-	       skb_network_offset(skb);
+	if (skb_gro_may_pull(skb, skb_gro_offset(skb)))
+		return skb_gro_header_fast(skb, skb_network_offset(skb));
+
+	return skb_network_header(skb);
 }
 
-static inline __wsum inet_gro_compute_pseudo(struct sk_buff *skb, int proto)
+static inline __wsum inet_gro_compute_pseudo(const struct sk_buff *skb,
+					     int proto)
 {
 	const struct iphdr *iph = skb_gro_network_header(skb);
 
@@ -422,7 +418,8 @@ static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 	return uh;
 }
 
-static inline __wsum ip6_gro_compute_pseudo(struct sk_buff *skb, int proto)
+static inline __wsum ip6_gro_compute_pseudo(const struct sk_buff *skb,
+					    int proto)
 {
 	const struct ipv6hdr *iph = skb_gro_network_header(skb);
Change skb_gro_network_header() to accept a const sk_buff and to no
longer check if frag0 is NULL or not.

This allows to remove skb_gro_frag0_invalidate() which is seen in
profiles when header-split is enabled.

sk_buff parameter is constified for skb_gro_header_fast(),
inet_gro_compute_pseudo() and ip6_gro_compute_pseudo().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/gro.h | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)