diff mbox series

[net-next,2/4] net: gro: change skb_gro_network_header()

Message ID 20240301193740.3436871-3-edumazet@google.com (mailing list archive)
State Accepted
Commit bd56a29c7a4ebcd3ca69505a2e676449e60965f3
Delegated to: Netdev Maintainers
Series net: gro: cleanups and fast path refinement

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 968 this patch: 968
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 986 this patch: 986
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 63 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-05--06-00 (tests: 891)

Commit Message

Eric Dumazet March 1, 2024, 7:37 p.m. UTC
Change skb_gro_network_header() to accept a const sk_buff
and to no longer check if frag0 is NULL or not.

This allows removing skb_gro_frag0_invalidate(),
which is seen in profiles when header-split is enabled.

The sk_buff parameter is constified for skb_gro_header_fast(),
inet_gro_compute_pseudo() and ip6_gro_compute_pseudo().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/gro.h | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

Comments

Paolo Abeni March 4, 2024, 8:28 a.m. UTC | #1
On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> Change skb_gro_network_header() to accept a const sk_buff
> and to no longer check if frag0 is NULL or not.
> 
> This allows removing skb_gro_frag0_invalidate()
> which is seen in profiles when header-split is enabled.

I have a few questions to help me understand this patchset better:

skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
callers?) because it's called multiple times for each aggregated packet,
right? I would have guessed that writing the same cacheline multiple
times should not, per se, be too expensive.

perf here did not allow me to easily observe the mentioned cost,
because the function is inlined in many different places; I'm wondering
how you noticed it?

Thanks!

Paolo
Eric Dumazet March 4, 2024, 9:06 a.m. UTC | #2
On Mon, Mar 4, 2024 at 9:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> > Change skb_gro_network_header() to accept a const sk_buff
> > and to no longer check if frag0 is NULL or not.
> >
> > This allows removing skb_gro_frag0_invalidate()
> > which is seen in profiles when header-split is enabled.
>
> I have a few questions to help me understand this patchset better:
>
> skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
> callers?) because it's called multiple times for each aggregated packet,
> right? I would have guessed that writing the same cacheline multiple
> times should not, per se, be too expensive.

Apparently some (not very recent) Intel CPUs have issues (at least
with clang-generated code) with immediate reloads after a write.

I also saw some strange artifacts on ARM64 CPUs, but it is hard to say;
I found perf to be not very precise on them.

>
> perf here did not allow me to easily observe the mentioned cost,
> because the function is inlined in many different places; I'm wondering
> how you noticed it?

It is more about the whole patchset, really; this gave me about a 4%
improvement on a saturated CPU
(RFS enabled, Intel(R) Xeon(R) Gold 6268L CPU @ 2.80GHz).

One TCP flow : (1500 MTU)

New profile (6,233,000 pkts per second )
    19.76%  [kernel]       [k] gq_rx_napi_handler
    11.19%  [kernel]       [k] dev_gro_receive
     8.05%  [kernel]       [k] ipv6_gro_receive
     7.98%  [kernel]       [k] tcp_gro_receive
     7.25%  [kernel]       [k] skb_gro_receive
     5.47%  [kernel]       [k] gq_rx_prep_buffers
     4.39%  [kernel]       [k] skb_release_data
     3.91%  [kernel]       [k] tcp6_gro_receive
     3.55%  [kernel]       [k] csum_ipv6_magic
     3.06%  [kernel]       [k] napi_gro_frags
     2.76%  [kernel]       [k] napi_reuse_skb

Old profile (5,950,000 pkts per second)
    17.92%  [kernel]       [k] gq_rx_napi_handler
    10.22%  [kernel]       [k] dev_gro_receive
     8.60%  [kernel]       [k] tcp_gro_receive
     8.09%  [kernel]       [k] ipv6_gro_receive
     8.06%  [kernel]       [k] skb_gro_receive
     6.74%  [kernel]       [k] gq_rx_prep_buffers
     4.82%  [kernel]       [k] skb_release_data
     3.82%  [kernel]       [k] tcp6_gro_receive
     3.76%  [kernel]       [k] csum_ipv6_magic
     2.97%  [kernel]       [k] napi_gro_frags
     2.57%  [kernel]       [k] napi_reuse_skb
Paolo Abeni March 4, 2024, 10:29 a.m. UTC | #3
On Mon, 2024-03-04 at 10:06 +0100, Eric Dumazet wrote:
> On Mon, Mar 4, 2024 at 9:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
> > 
> > On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> > > Change skb_gro_network_header() to accept a const sk_buff
> > > and to no longer check if frag0 is NULL or not.
> > > 
> > > This allows removing skb_gro_frag0_invalidate()
> > > which is seen in profiles when header-split is enabled.
> > 
> > I have a few questions to help me understand this patchset better:
> > 
> > skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
> > callers?) because it's called multiple times for each aggregated packet,
> > right? I would have guessed that writing the same cacheline multiple
> > times should not, per se, be too expensive.
> 
> Apparently some (not very recent) Intel CPUs have issues (at least
> with clang-generated code) with immediate reloads after a write.

Ah! I *think* I have observed the same even with gcc (accessing some CB
fields just after the initial zeroing popped up in the perf report).

> I also saw some strange artifacts on ARM64 CPUs, but it is hard to say;
> I found perf to be not very precise on them.
> 
> > 
> > perf here did not allow me to easily observe the mentioned cost,
> > because the function is inlined in many different places; I'm wondering
> > how you noticed it?
> 
> It is more about the whole patchset, really; this gave me about a 4%
> improvement on a saturated CPU
> (RFS enabled, Intel(R) Xeon(R) Gold 6268L CPU @ 2.80GHz).
> 
> One TCP flow : (1500 MTU)
> 
> New profile (6,233,000 pkts per second )
>     19.76%  [kernel]       [k] gq_rx_napi_handler
>     11.19%  [kernel]       [k] dev_gro_receive
>      8.05%  [kernel]       [k] ipv6_gro_receive
>      7.98%  [kernel]       [k] tcp_gro_receive
>      7.25%  [kernel]       [k] skb_gro_receive
>      5.47%  [kernel]       [k] gq_rx_prep_buffers
>      4.39%  [kernel]       [k] skb_release_data
>      3.91%  [kernel]       [k] tcp6_gro_receive
>      3.55%  [kernel]       [k] csum_ipv6_magic
>      3.06%  [kernel]       [k] napi_gro_frags
>      2.76%  [kernel]       [k] napi_reuse_skb
> 
> Old profile (5,950,000 pkts per second)
>     17.92%  [kernel]       [k] gq_rx_napi_handler
>     10.22%  [kernel]       [k] dev_gro_receive
>      8.60%  [kernel]       [k] tcp_gro_receive
>      8.09%  [kernel]       [k] ipv6_gro_receive
>      8.06%  [kernel]       [k] skb_gro_receive
>      6.74%  [kernel]       [k] gq_rx_prep_buffers
>      4.82%  [kernel]       [k] skb_release_data
>      3.82%  [kernel]       [k] tcp6_gro_receive
>      3.76%  [kernel]       [k] csum_ipv6_magic
>      2.97%  [kernel]       [k] napi_gro_frags
>      2.57%  [kernel]       [k] napi_reuse_skb

Thanks for the detailed info! I'll try to benchmark this on a driver
that does not use napi_gro_frags, but don't hold your breath meanwhile!

Cheers,

Paolo
Richard Gobert March 4, 2024, 1:04 p.m. UTC | #4
Eric Dumazet wrote:
> New profile (6,233,000 pkts per second )
>     19.76%  [kernel]       [k] gq_rx_napi_handler
>     11.19%  [kernel]       [k] dev_gro_receive
>      8.05%  [kernel]       [k] ipv6_gro_receive
>      7.98%  [kernel]       [k] tcp_gro_receive
>      7.25%  [kernel]       [k] skb_gro_receive
>      5.47%  [kernel]       [k] gq_rx_prep_buffers
>      4.39%  [kernel]       [k] skb_release_data
>      3.91%  [kernel]       [k] tcp6_gro_receive
>      3.55%  [kernel]       [k] csum_ipv6_magic
>      3.06%  [kernel]       [k] napi_gro_frags
>      2.76%  [kernel]       [k] napi_reuse_skb
> 
> Old profile (5,950,000 pkts per second)
>     17.92%  [kernel]       [k] gq_rx_napi_handler
>     10.22%  [kernel]       [k] dev_gro_receive
>      8.60%  [kernel]       [k] tcp_gro_receive
>      8.09%  [kernel]       [k] ipv6_gro_receive
>      8.06%  [kernel]       [k] skb_gro_receive
>      6.74%  [kernel]       [k] gq_rx_prep_buffers
>      4.82%  [kernel]       [k] skb_release_data
>      3.82%  [kernel]       [k] tcp6_gro_receive
>      3.76%  [kernel]       [k] csum_ipv6_magic
>      2.97%  [kernel]       [k] napi_gro_frags
>      2.57%  [kernel]       [k] napi_reuse_skb

Overall looks like a great gain for GRO, less code for handling frag0 :)

Could you please share how to measure a <10% gain in pps in a stable
manner? While perf top is stable for me when testing CPU-bound tasks,
netperf pps measurements between 2 physical machines generate ~5-7%
noise when I try to measure.

Thanks
Eric Dumazet March 4, 2024, 1:14 p.m. UTC | #5
On Mon, Mar 4, 2024 at 2:04 PM Richard Gobert <richardbgobert@gmail.com> wrote:

> Overall looks like a great gain for GRO, less code for handling frag0 :)
>
> Could you please share how to measure a <10% gain in pps in a stable
> manner? While perf top is stable for me when testing CPU-bound tasks,
> netperf pps measurements between 2 physical machines generate ~5-7%
> noise when I try to measure.

The pps are measured without "perf top" running, using
"sar -n DEV 5 5" or another non-intrusive monitoring tool.

Quite often, noise comes from NUMA hosts, because the NIC is directly
attached to one of the nodes.

Patch

diff --git a/include/net/gro.h b/include/net/gro.h
index ffc2c96d263b0399a81465d903a6181271b4a3f7..3c3666e46b3055caa619f2da0b6b8b20192a03b4 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -139,7 +139,7 @@  static inline void skb_gro_pull(struct sk_buff *skb, unsigned int len)
 	NAPI_GRO_CB(skb)->data_offset += len;
 }
 
-static inline void *skb_gro_header_fast(struct sk_buff *skb,
+static inline void *skb_gro_header_fast(const struct sk_buff *skb,
 					unsigned int offset)
 {
 	return NAPI_GRO_CB(skb)->frag0 + offset;
@@ -151,24 +151,17 @@  static inline bool skb_gro_may_pull(const struct sk_buff *skb,
 	return hlen <= NAPI_GRO_CB(skb)->frag0_len;
 }
 
-static inline void skb_gro_frag0_invalidate(struct sk_buff *skb)
-{
-	NAPI_GRO_CB(skb)->frag0 = NULL;
-	NAPI_GRO_CB(skb)->frag0_len = 0;
-}
-
 static inline void *skb_gro_header_slow(struct sk_buff *skb, unsigned int hlen,
 					unsigned int offset)
 {
 	if (!pskb_may_pull(skb, hlen))
 		return NULL;
 
-	skb_gro_frag0_invalidate(skb);
 	return skb->data + offset;
 }
 
-static inline void *skb_gro_header(struct sk_buff *skb,
-					unsigned int hlen, unsigned int offset)
+static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
+				   unsigned int offset)
 {
 	void *ptr;
 
@@ -178,13 +171,16 @@  static inline void *skb_gro_header(struct sk_buff *skb,
 	return ptr;
 }
 
-static inline void *skb_gro_network_header(struct sk_buff *skb)
+static inline void *skb_gro_network_header(const struct sk_buff *skb)
 {
-	return (NAPI_GRO_CB(skb)->frag0 ?: skb->data) +
-	       skb_network_offset(skb);
+	if (skb_gro_may_pull(skb, skb_gro_offset(skb)))
+		return skb_gro_header_fast(skb, skb_network_offset(skb));
+
+	return skb_network_header(skb);
 }
 
-static inline __wsum inet_gro_compute_pseudo(struct sk_buff *skb, int proto)
+static inline __wsum inet_gro_compute_pseudo(const struct sk_buff *skb,
+					     int proto)
 {
 	const struct iphdr *iph = skb_gro_network_header(skb);
 
@@ -422,7 +418,8 @@  static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 	return uh;
 }
 
-static inline __wsum ip6_gro_compute_pseudo(struct sk_buff *skb, int proto)
+static inline __wsum ip6_gro_compute_pseudo(const struct sk_buff *skb,
+					    int proto)
 {
 	const struct ipv6hdr *iph = skb_gro_network_header(skb);