| Message ID | 20240301193740.3436871-3-edumazet@google.com (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | bd56a29c7a4ebcd3ca69505a2e676449e60965f3 |
| Delegated to: | Netdev Maintainers |
| Series | net: gro: cleanups and fast path refinement |
On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> Change skb_gro_network_header() to accept a const sk_buff
> and to no longer check if frag0 is NULL or not.
>
> This allows to remove skb_gro_frag0_invalidate()
> which is seen in profiles when header-split is enabled.

I have a few questions to help me understand this patchset better:

skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
callers?) because it's called multiple times for each aggregated packet,
right? I would have guessed that writing the same cacheline multiple
times should not be too expensive per se.

perf did not allow me to easily observe the mentioned cost here, because
the function is inlined in many different places; I'm wondering how you
noticed it?

Thanks!

Paolo
On Mon, Mar 4, 2024 at 9:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> > Change skb_gro_network_header() to accept a const sk_buff
> > and to no longer check if frag0 is NULL or not.
> >
> > This allows to remove skb_gro_frag0_invalidate()
> > which is seen in profiles when header-split is enabled.
>
> I have a few questions to help me understand this patchset better:
>
> skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
> callers?) because it's called multiple times for each aggregated packet,
> right? I would have guessed that writing the same cacheline multiple
> times should not be too expensive per se.

Apparently some (not very recent) Intel CPUs have issues (at least with
clang-generated code) with immediate reloads after a write.

I also saw some strange artifacts on ARM64 CPUs, but it is hard to say;
I found perf to be not very precise on them.

> perf did not allow me to easily observe the mentioned cost here, because
> the function is inlined in many different places; I'm wondering how you
> noticed it?
It is more about the whole patchset, really. This gave me about a 4%
improvement on a saturated CPU (RFS enabled, Intel(R) Xeon(R) Gold 6268L
CPU @ 2.80GHz).

One TCP flow (1500 MTU):

New profile (6,233,000 pkts per second)
19.76% [kernel] [k] gq_rx_napi_handler
11.19% [kernel] [k] dev_gro_receive
 8.05% [kernel] [k] ipv6_gro_receive
 7.98% [kernel] [k] tcp_gro_receive
 7.25% [kernel] [k] skb_gro_receive
 5.47% [kernel] [k] gq_rx_prep_buffers
 4.39% [kernel] [k] skb_release_data
 3.91% [kernel] [k] tcp6_gro_receive
 3.55% [kernel] [k] csum_ipv6_magic
 3.06% [kernel] [k] napi_gro_frags
 2.76% [kernel] [k] napi_reuse_skb

Old profile (5,950,000 pkts per second)
17.92% [kernel] [k] gq_rx_napi_handler
10.22% [kernel] [k] dev_gro_receive
 8.60% [kernel] [k] tcp_gro_receive
 8.09% [kernel] [k] ipv6_gro_receive
 8.06% [kernel] [k] skb_gro_receive
 6.74% [kernel] [k] gq_rx_prep_buffers
 4.82% [kernel] [k] skb_release_data
 3.82% [kernel] [k] tcp6_gro_receive
 3.76% [kernel] [k] csum_ipv6_magic
 2.97% [kernel] [k] napi_gro_frags
 2.57% [kernel] [k] napi_reuse_skb
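[Editor's note] For readers outside the thread, here is a minimal user-space sketch of the store-then-immediate-reload pattern being discussed. The struct and helper names (`gro_cb`, `frag0_invalidate`, `network_header`) are simplified stand-ins, not the real kernel layout: before this patch, skb_gro_header_slow() zeroed frag0/frag0_len in the GRO control block, and skb_gro_network_header() then reloaded frag0 from the same cacheline to test it.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for NAPI_GRO_CB(skb); not the real kernel layout. */
struct gro_cb {
	void *frag0;
	unsigned int frag0_len;
};

/* Mirrors the removed skb_gro_frag0_invalidate(): stores into the CB cacheline. */
static void frag0_invalidate(struct gro_cb *cb)
{
	cb->frag0 = NULL;
	cb->frag0_len = 0;
}

/* Mirrors the old skb_gro_network_header(): a load of frag0 that, on the
 * header-split path, closely follows the stores above -- the
 * store-to-load sequence that showed up in profiles on some CPUs. */
static void *network_header(const struct gro_cb *cb, unsigned char *data,
			    unsigned int offset)
{
	return (cb->frag0 ? (unsigned char *)cb->frag0 : data) + offset;
}
```

After the patch the NULL test is gone: skb_gro_network_header() decides via skb_gro_may_pull() (a frag0_len comparison) instead, so the invalidation stores and the dependent reload disappear from the fast path.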
On Mon, 2024-03-04 at 10:06 +0100, Eric Dumazet wrote:
> On Mon, Mar 4, 2024 at 9:28 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >
> > On Fri, 2024-03-01 at 19:37 +0000, Eric Dumazet wrote:
> > > Change skb_gro_network_header() to accept a const sk_buff
> > > and to no longer check if frag0 is NULL or not.
> > >
> > > This allows to remove skb_gro_frag0_invalidate()
> > > which is seen in profiles when header-split is enabled.
> >
> > I have a few questions to help me understand this patchset better:
> >
> > skb_gro_frag0_invalidate() shows up in profiles (for non-napi_frags_skb
> > callers?) because it's called multiple times for each aggregated packet,
> > right? I would have guessed that writing the same cacheline multiple
> > times should not be too expensive per se.
>
> Apparently some (not very recent) Intel CPUs have issues (at least with
> clang-generated code) with immediate reloads after a write.

Ah! I *think* I have observed the same even with gcc: accessing some CB
fields just after the initial zeroing popped up in perf report.

> I also saw some strange artifacts on ARM64 CPUs, but it is hard to say;
> I found perf to be not very precise on them.
>
> > perf did not allow me to easily observe the mentioned cost here, because
> > the function is inlined in many different places; I'm wondering how you
> > noticed it?
>
> It is more about the whole patchset, really. This gave me about a 4%
> improvement on a saturated CPU (RFS enabled, Intel(R) Xeon(R) Gold 6268L
> CPU @ 2.80GHz).
>
> One TCP flow (1500 MTU):
>
> New profile (6,233,000 pkts per second)
> 19.76% [kernel] [k] gq_rx_napi_handler
> 11.19% [kernel] [k] dev_gro_receive
> 8.05% [kernel] [k] ipv6_gro_receive
> 7.98% [kernel] [k] tcp_gro_receive
> 7.25% [kernel] [k] skb_gro_receive
> 5.47% [kernel] [k] gq_rx_prep_buffers
> 4.39% [kernel] [k] skb_release_data
> 3.91% [kernel] [k] tcp6_gro_receive
> 3.55% [kernel] [k] csum_ipv6_magic
> 3.06% [kernel] [k] napi_gro_frags
> 2.76% [kernel] [k] napi_reuse_skb
>
> Old profile (5,950,000 pkts per second)
> 17.92% [kernel] [k] gq_rx_napi_handler
> 10.22% [kernel] [k] dev_gro_receive
> 8.60% [kernel] [k] tcp_gro_receive
> 8.09% [kernel] [k] ipv6_gro_receive
> 8.06% [kernel] [k] skb_gro_receive
> 6.74% [kernel] [k] gq_rx_prep_buffers
> 4.82% [kernel] [k] skb_release_data
> 3.82% [kernel] [k] tcp6_gro_receive
> 3.76% [kernel] [k] csum_ipv6_magic
> 2.97% [kernel] [k] napi_gro_frags
> 2.57% [kernel] [k] napi_reuse_skb

Thanks for the detailed info! I'll try to benchmark this on a driver that
does not use napi_gro_frags, but don't hold your breath meanwhile!

Cheers,

Paolo
Eric Dumazet wrote:
> New profile (6,233,000 pkts per second)
> 19.76% [kernel] [k] gq_rx_napi_handler
> 11.19% [kernel] [k] dev_gro_receive
> 8.05% [kernel] [k] ipv6_gro_receive
> 7.98% [kernel] [k] tcp_gro_receive
> 7.25% [kernel] [k] skb_gro_receive
> 5.47% [kernel] [k] gq_rx_prep_buffers
> 4.39% [kernel] [k] skb_release_data
> 3.91% [kernel] [k] tcp6_gro_receive
> 3.55% [kernel] [k] csum_ipv6_magic
> 3.06% [kernel] [k] napi_gro_frags
> 2.76% [kernel] [k] napi_reuse_skb
>
> Old profile (5,950,000 pkts per second)
> 17.92% [kernel] [k] gq_rx_napi_handler
> 10.22% [kernel] [k] dev_gro_receive
> 8.60% [kernel] [k] tcp_gro_receive
> 8.09% [kernel] [k] ipv6_gro_receive
> 8.06% [kernel] [k] skb_gro_receive
> 6.74% [kernel] [k] gq_rx_prep_buffers
> 4.82% [kernel] [k] skb_release_data
> 3.82% [kernel] [k] tcp6_gro_receive
> 3.76% [kernel] [k] csum_ipv6_magic
> 2.97% [kernel] [k] napi_gro_frags
> 2.57% [kernel] [k] napi_reuse_skb

Overall looks like a great gain for GRO, less code for handling frag0 :)

Could you please share how to measure a <10% gain in pps in a stable
manner? While perf top is stable for me when testing CPU-bound tasks,
netperf pps measurements between 2 physical machines generate ~5-7%
noise when I try to measure.

Thanks
On Mon, Mar 4, 2024 at 2:04 PM Richard Gobert <richardbgobert@gmail.com> wrote:
> Overall looks like a great gain for GRO, less code for handling frag0 :)
>
> Could you please share how to measure a <10% gain in pps in a stable
> manner? While perf top is stable for me when testing CPU-bound tasks,
> netperf pps measurements between 2 physical machines generate ~5-7%
> noise when I try to measure.

The pps are measured without "perf top" running: use "sar -n DEV 5 5" or
another non-intrusive monitoring tool.

Quite often, noise comes from NUMA hosts, because the NIC is directly
attached to only one of the nodes.
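[Editor's note] A sketch of that measurement recipe as a script. The interface name, the interval, and the /proc/net/dev fallback for hosts without sysstat are illustrative additions, not from the thread:

```shell
#!/bin/sh
# Non-intrusive pps measurement, per the advice above:
#   - no "perf top" running while sampling
#   - read NIC counters from outside the hot path, e.g.:
#         sar -n DEV 5 5
# Fallback when sar is unavailable: derive the rate from /proc/net/dev
# deltas. IFACE and INTERVAL defaults are examples only.
IFACE=${IFACE:-lo}
INTERVAL=${INTERVAL:-2}

# Print the cumulative rx packet counter for $IFACE (3rd field of its row).
rx_packets() {
	awk -v ifc="$IFACE:" '$1 == ifc { print $3 }' /proc/net/dev
}

rx1=$(rx_packets)
sleep "$INTERVAL"
rx2=$(rx_packets)
echo "rx pps on $IFACE: $(( (rx2 - rx1) / INTERVAL ))"
```

For NUMA noise, the usual countermeasure implied above is pinning the IRQs and the netperf/receiver processes to the node the NIC is attached to before sampling.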
diff --git a/include/net/gro.h b/include/net/gro.h
index ffc2c96d263b0399a81465d903a6181271b4a3f7..3c3666e46b3055caa619f2da0b6b8b20192a03b4 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -139,7 +139,7 @@ static inline void skb_gro_pull(struct sk_buff *skb, unsigned int len)
 	NAPI_GRO_CB(skb)->data_offset += len;
 }
 
-static inline void *skb_gro_header_fast(struct sk_buff *skb,
+static inline void *skb_gro_header_fast(const struct sk_buff *skb,
 					unsigned int offset)
 {
 	return NAPI_GRO_CB(skb)->frag0 + offset;
@@ -151,24 +151,17 @@ static inline bool skb_gro_may_pull(const struct sk_buff *skb,
 	return hlen <= NAPI_GRO_CB(skb)->frag0_len;
 }
 
-static inline void skb_gro_frag0_invalidate(struct sk_buff *skb)
-{
-	NAPI_GRO_CB(skb)->frag0 = NULL;
-	NAPI_GRO_CB(skb)->frag0_len = 0;
-}
-
 static inline void *skb_gro_header_slow(struct sk_buff *skb, unsigned int hlen,
 					unsigned int offset)
 {
 	if (!pskb_may_pull(skb, hlen))
 		return NULL;
-	skb_gro_frag0_invalidate(skb);
 	return skb->data + offset;
 }
 
-static inline void *skb_gro_header(struct sk_buff *skb,
-				   unsigned int hlen, unsigned int offset)
+static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
+				   unsigned int offset)
 {
 	void *ptr;
 
@@ -178,13 +171,16 @@ static inline void *skb_gro_header(struct sk_buff *skb,
 	return ptr;
 }
 
-static inline void *skb_gro_network_header(struct sk_buff *skb)
+static inline void *skb_gro_network_header(const struct sk_buff *skb)
 {
-	return (NAPI_GRO_CB(skb)->frag0 ?: skb->data) +
-	       skb_network_offset(skb);
+	if (skb_gro_may_pull(skb, skb_gro_offset(skb)))
+		return skb_gro_header_fast(skb, skb_network_offset(skb));
+
+	return skb_network_header(skb);
 }
 
-static inline __wsum inet_gro_compute_pseudo(struct sk_buff *skb, int proto)
+static inline __wsum inet_gro_compute_pseudo(const struct sk_buff *skb,
+					     int proto)
 {
 	const struct iphdr *iph = skb_gro_network_header(skb);
 
@@ -422,7 +418,8 @@ static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 	return uh;
 }
 
-static inline __wsum ip6_gro_compute_pseudo(struct sk_buff *skb, int proto)
+static inline __wsum ip6_gro_compute_pseudo(const struct sk_buff *skb,
+					    int proto)
 {
 	const struct ipv6hdr *iph = skb_gro_network_header(skb);
Change skb_gro_network_header() to accept a const sk_buff and to no
longer check if frag0 is NULL or not.

This allows to remove skb_gro_frag0_invalidate() which is seen in
profiles when header-split is enabled.

sk_buff parameter is constified for skb_gro_header_fast(),
inet_gro_compute_pseudo() and ip6_gro_compute_pseudo().

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/gro.h | 27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)