Message ID: 7a01e4c7ddb84292cc284b6664c794b9a6e713a8.1707759574.git.asml.silence@gmail.com
State: Changes Requested
Delegated to: Netdev Maintainers
Series: [v2] net: cache for same cpu skb_attempt_defer_free
On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() executed by the CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
> 0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
> optimisation is for the receive path, but the test spends >55% of CPU
> doing writes.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v2: remove in_hardirq()
>
>  net/core/skbuff.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9b790994da0c..f32f358ef1d8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
>
> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> +{
> +	/* if SKB is a clone, don't handle this case */
> +	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> +		__kfree_skb(skb);
> +		return;
> +	}
> +
> +	local_bh_disable();
> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);

I am trying to understand why we use false instead of true here?

Or if you prefer:

	local_bh_disable();
	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
	local_bh_enable();

> +	napi_skb_cache_put(skb);
> +	local_bh_enable();
> +}
> +
>  /**
>   * skb_attempt_defer_free - queue skb for remote freeing
>   * @skb: buffer
> @@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>  	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>  	    !cpu_online(cpu) ||
>  	    cpu == raw_smp_processor_id()) {
> -nodefer:	__kfree_skb(skb);
> +nodefer:	kfree_skb_napi_cache(skb);
>  		return;
>  	}
>
> --
> 2.43.0
On 2/13/24 13:53, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() executed by the CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.
>>
>> Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
>> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
>> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
>> 0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
>> optimisation is for the receive path, but the test spends >55% of CPU
>> doing writes.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v2: remove in_hardirq()
>>
>>  net/core/skbuff.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 9b790994da0c..f32f358ef1d8 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
>>  EXPORT_SYMBOL(__skb_ext_put);
>>  #endif /* CONFIG_SKB_EXTENSIONS */
>>
>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>> +{
>> +	/* if SKB is a clone, don't handle this case */
>> +	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>> +		__kfree_skb(skb);
>> +		return;
>> +	}
>> +
>> +	local_bh_disable();
>> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
>
> I am trying to understand why we use false instead of true here ?
> Or if you prefer:
> local_bh_disable();
> __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> local_bh_enable();

Maybe it's my misunderstanding but disabled bh != "napi safe",
e.g. the napi_struct we're interested in might be scheduled for
another CPU. Which is also why "napi" prefix in percpu
napi_alloc_cache sounds a bit misleading to me.

The second reason is that it shouldn't change anything
performance wise

napi_pp_put_page(napi_safe) {
	...
	if (napi_safe || in_softirq()) { ... }
}

>> +	napi_skb_cache_put(skb);
>> +	local_bh_enable();
>> +}
>> +
>>  /**
>>   * skb_attempt_defer_free - queue skb for remote freeing
>>   * @skb: buffer
>> @@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>  	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>  	    !cpu_online(cpu) ||
>>  	    cpu == raw_smp_processor_id()) {
>> -nodefer:	__kfree_skb(skb);
>> +nodefer:	kfree_skb_napi_cache(skb);
>>  		return;
>>  	}
>>
>> --
>> 2.43.0
>>
On Tue, Feb 13, 2024 at 3:17 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 2/13/24 13:53, Eric Dumazet wrote:
> > On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> Optimise skb_attempt_defer_free() executed by the CPU the skb was
> >> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> >> disable softirqs and put the buffer into cpu local caches.
> >>
> >> Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
> >> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
> >> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
> >> 0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
> >> optimisation is for the receive path, but the test spends >55% of CPU
> >> doing writes.
> >>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>
> >> v2: remove in_hardirq()
> >>
> >>  net/core/skbuff.c | 16 +++++++++++++++-
> >>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index 9b790994da0c..f32f358ef1d8 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
> >>  EXPORT_SYMBOL(__skb_ext_put);
> >>  #endif /* CONFIG_SKB_EXTENSIONS */
> >>
> >> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> >> +{
> >> +	/* if SKB is a clone, don't handle this case */
> >> +	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> >> +		__kfree_skb(skb);
> >> +		return;
> >> +	}
> >> +
> >> +	local_bh_disable();
> >> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
> >
> > I am trying to understand why we use false instead of true here ?
> > Or if you prefer:
> > local_bh_disable();
> > __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> > local_bh_enable();
>
> Maybe it's my misunderstanding but disabled bh != "napi safe",
> e.g. the napi_struct we're interested in might be scheduled for
> another CPU. Which is also why "napi" prefix in percpu
> napi_alloc_cache sounds a bit misleading to me.

Indeed, this is very misleading.

napi_skb_cache_put() & napi_skb_cache_get() should be renamed eventually.

Thanks.
On Tue, 13 Feb 2024 14:07:24 +0000 Pavel Begunkov wrote:
> >> +	local_bh_disable();
> >> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
> >
> > I am trying to understand why we use false instead of true here ?
> > Or if you prefer:
> > local_bh_disable();
> > __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> > local_bh_enable();

FWIW I had the same reaction. napi_safe = false followed by
napi_skb_cache_put() looks sus. No argument that naming is bad,
not the first time it comes up :(

> Maybe it's my misunderstanding but disabled bh != "napi safe",
> e.g. the napi_struct we're interested in might be scheduled for
> another CPU. Which is also why "napi" prefix in percpu
> napi_alloc_cache sounds a bit misleading to me.

FWIW the skb recycling is called napi_* to hint to driver authors that
if they are in NAPI context this is a better function to call.
The connection to a particular NAPI instance matters only for the page
pool recycling, but that's handled. The conditions you actually
need to look out for are hardware IRQs and whatever async paths which
can trigger while NAPI is half way thru touching the cache of
the local CPU.

> The second reason is that it shouldn't change anything
> performance wise
>
> napi_pp_put_page(napi_safe) {
> 	...
> 	if (napi_safe || in_softirq()) { ... }
> }
On 2/14/24 03:13, Jakub Kicinski wrote:
> On Tue, 13 Feb 2024 14:07:24 +0000 Pavel Begunkov wrote:
>>>> +	local_bh_disable();
>>>> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
>>>
>>> I am trying to understand why we use false instead of true here ?
>>> Or if you prefer:
>>> local_bh_disable();
>>> __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>> local_bh_enable();
>
> FWIW I had the same reaction. napi_safe = false followed by
> napi_skb_cache_put() looks sus. No argument that naming is bad,
> not the first time it comes up :(
>
>> Maybe it's my misunderstanding but disabled bh != "napi safe",
>> e.g. the napi_struct we're interested in might be scheduled for
>> another CPU. Which is also why "napi" prefix in percpu
>> napi_alloc_cache sounds a bit misleading to me.
>
> FWIW the skb recycling is called napi_* to hint to driver authors that
> if they are in NAPI context this is a better function to call.

Which is absolutely reasonable, napi_skb_cache_put() on the other
hand is rather internal and wouldn't be used by drivers directly.
I guess I'll just do a little bit of renaming later, hopefully
after this patch is taken in, unless there are other comments /
objections.

> The connection to a particular NAPI instance matters only for the page
> pool recycling, but that's handled. The conditions you actually
> need to look out for are hardware IRQs and whatever async paths which
> can trigger while NAPI is half way thru touching the cache of
> the local CPU.

Yeah, the usual percpu (bh protected) caching stuff and the like,
and the question is rather about the semantics. To draw some
conclusion, do you suggest to change anything in the patch?

>> The second reason is that it shouldn't change anything
>> performance wise
>>
>> napi_pp_put_page(napi_safe) {
>> 	...
>> 	if (napi_safe || in_softirq()) { ... }
>> }
On Wed, 14 Feb 2024 16:37:41 +0000 Pavel Begunkov wrote:
> To draw some conclusion, do you suggest to change anything
> in the patch?

Yes, Eric's suggestion of __napi_kfree_skb() is correct.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9b790994da0c..f32f358ef1d8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+static void kfree_skb_napi_cache(struct sk_buff *skb)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	local_bh_disable();
+	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
+	napi_skb_cache_put(skb);
+	local_bh_enable();
+}
+
 /**
  * skb_attempt_defer_free - queue skb for remote freeing
  * @skb: buffer
@@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
 	    cpu == raw_smp_processor_id()) {
-nodefer:	__kfree_skb(skb);
+nodefer:	kfree_skb_napi_cache(skb);
 		return;
 	}
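Per the review feedback, a follow-up revision would presumably replace the open-coded skb_release_all() + napi_skb_cache_put() pair with Eric's suggested __napi_kfree_skb() call. A sketch of how the helper might then look (this is not the posted patch, only the reviewers' suggestion applied to it):

```c
static void kfree_skb_napi_cache(struct sk_buff *skb)
{
	/* if SKB is a clone, don't handle this case */
	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
		__kfree_skb(skb);
		return;
	}

	/* __napi_kfree_skb() releases the skb's data and puts the
	 * sk_buff itself into the cpu-local napi_alloc_cache; bh must
	 * be disabled while touching that cache.
	 */
	local_bh_disable();
	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
	local_bh_enable();
}
```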
Optimise skb_attempt_defer_free() executed by the CPU the skb was
allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into cpu local caches.

Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
with profiles, the total CPU share of skb_attempt_defer_free() dropped by
0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
optimisation is for the receive path, but the test spends >55% of CPU
doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
v2: remove in_hardirq()

 net/core/skbuff.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)