Message ID: 7a01e4c7ddb84292cc284b6664c794b9a6e713a8.1707759574.git.asml.silence@gmail.com
State: Changes Requested
Delegated to: Netdev Maintainers
Series: [v2] net: cache for same cpu skb_attempt_defer_free
On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() executed by the CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
> 0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
> optimisation is for the receive path, but the test spends >55% of CPU
> doing writes.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v2: remove in_hardirq()
>
>  net/core/skbuff.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9b790994da0c..f32f358ef1d8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
>
> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> +{
> +	/* if SKB is a clone, don't handle this case */
> +	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> +		__kfree_skb(skb);
> +		return;
> +	}
> +
> +	local_bh_disable();
> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);

I am trying to understand why we use false instead of true here?

Or if you prefer:

	local_bh_disable();
	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
	local_bh_enable();

> +	napi_skb_cache_put(skb);
> +	local_bh_enable();
> +}
> +
>  /**
>   * skb_attempt_defer_free - queue skb for remote freeing
>   * @skb: buffer
> @@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>  	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>  	    !cpu_online(cpu) ||
>  	    cpu == raw_smp_processor_id()) {
> -nodefer:	__kfree_skb(skb);
> +nodefer:	kfree_skb_napi_cache(skb);
>  		return;
>  	}
>
> --
> 2.43.0
On 2/13/24 13:53, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() executed by the CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.
>>
>> Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
>> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
>> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
>> 0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
>> optimisation is for the receive path, but the test spends >55% of CPU
>> doing writes.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v2: remove in_hardirq()
>>
>>  net/core/skbuff.c | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 9b790994da0c..f32f358ef1d8 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
>>  EXPORT_SYMBOL(__skb_ext_put);
>>  #endif /* CONFIG_SKB_EXTENSIONS */
>>
>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>> +{
>> +	/* if SKB is a clone, don't handle this case */
>> +	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>> +		__kfree_skb(skb);
>> +		return;
>> +	}
>> +
>> +	local_bh_disable();
>> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
>
> I am trying to understand why we use false instead of true here ?
> Or if you prefer:
> local_bh_disable();
> __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> local_bh_enable();

Maybe it's my misunderstanding but disabled bh != "napi safe",
e.g. the napi_struct we're interested in might be scheduled for
another CPU. Which is also why "napi" prefix in percpu
napi_alloc_cache sounds a bit misleading to me.

The second reason is that it shouldn't change anything
performance wise

napi_pp_put_page(napi_safe) {
	...
	if (napi_safe || in_softirq()) { ... }
}

>> +	napi_skb_cache_put(skb);
>> +	local_bh_enable();
>> +}
>> +
>>  /**
>>   * skb_attempt_defer_free - queue skb for remote freeing
>>   * @skb: buffer
>> @@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>  	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>  	    !cpu_online(cpu) ||
>>  	    cpu == raw_smp_processor_id()) {
>> -nodefer:	__kfree_skb(skb);
>> +nodefer:	kfree_skb_napi_cache(skb);
>>  		return;
>>  	}
>>
>> --
>> 2.43.0
>>
On Tue, Feb 13, 2024 at 3:17 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 2/13/24 13:53, Eric Dumazet wrote:
> > On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> Optimise skb_attempt_defer_free() executed by the CPU the skb was
> >> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> >> disable softirqs and put the buffer into cpu local caches.
> >>
> >> Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
> >> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
> >> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
> >> 0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
> >> optimisation is for the receive path, but the test spends >55% of CPU
> >> doing writes.
> >>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>
> >> v2: remove in_hardirq()
> >>
> >>  net/core/skbuff.c | 16 +++++++++++++++-
> >>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index 9b790994da0c..f32f358ef1d8 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
> >>  EXPORT_SYMBOL(__skb_ext_put);
> >>  #endif /* CONFIG_SKB_EXTENSIONS */
> >>
> >> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> >> +{
> >> +	/* if SKB is a clone, don't handle this case */
> >> +	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> >> +		__kfree_skb(skb);
> >> +		return;
> >> +	}
> >> +
> >> +	local_bh_disable();
> >> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
> >
> > I am trying to understand why we use false instead of true here ?
> > Or if you prefer:
> > local_bh_disable();
> > __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> > local_bh_enable();
>
> Maybe it's my misunderstanding but disabled bh != "napi safe",
> e.g. the napi_struct we're interested in might be scheduled for
> another CPU. Which is also why "napi" prefix in percpu
> napi_alloc_cache sounds a bit misleading to me.

Indeed, this is very misleading.

napi_skb_cache_put() & napi_skb_cache_get() should be renamed eventually.

Thanks.
On Tue, 13 Feb 2024 14:07:24 +0000 Pavel Begunkov wrote:
> >> +	local_bh_disable();
> >> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
> >
> > I am trying to understand why we use false instead of true here ?
> > Or if you prefer:
> > local_bh_disable();
> > __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> > local_bh_enable();

FWIW I had the same reaction. napi_safe = false followed by
napi_skb_cache_put() looks sus. No argument that naming is bad,
not the first time it comes up :(

> Maybe it's my misunderstanding but disabled bh != "napi safe",
> e.g. the napi_struct we're interested in might be scheduled for
> another CPU. Which is also why "napi" prefix in percpu
> napi_alloc_cache sounds a bit misleading to me.

FWIW the skb recycling is called napi_* to hint to driver authors that
if they are in NAPI context this is a better function to call.
The connection to a particular NAPI instance matters only for the page
pool recycling, but that's handled. The conditions you actually
need to look out for are hardware IRQs and whatever async paths which
can trigger while NAPI is half way thru touching the cache of
the local CPU.

> The second reason is that it shouldn't change anything
> performance wise
>
> napi_pp_put_page(napi_safe) {
> 	...
> 	if (napi_safe || in_softirq()) { ... }
> }
On 2/14/24 03:13, Jakub Kicinski wrote:
> On Tue, 13 Feb 2024 14:07:24 +0000 Pavel Begunkov wrote:
>>>> +	local_bh_disable();
>>>> +	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
>>>
>>> I am trying to understand why we use false instead of true here ?
>>> Or if you prefer:
>>> local_bh_disable();
>>> __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>> local_bh_enable();
>
> FWIW I had the same reaction. napi_safe = false followed by
> napi_skb_cache_put() looks sus. No argument that naming is bad,
> not the first time it comes up :(
>
>> Maybe it's my misunderstanding but disabled bh != "napi safe",
>> e.g. the napi_struct we're interested in might be scheduled for
>> another CPU. Which is also why "napi" prefix in percpu
>> napi_alloc_cache sounds a bit misleading to me.
>
> FWIW the skb recycling is called napi_* to hint to driver authors that
> if they are in NAPI context this is a better function to call.

Which is absolutely reasonable, napi_skb_cache_put() on the other
hand is rather internal and wouldn't be used by drivers directly.
I guess I'll just do a little bit of renaming later, hopefully
after this patch is taken in, unless there are other comments /
objections.

> The connection to a particular NAPI instance matters only for the page
> pool recycling, but that's handled. The conditions you actually
> need to look out for are hardware IRQs and whatever async paths which
> can trigger while NAPI is half way thru touching the cache of
> the local CPU.

Yeah, the usual percpu (bh protected) caching stuff and the like,
and the question is rather about the semantics. To draw some
conclusion, do you suggest to change anything in the patch?

>> The second reason is that it shouldn't change anything
>> performance wise
>>
>> napi_pp_put_page(napi_safe) {
>> 	...
>> 	if (napi_safe || in_softirq()) { ... }
>> }
On Wed, 14 Feb 2024 16:37:41 +0000 Pavel Begunkov wrote:
> To draw some conclusion, do you suggest to change anything
> in the patch?

Yes, Eric's suggestion of __napi_kfree_skb() is correct.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9b790994da0c..f32f358ef1d8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+static void kfree_skb_napi_cache(struct sk_buff *skb)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	local_bh_disable();
+	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
+	napi_skb_cache_put(skb);
+	local_bh_enable();
+}
+
 /**
  * skb_attempt_defer_free - queue skb for remote freeing
  * @skb: buffer
@@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
 	    cpu == raw_smp_processor_id()) {
-nodefer:	__kfree_skb(skb);
+nodefer:	kfree_skb_napi_cache(skb);
 		return;
 	}
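Per the review feedback, a follow-up revision would presumably replace the open-coded skb_release_all() + napi_skb_cache_put() pair with Eric's suggested __napi_kfree_skb() call. A sketch of how the helper might then look (this is not the posted patch, only the reviewers' suggestion applied to it):

```c
static void kfree_skb_napi_cache(struct sk_buff *skb)
{
	/* if SKB is a clone, don't handle this case */
	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
		__kfree_skb(skb);
		return;
	}

	/* __napi_kfree_skb() releases the skb's data and puts the
	 * sk_buff itself into the cpu-local napi_alloc_cache; bh must
	 * be disabled while touching that cache.
	 */
	local_bh_disable();
	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
	local_bh_enable();
}
```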
Optimise skb_attempt_defer_free() executed by the CPU the skb was
allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into cpu local caches.

Trying it with a TCP CPU bound ping pong benchmark (i.e. netbench), it
showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross checking
with profiles, the total CPU share of skb_attempt_defer_free() dropped by
0.6%. Note, I'd expect the win doubled with rx only benchmarks, as the
optimisation is for the receive path, but the test spends >55% of CPU
doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
v2: remove in_hardirq()

 net/core/skbuff.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)