
[v2] net: cache for same cpu skb_attempt_defer_free

Message ID: 7a01e4c7ddb84292cc284b6664c794b9a6e713a8.1707759574.git.asml.silence@gmail.com (mailing list archive)
State: Changes Requested
Delegated to: Netdev Maintainers
Series: [v2] net: cache for same cpu skb_attempt_defer_free

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 991 this patch: 991
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/build_clang success Errors and warnings before: 1007 this patch: 1007
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 1008 this patch: 1008
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 28 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-13--21-00 (tests: 1437)

Commit Message

Pavel Begunkov Feb. 13, 2024, 1:33 p.m. UTC
Optimise skb_attempt_defer_free() when it is executed on the CPU the skb
was allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
disable softirqs and put the buffer into CPU-local caches.

Trying it with a CPU-bound TCP ping-pong benchmark (i.e. netbench), it
showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross-checking
with profiles, the total CPU share of skb_attempt_defer_free() dropped by
0.6%. Note, I'd expect the win to double with rx-only benchmarks, as the
optimisation is for the receive path, but the test spends >55% of its CPU
time doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

v2: remove in_hardirq()

 net/core/skbuff.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)
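
For context, the CPU-local cache the patch feeds is the per-CPU
napi_alloc_cache in net/core/skbuff.c, which the NAPI rx allocation path
also draws from via napi_skb_cache_get(). A trimmed sketch of the put
side as of roughly v6.8-rc (KASAN hooks and unrelated fields elided, so
illustrative rather than exact):

struct napi_alloc_cache {
        struct page_frag_cache page;
        unsigned int skb_count;
        void *skb_cache[NAPI_SKB_CACHE_SIZE];
};

static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);

static void napi_skb_cache_put(struct sk_buff *skb)
{
        struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

        /* only safe with BHs disabled: NAPI on this CPU shares the cache */
        nc->skb_cache[nc->skb_count++] = skb;

        /* when full, bulk-free the upper half back to the slab cache */
        if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
                kmem_cache_free_bulk(skbuff_cache, NAPI_SKB_CACHE_HALF,
                                     nc->skb_cache + NAPI_SKB_CACHE_HALF);
                nc->skb_count = NAPI_SKB_CACHE_HALF;
        }
}

Parked skbs can then be handed straight back out on allocation, skipping
full kmem_cache_alloc()/kmem_cache_free() round trips, which is where
the throughput win above comes from.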

Comments

Eric Dumazet Feb. 13, 2024, 1:53 p.m. UTC | #1
On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() when it is executed on the CPU the skb
> was allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into CPU-local caches.
>
> Trying it with a CPU-bound TCP ping-pong benchmark (i.e. netbench), it
> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross-checking
> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
> 0.6%. Note, I'd expect the win to double with rx-only benchmarks, as the
> optimisation is for the receive path, but the test spends >55% of its CPU
> time doing writes.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v2: remove in_hardirq()
>
>  net/core/skbuff.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 9b790994da0c..f32f358ef1d8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
>
> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> +{
> +       /* if SKB is a clone, don't handle this case */
> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> +               __kfree_skb(skb);
> +               return;
> +       }
> +
> +       local_bh_disable();
> +       skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);

I am trying to understand why we use false instead of true here?
Or if you prefer:
local_bh_disable();
__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
local_bh_enable();


> +       napi_skb_cache_put(skb);
> +       local_bh_enable();
> +}
> +
>  /**
>   * skb_attempt_defer_free - queue skb for remote freeing
>   * @skb: buffer
> @@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>         if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>             !cpu_online(cpu) ||
>             cpu == raw_smp_processor_id()) {
> -nodefer:       __kfree_skb(skb);
> +nodefer:       kfree_skb_napi_cache(skb);
>                 return;
>         }
>
> --
> 2.43.0
>
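
For reference, the __napi_kfree_skb() helper Eric suggests already
exists in net/core/skbuff.c; at the time of this thread it is
essentially the open-coded pair from the patch, with napi_safe set to
true:

static void __napi_kfree_skb(struct sk_buff *skb, enum skb_drop_reason reason)
{
        skb_release_all(skb, reason, true);
        napi_skb_cache_put(skb);
}

So the question reduces to whether napi_safe = true is legitimate when
the only guarantee held is local_bh_disable(), which is what the rest of
the thread debates.
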
Pavel Begunkov Feb. 13, 2024, 2:07 p.m. UTC | #2
On 2/13/24 13:53, Eric Dumazet wrote:
> On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() when it is executed on the CPU the skb
>> was allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into CPU-local caches.
>>
>> Trying it with a CPU-bound TCP ping-pong benchmark (i.e. netbench), it
>> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross-checking
>> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
>> 0.6%. Note, I'd expect the win to double with rx-only benchmarks, as the
>> optimisation is for the receive path, but the test spends >55% of its CPU
>> time doing writes.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v2: remove in_hardirq()
>>
>>   net/core/skbuff.c | 16 +++++++++++++++-
>>   1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 9b790994da0c..f32f358ef1d8 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
>>   EXPORT_SYMBOL(__skb_ext_put);
>>   #endif /* CONFIG_SKB_EXTENSIONS */
>>
>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>> +{
>> +       /* if SKB is a clone, don't handle this case */
>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>> +               __kfree_skb(skb);
>> +               return;
>> +       }
>> +
>> +       local_bh_disable();
>> +       skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
> 
> I am trying to understand why we use false instead of true here?
> Or if you prefer:
> local_bh_disable();
> __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> local_bh_enable();

Maybe it's my misunderstanding, but disabled bh != "napi safe",
e.g. the napi_struct we're interested in might be scheduled for
another CPU. Which is also why the "napi" prefix in the percpu
napi_alloc_cache sounds a bit misleading to me.

The second reason is that it shouldn't change anything
performance-wise:

napi_pp_put_page(napi_safe) {
     ...
     if (napi_safe || in_softirq()) { ... }
}
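
To spell out why: in_softirq() only tests the softirq count in
preempt_count(), which local_bh_disable() raises, so it evaluates true
whenever BHs are disabled, not only in actual softirq context. From
include/linux/preempt.h:

#define softirq_count() (preempt_count() & SOFTIRQ_MASK)
/* true in softirq context or merely with BHs disabled */
#define in_softirq()    (softirq_count())

Inside the local_bh_disable() section above, napi_safe = false and
napi_safe = true therefore take the same branch.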


>> +       napi_skb_cache_put(skb);
>> +       local_bh_enable();
>> +}
>> +
>>   /**
>>    * skb_attempt_defer_free - queue skb for remote freeing
>>    * @skb: buffer
>> @@ -6965,7 +6979,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>              !cpu_online(cpu) ||
>>              cpu == raw_smp_processor_id()) {
>> -nodefer:       __kfree_skb(skb);
>> +nodefer:       kfree_skb_napi_cache(skb);
>>                  return;
>>          }
>>
>> --
>> 2.43.0
>>
Eric Dumazet Feb. 13, 2024, 2:28 p.m. UTC | #3
On Tue, Feb 13, 2024 at 3:17 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 2/13/24 13:53, Eric Dumazet wrote:
> > On Tue, Feb 13, 2024 at 2:42 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> Optimise skb_attempt_defer_free() when it is executed on the CPU the skb
> >> was allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> >> disable softirqs and put the buffer into CPU-local caches.
> >>
> >> Trying it with a CPU-bound TCP ping-pong benchmark (i.e. netbench), it
> >> showed a 1% throughput improvement (392.2 -> 396.4 Krps). Cross-checking
> >> with profiles, the total CPU share of skb_attempt_defer_free() dropped by
> >> 0.6%. Note, I'd expect the win to double with rx-only benchmarks, as the
> >> optimisation is for the receive path, but the test spends >55% of its CPU
> >> time doing writes.
> >>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>
> >> v2: remove in_hardirq()
> >>
> >>   net/core/skbuff.c | 16 +++++++++++++++-
> >>   1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index 9b790994da0c..f32f358ef1d8 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -6947,6 +6947,20 @@ void __skb_ext_put(struct skb_ext *ext)
> >>   EXPORT_SYMBOL(__skb_ext_put);
> >>   #endif /* CONFIG_SKB_EXTENSIONS */
> >>
> >> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> >> +{
> >> +       /* if SKB is a clone, don't handle this case */
> >> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> >> +               __kfree_skb(skb);
> >> +               return;
> >> +       }
> >> +
> >> +       local_bh_disable();
> >> +       skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
> >
> > I am trying to understand why we use false instead of true here?
> > Or if you prefer:
> > local_bh_disable();
> > __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> > local_bh_enable();
>
> Maybe it's my misunderstanding, but disabled bh != "napi safe",
> e.g. the napi_struct we're interested in might be scheduled for
> another CPU. Which is also why the "napi" prefix in the percpu
> napi_alloc_cache sounds a bit misleading to me.

Indeed, this is very misleading.

napi_skb_cache_put() & napi_skb_cache_get() should be renamed eventually.

Thanks.
Jakub Kicinski Feb. 14, 2024, 3:13 a.m. UTC | #4
On Tue, 13 Feb 2024 14:07:24 +0000 Pavel Begunkov wrote:
> >> +       local_bh_disable();
> >> +       skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);  
> > 
> > I am trying to understand why we use false instead of true here?
> > Or if you prefer:
> > local_bh_disable();
> > __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> > local_bh_enable();  

FWIW I had the same reaction. napi_safe = false followed by
napi_skb_cache_put() looks sus. No argument that naming is bad,
not the first time it comes up :(

> Maybe it's my misunderstanding, but disabled bh != "napi safe",
> e.g. the napi_struct we're interested in might be scheduled for
> another CPU. Which is also why the "napi" prefix in the percpu
> napi_alloc_cache sounds a bit misleading to me.

FWIW the skb recycling is called napi_* to hint to driver authors that
if they are in NAPI context this is a better function to call.

The connection to a particular NAPI instance matters only for the page
pool recycling, but that's handled. The conditions you actually
need to look out for are hardware IRQs and whatever async paths which
can trigger while NAPI is halfway thru touching the cache of
the local CPU.

> The second reason is that it shouldn't change anything
> performance-wise:
> 
> napi_pp_put_page(napi_safe) {
>      ...
>      if (napi_safe || in_softirq()) { ... }
> }
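
An existing example of guarding against such async callers is
napi_consume_skb(), which bails to the generic free path when called
with a zero budget, the convention for non-NAPI contexts like netpoll.
A trimmed version from net/core/skbuff.c (tracepoints and debug checks
dropped, details approximate):

void napi_consume_skb(struct sk_buff *skb, int budget)
{
        /* Zero budget indicates a non-NAPI caller, e.g. netpoll running
         * from hard IRQ context: don't touch the per-CPU cache */
        if (unlikely(!budget)) {
                dev_consume_skb_any(skb);
                return;
        }

        if (!skb_unref(skb))
                return;

        /* same clone bail-out as kfree_skb_napi_cache() in the patch */
        if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
                __kfree_skb(skb);
                return;
        }

        skb_release_all(skb, SKB_CONSUMED, !!budget);
        napi_skb_cache_put(skb);
}
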
Pavel Begunkov Feb. 14, 2024, 4:37 p.m. UTC | #5
On 2/14/24 03:13, Jakub Kicinski wrote:
> On Tue, 13 Feb 2024 14:07:24 +0000 Pavel Begunkov wrote:
>>>> +       local_bh_disable();
>>>> +       skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
>>>
>>> I am trying to understand why we use false instead of true here?
>>> Or if you prefer:
>>> local_bh_disable();
>>> __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>> local_bh_enable();
> 
> FWIW I had the same reaction. napi_safe = false followed by
> napi_skb_cache_put() looks sus. No argument that naming is bad,
> not the first time it comes up :(
> 
>> Maybe it's my misunderstanding, but disabled bh != "napi safe",
>> e.g. the napi_struct we're interested in might be scheduled for
>> another CPU. Which is also why the "napi" prefix in the percpu
>> napi_alloc_cache sounds a bit misleading to me.
> 
> FWIW the skb recycling is called napi_* to hint to driver authors that
> if they are in NAPI context this is a better function to call.

Which is absolutely reasonable; napi_skb_cache_put(), on the
other hand, is rather internal and wouldn't be used by drivers
directly.
I guess I'll just do a little bit of renaming later, hopefully after
this patch is taken in, unless there are other comments / objections.

> The connection to a particular NAPI instance matters only for the page
> pool recycling, but that's handled. The conditions you actually
> need to look out for are hardware IRQs and whatever async paths which
> can trigger while NAPI is halfway thru touching the cache of
> the local CPU.

Yeah, the usual percpu (bh-protected) caching stuff and the like, and
the question is rather about the semantics.

To draw a conclusion, do you suggest changing anything
in the patch?

>> The second reason is that it shouldn't change anything
>> performance-wise:
>>
>> napi_pp_put_page(napi_safe) {
>>       ...
>>       if (napi_safe || in_softirq()) { ... }
>> }
Jakub Kicinski Feb. 14, 2024, 4:49 p.m. UTC | #6
On Wed, 14 Feb 2024 16:37:41 +0000 Pavel Begunkov wrote:
> To draw a conclusion, do you suggest changing anything
> in the patch?

Yes, Eric's suggestion of __napi_kfree_skb() is correct.

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9b790994da0c..f32f358ef1d8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6947,6 +6947,20 @@  void __skb_ext_put(struct skb_ext *ext)
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+static void kfree_skb_napi_cache(struct sk_buff *skb)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	local_bh_disable();
+	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
+	napi_skb_cache_put(skb);
+	local_bh_enable();
+}
+
 /**
  * skb_attempt_defer_free - queue skb for remote freeing
  * @skb: buffer
@@ -6965,7 +6979,7 @@  void skb_attempt_defer_free(struct sk_buff *skb)
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
 	    cpu == raw_smp_processor_id()) {
-nodefer:	__kfree_skb(skb);
+nodefer:	kfree_skb_napi_cache(skb);
 		return;
 	}
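
Given that verdict, a v3 would presumably fold Eric's variant in; a
sketch of what kfree_skb_napi_cache() would then look like (an
extrapolation from the review, not the posted follow-up):

static void kfree_skb_napi_cache(struct sk_buff *skb)
{
        /* if SKB is a clone, don't handle this case */
        if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
                __kfree_skb(skb);
                return;
        }

        local_bh_disable();
        __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
        local_bh_enable();
}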