
[net-next,v2] net: cache for same cpu skb_attempt_defer_free

Message ID 1a4e901d6ecb9b091888c4d92256fa4a56cb83a4.1710715791.git.asml.silence@gmail.com (mailing list archive)
State Deferred
Delegated to: Netdev Maintainers
Series [net-next,v2] net: cache for same cpu skb_attempt_defer_free

Checks

Context Check Description
netdev/series_format success Single patches do not need cover letters
netdev/tree_selection success Clearly marked for net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 941 this patch: 941
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 4 of 4 maintainers
netdev/build_clang success Errors and warnings before: 957 this patch: 957
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 958 this patch: 958
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 27 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-03-20--03-00 (tests: 908)

Commit Message

Pavel Begunkov March 18, 2024, 12:44 a.m. UTC
Optimise skb_attempt_defer_free() when it runs on the same CPU the skb
was allocated on. Instead of __kfree_skb() -> kmem_cache_free(), we can
disable softirqs and put the buffer into the CPU-local caches.

CPU-bound TCP ping-pong style benchmarking (i.e. netbench) showed a 1%
throughput increase (392.2 -> 396.4 Krps). Cross-checking with profiles,
the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
I'd expect the win to double with rx-only benchmarks, as the optimisation
is for the receive path, but the test spends >55% of its CPU time doing writes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---

v2: pass @napi_safe=true by using __napi_kfree_skb()

 net/core/skbuff.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)
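
For context, the CPU-local caches referred to in the commit message are the
per-CPU napi_alloc_cache in net/core/skbuff.c, which the NAPI skb allocation
helpers pull from and which __napi_kfree_skb() frees into. A simplified,
hedged sketch of the put side (constants, the KASAN hooks, and the
skbuff_cache name, formerly skbuff_head_cache, vary across kernel versions):

#define NAPI_SKB_CACHE_SIZE	64
#define NAPI_SKB_CACHE_HALF	(NAPI_SKB_CACHE_SIZE / 2)

struct napi_alloc_cache {
	unsigned int skb_count;
	void *skb_cache[NAPI_SKB_CACHE_SIZE];
	/* page frag caches omitted */
};

static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);

static void napi_skb_cache_put(struct sk_buff *skb)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	/* Stash the skb for reuse by a later NAPI allocation on this CPU
	 * instead of returning it to the slab allocator straight away.
	 */
	nc->skb_cache[nc->skb_count++] = skb;

	if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
		/* Cache full: bulk-free the older half back to the slab. */
		kmem_cache_free_bulk(skbuff_cache, NAPI_SKB_CACHE_HALF,
				     nc->skb_cache + NAPI_SKB_CACHE_HALF);
		nc->skb_count = NAPI_SKB_CACHE_HALF;
	}
}

Because this cache is per-CPU and otherwise only touched from BH context,
feeding it from process context requires the local_bh_disable()/
local_bh_enable() pair the patch adds.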

Comments

Jason Xing March 18, 2024, 2:44 a.m. UTC | #1
On Mon, Mar 18, 2024 at 8:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,

I'm not sure we can stably gain this improvement. The reason I
ask is that it might be caused by some factor of chance.

> I'd expect the win doubled with rx only benchmarks, as the optimisation
> is for the receive path, but the test spends >55% of CPU doing writes.

I wonder how you did this test? Could you tell us more, please.

>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v2: pass @napi_safe=true by using __napi_kfree_skb()
>
>  net/core/skbuff.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b99127712e67..35d37ae70a3d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
>
> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> +{
> +       /* if SKB is a clone, don't handle this case */
> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> +               __kfree_skb(skb);
> +               return;
> +       }
> +
> +       local_bh_disable();
> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);

__napi_kfree_skb() doesn't care much about why we drop in the rx path,
I think. How about replacing it with SKB_CONSUMED like
napi_skb_finish() does?

Thanks,
Jason

> +       local_bh_enable();
> +}
> +
>  /**
>   * skb_attempt_defer_free - queue skb for remote freeing
>   * @skb: buffer
> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>         if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>             !cpu_online(cpu) ||
>             cpu == raw_smp_processor_id()) {
> -nodefer:       __kfree_skb(skb);
> +nodefer:       kfree_skb_napi_cache(skb);
>                 return;
>         }
>
> --
> 2.44.0
>
>
Pavel Begunkov March 18, 2024, 3:20 a.m. UTC | #2
On 3/18/24 02:44, Jason Xing wrote:
> On Mon, Mar 18, 2024 at 8:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.
>>
>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
> 
> I'm not sure we can stably gain this improvement. The reason I
> ask is that it might be caused by some factor of chance.

I guess the kernel might even be booting only by some factor
of chance, but no, the testing was quite stable :)

>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>> is for the receive path, but the test spends >55% of CPU doing writes.
> 
> I wonder how you did this test? Could you tell us more, please.

Well, the way I did it: choose a NIC, redirect all its IRQs
to a single CPU of your choice, taskset your rx side to that
CPU. You might want to make sure there is not much traffic
apart from the test program, but I was lucky to have 2 NICs
in the system. Instead of IRQs you can also try to configure
steering.

I used netbench [1] for both rx and tx, but that shouldn't be
important as long as you drive enough traffic; you could try
iperf or something else.

[1] https://github.com/DylanZA/netbench


>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>
>>   net/core/skbuff.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index b99127712e67..35d37ae70a3d 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>   EXPORT_SYMBOL(__skb_ext_put);
>>   #endif /* CONFIG_SKB_EXTENSIONS */
>>
>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>> +{
>> +       /* if SKB is a clone, don't handle this case */
>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>> +               __kfree_skb(skb);
>> +               return;
>> +       }
>> +
>> +       local_bh_disable();
>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> 
> __napi_kfree_skb() doesn't care much about why we drop in the rx path,
> I think. How about replacing it with SKB_CONSUMED like
> napi_skb_finish() does?

I'm sticking with the current behaviour here: __kfree_skb(), which
this replaces, passes SKB_DROP_REASON_NOT_SPECIFIED.

I can make the change if the maintainers ack it, but it would
probably be better done in a separate patch with a proper
explanation.
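
For reference, the behaviour Jason is pointing at: when GRO has merged a
packet's payload and only the sk_buff shell is left to discard, the free is
accounted as a consume rather than a drop. A hedged excerpt of that branch,
loosely following napi_skb_finish() (net/core/gro.c in recent kernels); the
wrapper function name below is made up for illustration:

static void gro_merged_free_example(struct sk_buff *skb)
{
	if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
		napi_skb_free_stolen_head(skb);
	else if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
		__kfree_skb(skb);
	else
		/* Data lives on in the merged skb: consumed, not dropped. */
		__napi_kfree_skb(skb, SKB_CONSUMED);
}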
Eric Dumazet March 18, 2024, 10:11 a.m. UTC | #3
On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> disable softirqs and put the buffer into cpu local caches.
>
> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
> I'd expect the win doubled with rx only benchmarks, as the optimisation
> is for the receive path, but the test spends >55% of CPU doing writes.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>
> v2: pass @napi_safe=true by using __napi_kfree_skb()
>
>  net/core/skbuff.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b99127712e67..35d37ae70a3d 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>  EXPORT_SYMBOL(__skb_ext_put);
>  #endif /* CONFIG_SKB_EXTENSIONS */
>
> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> +{
> +       /* if SKB is a clone, don't handle this case */
> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> +               __kfree_skb(skb);
> +               return;
> +       }
> +
> +       local_bh_disable();
> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> +       local_bh_enable();
> +}
> +
>  /**
>   * skb_attempt_defer_free - queue skb for remote freeing
>   * @skb: buffer
> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>         if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>             !cpu_online(cpu) ||
>             cpu == raw_smp_processor_id()) {
> -nodefer:       __kfree_skb(skb);
> +nodefer:       kfree_skb_napi_cache(skb);
>                 return;
>         }
>
> --
> 2.44.0
>

1) net-next is currently closed.
2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
correct node.
3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(), I wonder
why caching the sk_buff in this particular path is needed.

Why not change __kfree_skb() instead?

All these helpers are becoming a maze.
Pavel Begunkov March 18, 2024, 11:41 a.m. UTC | #4
On 3/18/24 10:11, Eric Dumazet wrote:
> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>> disable softirqs and put the buffer into cpu local caches.
>>
>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>> is for the receive path, but the test spends >55% of CPU doing writes.
>>
>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>> ---
>>
>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>
>>   net/core/skbuff.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index b99127712e67..35d37ae70a3d 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>   EXPORT_SYMBOL(__skb_ext_put);
>>   #endif /* CONFIG_SKB_EXTENSIONS */
>>
>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>> +{
>> +       /* if SKB is a clone, don't handle this case */
>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>> +               __kfree_skb(skb);
>> +               return;
>> +       }
>> +
>> +       local_bh_disable();
>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>> +       local_bh_enable();
>> +}
>> +
>>   /**
>>    * skb_attempt_defer_free - queue skb for remote freeing
>>    * @skb: buffer
>> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>              !cpu_online(cpu) ||
>>              cpu == raw_smp_processor_id()) {
>> -nodefer:       __kfree_skb(skb);
>> +nodefer:       kfree_skb_napi_cache(skb);
>>                  return;
>>          }
>>
>> --
>> 2.44.0
>>
> 
> 1) net-next is currently closed.

Ok

> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
> correct node.

Let me see if I read you right. You're saying that SLUB can
allocate an skb from a different node, so skb->alloc_cpu might
not be indicative of the node, and so we might locally cache an
skb from a foreign NUMA node?

If that's the case I don't see how it's different from the
cpu != raw_smp_processor_id() path, which will transfer the
skb to another cpu and still put it in the local cache in
softirq.


> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
> why trying to cache the sk_buff in this particular path is needed.
> 
> Why not change __kfree_skb() instead ?

IIRC kfree_skb() can be called from any context, including with irqs
off; it's convenient to have a function that just does the job without
too much extra care. Theoretically it could have a separate path
inside based on irqs_disabled(), but that would be ugly.
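
For illustration, a hypothetical sketch (not part of this patch) of what
folding the cache into __kfree_skb() with such a context check might look
like; the per-free in_hardirq()/irqs_disabled() test is the part being
called ugly:

void __kfree_skb(struct sk_buff *skb)
{
	/* Hypothetical, NOT the submitted patch: the per-CPU napi cache may
	 * only be touched with BH disabled, so every free would need a
	 * context check before taking the cached path.
	 */
	if (!in_hardirq() && !irqs_disabled() &&
	    skb->fclone == SKB_FCLONE_UNAVAILABLE) {
		local_bh_disable();
		__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
		local_bh_enable();
		return;
	}

	skb_release_all(skb, SKB_DROP_REASON_NOT_SPECIFIED, false);
	kfree_skbmem(skb);
}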
Pavel Begunkov March 18, 2024, 2:14 p.m. UTC | #5
On 3/18/24 11:41, Pavel Begunkov wrote:
> On 3/18/24 10:11, Eric Dumazet wrote:
>> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>
>>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>>> disable softirqs and put the buffer into cpu local caches.
>>>
>>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>>> is for the receive path, but the test spends >55% of CPU doing writes.
>>>
>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>> ---
>>>
>>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>>
>>>   net/core/skbuff.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index b99127712e67..35d37ae70a3d 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>>   EXPORT_SYMBOL(__skb_ext_put);
>>>   #endif /* CONFIG_SKB_EXTENSIONS */
>>>
>>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>>> +{
>>> +       /* if SKB is a clone, don't handle this case */
>>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>>> +               __kfree_skb(skb);
>>> +               return;
>>> +       }
>>> +
>>> +       local_bh_disable();
>>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>> +       local_bh_enable();
>>> +}
>>> +
>>>   /**
>>>    * skb_attempt_defer_free - queue skb for remote freeing
>>>    * @skb: buffer
>>> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>>              !cpu_online(cpu) ||
>>>              cpu == raw_smp_processor_id()) {
>>> -nodefer:       __kfree_skb(skb);
>>> +nodefer:       kfree_skb_napi_cache(skb);
>>>                  return;
>>>          }
>>>
>>> -- 
>>> 2.44.0
>>>
>>
>> 1) net-next is currently closed.
> 
> Ok
> 
>> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
>> correct node.
> 
> Let me see if I read you right. You're saying that SLUB can
> allocate an skb from a different node, so skb->alloc_cpu
> might be not indicative of the node, and so we might locally
> cache an skb of a foreign numa node?
> 
> If that's the case I don't see how it's different from the
> cpu != raw_smp_processor_id() path, which will transfer the
> skb to another cpu and still put it in the local cache in
> softirq.
> 
> 
>> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
>> why trying to cache the sk_buff in this particular path is needed.
>>
>> Why not change __kfree_skb() instead ?
> 
> IIRC kfree_skb() can be called from any context including irqoff,

On the other hand, there is napi_pp_put_page() inside, which
assumes we are not in hardirq, so maybe not; I'd appreciate it if
someone could clarify. I was sure that drivers allocating in hardirq
via e.g. __netdev_alloc_skb() might want to use kfree_skb(), but
maybe they consistently stick to dev_kfree_skb*().


> it's convenient to have a function that just does the job without
> too much of extra care. Theoretically it can have a separate path
> inside based on irqs_disabled(), but that would be ugly.
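
For reference on the dev_kfree_skb*() helpers mentioned above, this is
roughly the context dispatch they perform (a hedged sketch loosely following
net/core/dev.c; exact function names and the reason argument have changed
across kernel versions):

void dev_kfree_skb_any_reason(struct sk_buff *skb, enum skb_drop_reason reason)
{
	/* From hardirq or irqs-off contexts the skb is queued on a per-CPU
	 * completion list and freed later from the NET_TX softirq; otherwise
	 * it is freed inline.
	 */
	if (in_hardirq() || irqs_disabled())
		dev_kfree_skb_irq_reason(skb, reason);
	else
		kfree_skb_reason(skb, reason);
}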
Eric Dumazet March 18, 2024, 2:33 p.m. UTC | #6
On Mon, Mar 18, 2024 at 12:43 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 3/18/24 10:11, Eric Dumazet wrote:
> > On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
> >> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
> >> disable softirqs and put the buffer into cpu local caches.
> >>
> >> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
> >> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
> >> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
> >> I'd expect the win doubled with rx only benchmarks, as the optimisation
> >> is for the receive path, but the test spends >55% of CPU doing writes.
> >>
> >> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> >> ---
> >>
> >> v2: pass @napi_safe=true by using __napi_kfree_skb()
> >>
> >>   net/core/skbuff.c | 15 ++++++++++++++-
> >>   1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index b99127712e67..35d37ae70a3d 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
> >>   EXPORT_SYMBOL(__skb_ext_put);
> >>   #endif /* CONFIG_SKB_EXTENSIONS */
> >>
> >> +static void kfree_skb_napi_cache(struct sk_buff *skb)
> >> +{
> >> +       /* if SKB is a clone, don't handle this case */
> >> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
> >> +               __kfree_skb(skb);
> >> +               return;
> >> +       }
> >> +
> >> +       local_bh_disable();
> >> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
> >> +       local_bh_enable();
> >> +}
> >> +
> >>   /**
> >>    * skb_attempt_defer_free - queue skb for remote freeing
> >>    * @skb: buffer
> >> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
> >>          if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
> >>              !cpu_online(cpu) ||
> >>              cpu == raw_smp_processor_id()) {
> >> -nodefer:       __kfree_skb(skb);
> >> +nodefer:       kfree_skb_napi_cache(skb);
> >>                  return;
> >>          }
> >>
> >> --
> >> 2.44.0
> >>
> >
> > 1) net-next is currently closed.
>
> Ok
>
> > 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
> > correct node.
>
> Let me see if I read you right. You're saying that SLUB can
> allocate an skb from a different node, so skb->alloc_cpu
> might be not indicative of the node, and so we might locally
> cache an skb of a foreign numa node?
>
> If that's the case I don't see how it's different from the
> cpu != raw_smp_processor_id() path, which will transfer the
> skb to another cpu and still put it in the local cache in
> softirq.

The big win for skb_attempt_defer_free() is for the many pages that are attached
to TCP incoming GRO packets.

A typical GRO packet can have 45 page fragments.

Pages are not recycled by a NIC if the NUMA node does not match.

If the win were only for the sk_buff itself, I think we should have
asked the SLUB maintainers why another cache is needed to optimize
the SLUB fast cache!


>
>
> > 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
> > why trying to cache the sk_buff in this particular path is needed.
> >
> > Why not change __kfree_skb() instead ?
>
> IIRC kfree_skb() can be called from any context including irqoff,
> it's convenient to have a function that just does the job without
> too much of extra care. Theoretically it can have a separate path
> inside based on irqs_disabled(), but that would be ugly.

Well, adding one case here, another there, and yet another elsewhere is also ugly.
Pavel Begunkov March 18, 2024, 11:12 p.m. UTC | #7
On 3/18/24 14:33, Eric Dumazet wrote:
> On Mon, Mar 18, 2024 at 12:43 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 3/18/24 10:11, Eric Dumazet wrote:
>>> On Mon, Mar 18, 2024 at 1:46 AM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>>>
>>>> Optimise skb_attempt_defer_free() when run by the same CPU the skb was
>>>> allocated on. Instead of __kfree_skb() -> kmem_cache_free() we can
>>>> disable softirqs and put the buffer into cpu local caches.
>>>>
>>>> CPU bound TCP ping pong style benchmarking (i.e. netbench) showed a 1%
>>>> throughput increase (392.2 -> 396.4 Krps). Cross checking with profiles,
>>>> the total CPU share of skb_attempt_defer_free() dropped by 0.6%. Note,
>>>> I'd expect the win doubled with rx only benchmarks, as the optimisation
>>>> is for the receive path, but the test spends >55% of CPU doing writes.
>>>>
>>>> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
>>>> ---
>>>>
>>>> v2: pass @napi_safe=true by using __napi_kfree_skb()
>>>>
>>>>    net/core/skbuff.c | 15 ++++++++++++++-
>>>>    1 file changed, 14 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>> index b99127712e67..35d37ae70a3d 100644
>>>> --- a/net/core/skbuff.c
>>>> +++ b/net/core/skbuff.c
>>>> @@ -6995,6 +6995,19 @@ void __skb_ext_put(struct skb_ext *ext)
>>>>    EXPORT_SYMBOL(__skb_ext_put);
>>>>    #endif /* CONFIG_SKB_EXTENSIONS */
>>>>
>>>> +static void kfree_skb_napi_cache(struct sk_buff *skb)
>>>> +{
>>>> +       /* if SKB is a clone, don't handle this case */
>>>> +       if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
>>>> +               __kfree_skb(skb);
>>>> +               return;
>>>> +       }
>>>> +
>>>> +       local_bh_disable();
>>>> +       __napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
>>>> +       local_bh_enable();
>>>> +}
>>>> +
>>>>    /**
>>>>     * skb_attempt_defer_free - queue skb for remote freeing
>>>>     * @skb: buffer
>>>> @@ -7013,7 +7026,7 @@ void skb_attempt_defer_free(struct sk_buff *skb)
>>>>           if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
>>>>               !cpu_online(cpu) ||
>>>>               cpu == raw_smp_processor_id()) {
>>>> -nodefer:       __kfree_skb(skb);
>>>> +nodefer:       kfree_skb_napi_cache(skb);
>>>>                   return;
>>>>           }
>>>>
>>>> --
>>>> 2.44.0
>>>>
>>>
>>> 1) net-next is currently closed.
>>
>> Ok
>>
>>> 2) No NUMA awareness. SLUB does not guarantee the sk_buff was on the
>>> correct node.
>>
>> Let me see if I read you right. You're saying that SLUB can
>> allocate an skb from a different node, so skb->alloc_cpu
>> might be not indicative of the node, and so we might locally
>> cache an skb of a foreign numa node?
>>
>> If that's the case I don't see how it's different from the
>> cpu != raw_smp_processor_id() path, which will transfer the
>> skb to another cpu and still put it in the local cache in
>> softirq.
> 
> The big win for skb_attempt_defer_free() is for the many pages that are attached
> to TCP incoming GRO packets.
> 
> A typical GRO packet can have 45 page fragments.

That's great, and probably the dominating optimisation for
GRO; however, it's a path that all TCP reads would go through,
large GRO'ed skbs or not. Even if there is only a single frag, the
patch at least enables the skb cache and gives the frag a
chance to hit the pp's lockless cache.

> Pages are not recycled by a NIC if the NUMA node does not match.

Great, looking it up, it seems that's only true for pages ending
up in the pp's ptr_ring, not for those recycled directly.

> If the win was only for sk_buff itself, I think we should have asked
> SLUB maintainers
> why SLUB needs another cache to optimize SLUB fast cache !

Well, it's just slow for hot paths; not awfully slow, but enough
to be noticeable and take 1-2% per allocation per request. There
are caches in io_uring because it was slow, there are caches in
the block layer, and there are skb and other caches in net, I
assume all for the same reasons. It's not like I'm adding a new
one.

I remember someone was poking into similar questions, but I
haven't seen any drastic SLUB performance changes.

Eric, I'm seriously getting lost in the arguments and don't
really see what you're ultimately against. Summing up topics:

1) Is there a NUMA awareness problem in the patch that I missed?
If so, can you elaborate? I'll try to address it.
Is it a regression compared to the current state, or something
that would just be nice to have?

2) Do you trust the numbers for the test described? I.e. the
rx server has multiple connections, each doing small-packet
(<100B) ping-pong style traffic. It's pinned to a CPU, and all
completions are ensured to run on that CPU.

3) Do you agree that it's a valid use case? Or are you shrugging
it off as something that nobody cares about?

4) Maybe you're against the design?


>>> 3) Given that many skbs (like TCP ACK) are freed using __kfree_skb(),  I wonder
>>> why trying to cache the sk_buff in this particular path is needed.
>>>
>>> Why not change __kfree_skb() instead ?
>>
>> IIRC kfree_skb() can be called from any context including irqoff,
>> it's convenient to have a function that just does the job without
>> too much of extra care. Theoretically it can have a separate path
>> inside based on irqs_disabled(), but that would be ugly.
> 
> Well, adding one case here, another here, and another here is also ugly.

And FWIW, irqs_disabled() is not great for hot paths. Not to the
extent of irq_disable(), but it still trashes the pipeline a bit,
IIRC.
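
On the NUMA recycling point from this sub-thread (pages are only NUMA-checked
when pulled back out of the pp's ptr_ring, not when recycled directly into the
lockless cache), a hedged sketch of the relevant refill logic, loosely
following page_pool_refill_alloc_cache() in net/core/page_pool.c (stats
accounting dropped; details vary by kernel version):

static struct page *page_pool_refill_alloc_cache(struct page_pool *pool)
{
	struct ptr_ring *r = &pool->ring;
	struct page *page;
	int pref_nid;

	if (__ptr_ring_empty(r))
		return NULL;

	/* Preferred NUMA node for pages kept in the per-CPU alloc cache. */
	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;

	do {
		page = __ptr_ring_consume(r);
		if (unlikely(!page))
			break;

		if (likely(page_to_nid(page) == pref_nid)) {
			pool->alloc.cache[pool->alloc.count++] = page;
		} else {
			/* NUMA mismatch: release the page back to the page
			 * allocator instead of caching it locally.
			 */
			page_pool_return_page(pool, page);
			page = NULL;
			break;
		}
	} while (pool->alloc.count < PP_ALLOC_CACHE_REFILL);

	/* Hand the last refilled page to the caller, if any. */
	if (likely(pool->alloc.count > 0))
		page = pool->alloc.cache[--pool->alloc.count];

	return page;
}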

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b99127712e67..35d37ae70a3d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -6995,6 +6995,19 @@  void __skb_ext_put(struct skb_ext *ext)
 EXPORT_SYMBOL(__skb_ext_put);
 #endif /* CONFIG_SKB_EXTENSIONS */
 
+static void kfree_skb_napi_cache(struct sk_buff *skb)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (skb->fclone != SKB_FCLONE_UNAVAILABLE) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	local_bh_disable();
+	__napi_kfree_skb(skb, SKB_DROP_REASON_NOT_SPECIFIED);
+	local_bh_enable();
+}
+
 /**
  * skb_attempt_defer_free - queue skb for remote freeing
  * @skb: buffer
@@ -7013,7 +7026,7 @@  void skb_attempt_defer_free(struct sk_buff *skb)
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu) ||
 	    cpu == raw_smp_processor_id()) {
-nodefer:	__kfree_skb(skb);
+nodefer:	kfree_skb_napi_cache(skb);
 		return;
 	}