diff mbox series

net: skbuff: allocate the fclone in the current NUMA node

Message ID 20240220021804.9541-1-shijie@os.amperecomputing.com (mailing list archive)
State Changes Requested
Headers show
Series net: skbuff: allocate the fclone in the current NUMA node | expand

Checks

Context Check Description
netdev/series_format warning Single patches do not need cover letters; Target tree name not specified in the subject
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5948 this patch: 5948
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 2 of 2 maintainers
netdev/build_clang success Errors and warnings before: 2074 this patch: 2074
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 6264 this patch: 6264
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-02-21--12-00 (tests: 1454)

Commit Message

Huang Shijie Feb. 20, 2024, 2:18 a.m. UTC
The current code passes NUMA_NO_NODE to __alloc_skb(), we found
it may creates fclone SKB in remote NUMA node.

So use numa_node_id() to limit the allocation to current NUMA node.

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 include/linux/skbuff.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet Feb. 20, 2024, 5:32 a.m. UTC | #1
On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
<shijie@os.amperecomputing.com> wrote:
>
> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> it may creates fclone SKB in remote NUMA node.

This is intended (WAI)

What about the NUMA policies of the current thread ?

Has NUMA_NO_NODE behavior changed recently?

What means : "it may creates" ? Please be more specific.

>
> So use numa_node_id() to limit the allocation to current NUMA node.

We prefer the allocation to succeed, instead of failing if the current
NUMA node has no available memory.

Please check:

grep . /sys/devices/system/node/node*/numastat

Are you going to change ~700 uses of  NUMA_NO_NODE in the kernel ?

Just curious.

>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
>  include/linux/skbuff.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2dde34c29203..ebc42b2604ad 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1343,7 +1343,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>                                                gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
> +       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());
>  }
>
>  struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
> --
> 2.40.1
>
Shijie Huang Feb. 20, 2024, 6:26 a.m. UTC | #2
在 2024/2/20 13:32, Eric Dumazet 写道:
> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> <shijie@os.amperecomputing.com> wrote:
>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
>> it may creates fclone SKB in remote NUMA node.
> This is intended (WAI)

Okay. thanks a lot.

It seems I should fix the issue in other code, not the networking.

>
> What about the NUMA policies of the current thread ?

We use "numactl -m 0" for memcached, the NUMA policy should allocate 
fclone in

node 0, but we can see many fclones were allocated in node 1.

We have enough memory to allocate these fclones in node 0.

>
> Has NUMA_NO_NODE behavior changed recently?
I guess not.
>
> What means : "it may creates" ? Please be more specific.

When we use the memcached for testing in NUMA, there are maybe 20% ~ 30% 
fclones were allocated in

remote NUMA node.

After this patch, all the fclones are allocated correctly.


>> So use numa_node_id() to limit the allocation to current NUMA node.
> We prefer the allocation to succeed, instead of failing if the current
> NUMA node has no available memory.

Got it.


Thanks

Huang Shijie

>
> Please check:
>
> grep . /sys/devices/system/node/node*/numastat
>
> Are you going to change ~700 uses of  NUMA_NO_NODE in the kernel ?
>
> Just curious.
>
>> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
>> ---
>>   include/linux/skbuff.h | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 2dde34c29203..ebc42b2604ad 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -1343,7 +1343,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>>   static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>>                                                 gfp_t priority)
>>   {
>> -       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
>> +       return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());
>>   }
>>
>>   struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
>> --
>> 2.40.1
>>
Eric Dumazet Feb. 20, 2024, 8:17 a.m. UTC | #3
On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
>
> 在 2024/2/20 13:32, Eric Dumazet 写道:
> > On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> > <shijie@os.amperecomputing.com> wrote:
> >> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> >> it may creates fclone SKB in remote NUMA node.
> > This is intended (WAI)
>
> Okay. thanks a lot.
>
> It seems I should fix the issue in other code, not the networking.
>
> >
> > What about the NUMA policies of the current thread ?
>
> We use "numactl -m 0" for memcached, the NUMA policy should allocate
> fclone in
>
> node 0, but we can see many fclones were allocated in node 1.
>
> We have enough memory to allocate these fclones in node 0.
>
> >
> > Has NUMA_NO_NODE behavior changed recently?
> I guess not.
> >
> > What means : "it may creates" ? Please be more specific.
>
> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
> fclones were allocated in
>
> remote NUMA node.

Interesting, how was it measured exactly ?
Are you using SLUB or SLAB ?

>
> After this patch, all the fclones are allocated correctly.

Note that skbs for TCP have three memory components (or more for large packets)

sk_buff
skb->head
page frags (see sk_page_frag_refill() for non zero copy payload)

The payload should be following NUMA policy of current thread, that is
really what matters.
Shijie Huang Feb. 20, 2024, 8:37 a.m. UTC | #4
在 2024/2/20 16:17, Eric Dumazet 写道:
> On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>> 在 2024/2/20 13:32, Eric Dumazet 写道:
>>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
>>> <shijie@os.amperecomputing.com> wrote:
>>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
>>>> it may creates fclone SKB in remote NUMA node.
>>> This is intended (WAI)
>> Okay. thanks a lot.
>>
>> It seems I should fix the issue in other code, not the networking.
>>
>>> What about the NUMA policies of the current thread ?
>> We use "numactl -m 0" for memcached, the NUMA policy should allocate
>> fclone in
>>
>> node 0, but we can see many fclones were allocated in node 1.
>>
>> We have enough memory to allocate these fclones in node 0.
>>
>>> Has NUMA_NO_NODE behavior changed recently?
>> I guess not.
>>> What means : "it may creates" ? Please be more specific.
>> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
>> fclones were allocated in
>>
>> remote NUMA node.
> Interesting, how was it measured exactly ?

I created a private patch to record the status for each fclone allocation.


> Are you using SLUB or SLAB ?

I think I use SLUB. (CONFIG_SLUB=y, 
CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)


Thanks

Huang Shijie
Eric Dumazet Feb. 24, 2024, 7:07 p.m. UTC | #5
On Tue, Feb 20, 2024 at 9:37 AM Shijie Huang
<shijie@amperemail.onmicrosoft.com> wrote:
>
>
> 在 2024/2/20 16:17, Eric Dumazet 写道:
> > On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
> > <shijie@amperemail.onmicrosoft.com> wrote:
> >>
> >> 在 2024/2/20 13:32, Eric Dumazet 写道:
> >>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> >>> <shijie@os.amperecomputing.com> wrote:
> >>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> >>>> it may creates fclone SKB in remote NUMA node.
> >>> This is intended (WAI)
> >> Okay. thanks a lot.
> >>
> >> It seems I should fix the issue in other code, not the networking.
> >>
> >>> What about the NUMA policies of the current thread ?
> >> We use "numactl -m 0" for memcached, the NUMA policy should allocate
> >> fclone in
> >>
> >> node 0, but we can see many fclones were allocated in node 1.
> >>
> >> We have enough memory to allocate these fclones in node 0.
> >>
> >>> Has NUMA_NO_NODE behavior changed recently?
> >> I guess not.
> >>> What means : "it may creates" ? Please be more specific.
> >> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
> >> fclones were allocated in
> >>
> >> remote NUMA node.
> > Interesting, how was it measured exactly ?
>
> I created a private patch to record the status for each fclone allocation.
>
>
> > Are you using SLUB or SLAB ?
>
> I think I use SLUB. (CONFIG_SLUB=y,
> CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)
>

A similar issue comes from tx_action() calling __napi_kfree_skb() on
arbitrary skbs
including ones that were allocated on a different NUMA node.

This pollutes per-cpu caches with not optimally placed sk_buff :/

Although this should not impact fclones, __napi_kfree_skb() only ?

commit 15fad714be86eab13e7568fecaf475b2a9730d3e
Author: Jesper Dangaard Brouer <brouer@redhat.com>
Date:   Mon Feb 8 13:15:04 2016 +0100

    net: bulk free SKBs that were delay free'ed due to IRQ context

What about :

diff --git a/net/core/dev.c b/net/core/dev.c
index c588808be77f563c429eb4a2eaee5c8062d99582..63165138c6f690e14520f11e32dc16f2845abad4
100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5162,11 +5162,7 @@ static __latent_entropy void
net_tx_action(struct softirq_action *h)
                                trace_kfree_skb(skb, net_tx_action,
                                                get_kfree_skb_cb(skb)->reason);

-                       if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
-                               __kfree_skb(skb);
-                       else
-                               __napi_kfree_skb(skb,
-                                                get_kfree_skb_cb(skb)->reason);
+                       __kfree_skb(skb);
                }
        }
Alexander Lobakin Feb. 26, 2024, 10:10 a.m. UTC | #6
From: Huang Shijie <shijie@os.amperecomputing.com>
Date: Tue, 20 Feb 2024 10:18:04 +0800

> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> it may creates fclone SKB in remote NUMA node.
> 
> So use numa_node_id() to limit the allocation to current NUMA node.
> 
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
>  include/linux/skbuff.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2dde34c29203..ebc42b2604ad 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1343,7 +1343,7 @@ static inline bool skb_fclone_busy(const struct sock *sk,
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>  					       gfp_t priority)
>  {
> -	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
> +	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());

Because it tries to defragment the memory and pick an optimal node.

__alloc_skb() and skb clones aren't anyway something very hotpathish, do
you have any particular perf numbers and/or usecases where %NUMA_NO_NODE
really hurts?

>  }
>  
>  struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);

Thanks,
Olek
Jesper Dangaard Brouer Feb. 26, 2024, 10:18 a.m. UTC | #7
On 24/02/2024 20.07, Eric Dumazet wrote:
> On Tue, Feb 20, 2024 at 9:37 AM Shijie Huang
> <shijie@amperemail.onmicrosoft.com> wrote:
>>
>>
>> 在 2024/2/20 16:17, Eric Dumazet 写道:
>>> On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
>>> <shijie@amperemail.onmicrosoft.com> wrote:
>>>>
>>>> 在 2024/2/20 13:32, Eric Dumazet 写道:
>>>>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
>>>>> <shijie@os.amperecomputing.com> wrote:
>>>>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
>>>>>> it may creates fclone SKB in remote NUMA node.
>>>>> This is intended (WAI)
>>>> Okay. thanks a lot.
>>>>
>>>> It seems I should fix the issue in other code, not the networking.
>>>>
>>>>> What about the NUMA policies of the current thread ?
>>>> We use "numactl -m 0" for memcached, the NUMA policy should allocate
>>>> fclone in
>>>>
>>>> node 0, but we can see many fclones were allocated in node 1.
>>>>
>>>> We have enough memory to allocate these fclones in node 0.
>>>>
>>>>> Has NUMA_NO_NODE behavior changed recently?
>>>> I guess not.
>>>>> What means : "it may creates" ? Please be more specific.
>>>> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
>>>> fclones were allocated in
>>>>
>>>> remote NUMA node.
>>> Interesting, how was it measured exactly ?
>>
>> I created a private patch to record the status for each fclone allocation.
>>
>>
>>> Are you using SLUB or SLAB ?
>>
>> I think I use SLUB. (CONFIG_SLUB=y,
>> CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)
>>
> 
> A similar issue comes from tx_action() calling __napi_kfree_skb() on
> arbitrary skbs
> including ones that were allocated on a different NUMA node.
> 
> This pollutes per-cpu caches with not optimally placed sk_buff :/
> 
> Although this should not impact fclones, __napi_kfree_skb() only ?
> 
> commit 15fad714be86eab13e7568fecaf475b2a9730d3e
> Author: Jesper Dangaard Brouer <brouer@redhat.com>
> Date:   Mon Feb 8 13:15:04 2016 +0100
> 
>      net: bulk free SKBs that were delay free'ed due to IRQ context
> 
> What about :
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index c588808be77f563c429eb4a2eaee5c8062d99582..63165138c6f690e14520f11e32dc16f2845abad4
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5162,11 +5162,7 @@ static __latent_entropy void
> net_tx_action(struct softirq_action *h)
>                                  trace_kfree_skb(skb, net_tx_action,
>                                                  get_kfree_skb_cb(skb)->reason);
> 
> -                       if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
> -                               __kfree_skb(skb);
> -                       else
> -                               __napi_kfree_skb(skb,
> -                                                get_kfree_skb_cb(skb)->reason);

Yes, I think it makes sense to avoid calling __napi_kfree_skb here.
The __napi_kfree_skb call will cache SKB slub-allocation (but "release"
data) on a per CPU napi_alloc_cache (see code napi_skb_cache_put()).
In net_tx_action() there is a chance this could originate from another
CPU or even NUMA node.  I notice this is only for SKBs on the
softnet_data->completion_queue, which have a high chance of being cache
cold.  My patch 15fad714be86e only made sense when we bulk freed these
SKBs, but after Olek's changes to cache freed SKBs, then this shouldn't
be calling __napi_kfree_skb() (previously named __kfree_skb_defer).

I support this RFC patch from Eric.

Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

> +                       __kfree_skb(skb);
>                  }
>          }
Eric Dumazet Feb. 26, 2024, 10:29 a.m. UTC | #8
On Mon, Feb 26, 2024 at 11:18 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 24/02/2024 20.07, Eric Dumazet wrote:
> > On Tue, Feb 20, 2024 at 9:37 AM Shijie Huang
> > <shijie@amperemail.onmicrosoft.com> wrote:
> >>
> >>
> >> 在 2024/2/20 16:17, Eric Dumazet 写道:
> >>> On Tue, Feb 20, 2024 at 7:26 AM Shijie Huang
> >>> <shijie@amperemail.onmicrosoft.com> wrote:
> >>>>
> >>>> 在 2024/2/20 13:32, Eric Dumazet 写道:
> >>>>> On Tue, Feb 20, 2024 at 3:18 AM Huang Shijie
> >>>>> <shijie@os.amperecomputing.com> wrote:
> >>>>>> The current code passes NUMA_NO_NODE to __alloc_skb(), we found
> >>>>>> it may creates fclone SKB in remote NUMA node.
> >>>>> This is intended (WAI)
> >>>> Okay. thanks a lot.
> >>>>
> >>>> It seems I should fix the issue in other code, not the networking.
> >>>>
> >>>>> What about the NUMA policies of the current thread ?
> >>>> We use "numactl -m 0" for memcached, the NUMA policy should allocate
> >>>> fclone in
> >>>>
> >>>> node 0, but we can see many fclones were allocated in node 1.
> >>>>
> >>>> We have enough memory to allocate these fclones in node 0.
> >>>>
> >>>>> Has NUMA_NO_NODE behavior changed recently?
> >>>> I guess not.
> >>>>> What means : "it may creates" ? Please be more specific.
> >>>> When we use the memcached for testing in NUMA, there are maybe 20% ~ 30%
> >>>> fclones were allocated in
> >>>>
> >>>> remote NUMA node.
> >>> Interesting, how was it measured exactly ?
> >>
> >> I created a private patch to record the status for each fclone allocation.
> >>
> >>
> >>> Are you using SLUB or SLAB ?
> >>
> >> I think I use SLUB. (CONFIG_SLUB=y,
> >> CONFIG_SLAB_MERGE_DEFAULT=y,CONFIG_SLUB_CPU_PARTIAL=y)
> >>
> >
> > A similar issue comes from tx_action() calling __napi_kfree_skb() on
> > arbitrary skbs
> > including ones that were allocated on a different NUMA node.
> >
> > This pollutes per-cpu caches with not optimally placed sk_buff :/
> >
> > Although this should not impact fclones, __napi_kfree_skb() only ?
> >
> > commit 15fad714be86eab13e7568fecaf475b2a9730d3e
> > Author: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date:   Mon Feb 8 13:15:04 2016 +0100
> >
> >      net: bulk free SKBs that were delay free'ed due to IRQ context
> >
> > What about :
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index c588808be77f563c429eb4a2eaee5c8062d99582..63165138c6f690e14520f11e32dc16f2845abad4
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -5162,11 +5162,7 @@ static __latent_entropy void
> > net_tx_action(struct softirq_action *h)
> >                                  trace_kfree_skb(skb, net_tx_action,
> >                                                  get_kfree_skb_cb(skb)->reason);
> >
> > -                       if (skb->fclone != SKB_FCLONE_UNAVAILABLE)
> > -                               __kfree_skb(skb);
> > -                       else
> > -                               __napi_kfree_skb(skb,
> > -                                                get_kfree_skb_cb(skb)->reason);
>
> Yes, I think it makes sense to avoid calling __napi_kfree_skb here.
> The __napi_kfree_skb call will cache SKB slub-allocation (but "release"
> data) on a per CPU napi_alloc_cache (see code napi_skb_cache_put()).
> In net_tx_action() there is a chance this could originate from another
> CPU or even NUMA node.  I notice this is only for SKBs on the
> softnet_data->completion_queue, which have a high chance of being cache
> cold.  My patch 15fad714be86e only made sense when we bulk freed these
> SKBs, but after Olek's changes to cache freed SKBs, then this shouldn't
> be calling __napi_kfree_skb() (previously named __kfree_skb_defer).
>
> I support this RFC patch from Eric.
>
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

Note that this should not matter for most NIC, because their drivers
perform TX completion from NAPI context, we do not hit this path.

It seems that switching to SLUB instead of SLAB has increased the chances
of getting memory from another node.

We probably need to investigate.
Shijie Huang Feb. 27, 2024, 6:30 a.m. UTC | #9
在 2024/2/26 18:10, Alexander Lobakin 写道:
> __alloc_skb() and skb clones aren't anyway something very hotpathish, do
> you have any particular perf numbers and/or usecases where %NUMA_NO_NODE
> really hurts?

 From the memcached test, I do not see really performance hurts.


Thanks

Huang Shijie
diff mbox series

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2dde34c29203..ebc42b2604ad 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1343,7 +1343,7 @@  static inline bool skb_fclone_busy(const struct sock *sk,
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, NUMA_NO_NODE);
+	return __alloc_skb(size, priority, SKB_ALLOC_FCLONE, numa_node_id());
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);