
[net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache

Message ID 169211265663.1491038.8580163757548985946.stgit@firesoul (mailing list archive)
State New

Commit Message

Jesper Dangaard Brouer Aug. 15, 2023, 3:17 p.m. UTC
Since the v6.5-rc1 MM-tree merge, which brought in the new flag
SLAB_NO_MERGE via commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache
flag SLAB_NO_MERGE"), now is the time to use this flag for networking,
as proposed earlier (see the Link below).

The SKB (sk_buff) kmem_cache slab is critical for network performance.
The network stack uses the kmem_cache_{alloc,free}_bulk APIs to gain
performance by amortising the alloc/free cost.

For the bulk API to perform efficiently, slab fragmentation needs to be
low. Especially for the SLUB allocator, the efficiency of the bulk free
API depends on objects belonging to the same slab (page).
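
As a rough, illustrative sketch of the shape of these APIs (the cache
and batch size below are hypothetical, not the actual skbuff_head_cache
usage in net/core/skbuff.c):

/* Illustrative sketch only: shows the shape of the bulk slab APIs.
 * Assumes example_cache was created earlier with kmem_cache_create();
 * the cache and batch size are made-up examples.
 */
#include <linux/slab.h>
#include <linux/errno.h>

#define EXAMPLE_BATCH	16

static struct kmem_cache *example_cache;	/* hypothetical cache */

static int example_bulk_roundtrip(void)
{
	void *objs[EXAMPLE_BATCH];
	int n;

	/* Grab up to EXAMPLE_BATCH objects in one call; returns the
	 * number actually allocated (0 on failure).
	 */
	n = kmem_cache_alloc_bulk(example_cache, GFP_KERNEL,
				  EXAMPLE_BATCH, objs);
	if (!n)
		return -ENOMEM;

	/* ... use the objects ... */

	/* Hand the whole batch back in one call. */
	kmem_cache_free_bulk(example_cache, n, objs);
	return 0;
}

The free side is cheapest when the whole batch comes from the same slab
(page), which is exactly what merging into a busy general-purpose cache
works against.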

When running different network performance microbenchmarks, I started
to notice that performance was (slightly) reduced when machines had
longer uptimes. I believe the cause was that 'skbuff_head_cache' got
aliased/merged into the general slab cache for 256-byte sized objects
(with my kernel config, without CONFIG_HARDENED_USERCOPY).

For the SKB kmem_cache, the network stack has various other reasons
for not merging, but they vary depending on the kernel config (e.g.
CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE for
this kmem_cache to get the most out of the kmem_cache_{alloc,free}_bulk
APIs.

When CONFIG_SLUB_TINY is configured, the bulk APIs are essentially
disabled. Thus, for this case, drop the SLAB_NO_MERGE flag.

Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 net/core/skbuff.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox Aug. 15, 2023, 3:53 p.m. UTC | #1
On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
> For the bulk API to perform efficiently, slab fragmentation needs to be
> low. Especially for the SLUB allocator, the efficiency of the bulk free
> API depends on objects belonging to the same slab (page).

Hey Jesper,

You probably haven't seen this patch series from Vlastimil:

https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/

I wonder if you'd like to give it a try?  It should provide some immunity
to this problem, and might even be faster than the current approach.
If it isn't, it'd be good to understand why, and if it could be improved.

No objection to this patch going in for now, of course.
Jesper Dangaard Brouer Aug. 18, 2023, 12:32 p.m. UTC | #2
On 15/08/2023 17.53, Matthew Wilcox wrote:
> On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
>> For the bulk API to perform efficiently, slab fragmentation needs to be
>> low. Especially for the SLUB allocator, the efficiency of the bulk free
>> API depends on objects belonging to the same slab (page).
> 
> Hey Jesper,
> 
> You probably haven't seen this patch series from Vlastimil:
> 
> https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/
> 
> I wonder if you'd like to give it a try?  It should provide some immunity
> to this problem, and might even be faster than the current approach.
> If it isn't, it'd be good to understand why, and if it could be improved.

I took a quick look at:
  - https://lore.kernel.org/linux-mm/20230810163627.6206-11-vbabka@suse.cz/#Z31mm:slub.c

To Vlastimil, sorry but I don't think this approach with spin_lock will 
be faster than SLUB's normal fast-path using this_cpu_cmpxchg.

My experience is that SLUB's this_cpu_cmpxchg trick is faster than a spin_lock.

On my testlab CPU E5-1650 v4 @ 3.60GHz:
  - spin_lock+unlock : 34 cycles(tsc) 9.485 ns
  - this_cpu_cmpxchg :  5 cycles(tsc) 1.585 ns
  - locked cmpxchg   : 18 cycles(tsc) 5.006 ns

SLUB does use a cmpxchg_double which I don't have a microbench for.
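
For reference, numbers like these can be gathered with a small
in-kernel loop along the following lines (a rough sketch, not the
actual benchmark module; loop count and function names are made up):

/* Rough sketch of a cycles-per-operation measurement, in the spirit of
 * the numbers above.  Not the real benchmark code; illustrative only.
 */
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/percpu.h>
#include <linux/math64.h>
#include <linux/timex.h>	/* get_cycles() */

#define BENCH_LOOPS	1000000UL

static DEFINE_SPINLOCK(bench_lock);
static DEFINE_PER_CPU(unsigned long, bench_pcp);

static u64 bench_spin_lock_unlock(void)
{
	u64 start = get_cycles();
	unsigned long i;

	for (i = 0; i < BENCH_LOOPS; i++) {
		spin_lock(&bench_lock);
		spin_unlock(&bench_lock);
	}
	return div64_u64(get_cycles() - start, BENCH_LOOPS);
}

static u64 bench_this_cpu_cmpxchg(void)
{
	u64 start = get_cycles();
	unsigned long i;

	/* One this_cpu_cmpxchg per iteration; whether the compare
	 * succeeds is irrelevant, only the instruction cost matters.
	 */
	for (i = 0; i < BENCH_LOOPS; i++)
		this_cpu_cmpxchg(bench_pcp, i, i + 1);
	return div64_u64(get_cycles() - start, BENCH_LOOPS);
}

static int __init bench_init(void)
{
	pr_info("spin_lock+unlock : %llu cycles/op\n", bench_spin_lock_unlock());
	pr_info("this_cpu_cmpxchg : %llu cycles/op\n", bench_this_cpu_cmpxchg());
	return 0;
}
module_init(bench_init);
MODULE_LICENSE("GPL");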

> No objection to this patch going in for now, of course.
>
Vlastimil Babka Aug. 18, 2023, 3:15 p.m. UTC | #3
On 8/15/23 17:17, Jesper Dangaard Brouer wrote:
> Since the v6.5-rc1 MM-tree merge, which brought in the new flag
> SLAB_NO_MERGE via commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache
> flag SLAB_NO_MERGE"), now is the time to use this flag for networking,
> as proposed earlier (see the Link below).
> 
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> The network stack uses the kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
> 
> For the bulk API to perform efficiently, slab fragmentation needs to be
> low. Especially for the SLUB allocator, the efficiency of the bulk free
> API depends on objects belonging to the same slab (page).
> 
> When running different network performance microbenchmarks, I started
> to notice that performance was (slightly) reduced when machines had
> longer uptimes. I believe the cause was that 'skbuff_head_cache' got
> aliased/merged into the general slab cache for 256-byte sized objects
> (with my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> For the SKB kmem_cache, the network stack has various other reasons
> for not merging, but they vary depending on the kernel config (e.g.
> CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE for
> this kmem_cache to get the most out of the kmem_cache_{alloc,free}_bulk
> APIs.
> 
> When CONFIG_SLUB_TINY is configured, the bulk APIs are essentially
> disabled. Thus, for this case, drop the SLAB_NO_MERGE flag.
> 
> Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  net/core/skbuff.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index a298992060e6..92aee3e0376a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4750,12 +4750,23 @@ static void skb_extensions_init(void)
>  static void skb_extensions_init(void) {}
>  #endif
>  
> +/* The SKB kmem_cache slab is critical for network performance.  Never
> + * merge/alias the slab with similar sized objects.  This avoids fragmentation
> + * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
> + */
> +#ifndef CONFIG_SLUB_TINY
> +#define FLAG_SKB_NO_MERGE	SLAB_NO_MERGE
> +#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
> +#define FLAG_SKB_NO_MERGE	0
> +#endif
> +
>  void __init skb_init(void)
>  {
>  	skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
>  					      sizeof(struct sk_buff),
>  					      0,
> -					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> +					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
> +						FLAG_SKB_NO_MERGE,
>  					      offsetof(struct sk_buff, cb),
>  					      sizeof_field(struct sk_buff, cb),
>  					      NULL);
> 
>
Vlastimil Babka Aug. 18, 2023, 3:20 p.m. UTC | #4
On 8/18/23 14:32, Jesper Dangaard Brouer wrote:
> 
> 
> On 15/08/2023 17.53, Matthew Wilcox wrote:
>> On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
>>> For the bulk API to perform efficiently, slab fragmentation needs to be
>>> low. Especially for the SLUB allocator, the efficiency of the bulk free
>>> API depends on objects belonging to the same slab (page).
>>
>> Hey Jesper,
>>
>> You probably haven't seen this patch series from Vlastimil:
>>
>> https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/
>>
>> I wonder if you'd like to give it a try?  It should provide some immunity
>> to this problem, and might even be faster than the current approach.
>> If it isn't, it'd be good to understand why, and if it could be improved.

I didn't Cc Jesper on that yet, as the initial attempt was focused on
the maple tree nodes use case. But you'll notice using the percpu array
requires the cache to be created with SLAB_NO_MERGE anyway, so this
patch would still be necessary :)

> I took a quick look at:
>  - https://lore.kernel.org/linux-mm/20230810163627.6206-11-vbabka@suse.cz/#Z31mm:slub.c
> 
> To Vlastimil, sorry but I don't think this approach with spin_lock will
> be faster than SLUB's normal fast-path using this_cpu_cmpxchg.
> 
> My experience is that SLUB's this_cpu_cmpxchg trick is faster than a spin_lock.
> 
> On my testlab CPU E5-1650 v4 @ 3.60GHz:
>  - spin_lock+unlock : 34 cycles(tsc) 9.485 ns
>  - this_cpu_cmpxchg :  5 cycles(tsc) 1.585 ns
>  - locked cmpxchg   : 18 cycles(tsc) 5.006 ns

Hm, that's an unexpected difference for spin_lock+unlock, where AFAIK
spin_lock is basically a locked cmpxchg and unlock a simple write. I
assume these measurements are on an uncontended lock?

> SLUB does use a cmpxchg_double which I don't have a microbench for.

Yeah, it's possible the _double will be slower. The locking will have
to be considered more thoroughly for the percpu array.

>> No objection to this patch going in for now, of course.
>>
>
Jakub Kicinski Aug. 18, 2023, 4:26 p.m. UTC | #5
On Tue, 15 Aug 2023 17:17:36 +0200 Jesper Dangaard Brouer wrote:
> Subject: [PATCH net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache

Has this been a problem "forever"?
We're getting late in the release cycle for non-regression & non-crash
changes, even if the regression potential is small.
Jesper Dangaard Brouer Aug. 18, 2023, 7:59 p.m. UTC | #6
On 18/08/2023 18.26, Jakub Kicinski wrote:
> On Tue, 15 Aug 2023 17:17:36 +0200 Jesper Dangaard Brouer wrote:
>> Subject: [PATCH net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
> 
> Has this been a problem "forever"?
> We're getting late in the release cycle for non-regression & non-crash
> changes, even if the regression potential is small.
> 

This patch is not dangerous ;-)

The SKB slab/kmem_cache is already almost always "no_merge" (due to the
flags it uses), but certain kernel .config combinations can cause it to
be merged (with other 256-byte slabs).  I happened to hit this config
combination in my testlab and noticed the slight performance impact.
If anything, this patch makes the behaviour more consistent.
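
(For anyone wanting to check a given .config: with SLUB sysfs, a
merged cache shows up under /sys/kernel/slab as a symlink to the shared
cache, so a minimal userspace check could look roughly like the sketch
below; the alias name in the comment is only an example.)

/* Userspace sketch: with SLUB sysfs, a merged/aliased cache is exposed
 * in /sys/kernel/slab as a symlink to the shared cache directory (e.g.
 * ":A-0000256"), while an unmerged cache is a plain directory.
 * Minimal error handling; illustration only.
 */
#include <stdio.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
	const char *path = "/sys/kernel/slab/skbuff_head_cache";
	char target[PATH_MAX];
	ssize_t len = readlink(path, target, sizeof(target) - 1);

	if (len < 0) {
		/* Not a symlink (or no SLUB sysfs): not merged. */
		printf("%s: not a symlink, cache not merged\n", path);
		return 0;
	}
	target[len] = '\0';
	printf("%s -> %s (merged/aliased)\n", path, target);
	return 0;
}

With SLAB_NO_MERGE set (and CONFIG_SLUB_TINY not set), the cache should
stay unmerged and show up as its own directory regardless of the rest
of the .config.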

--Jesper
patchwork-bot+netdevbpf@kernel.org Aug. 18, 2023, 10:20 p.m. UTC | #7
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 15 Aug 2023 17:17:36 +0200 you wrote:
> Since the v6.5-rc1 MM-tree merge, which brought in the new flag
> SLAB_NO_MERGE via commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache
> flag SLAB_NO_MERGE"), now is the time to use this flag for networking,
> as proposed earlier (see the Link below).
> 
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> The network stack uses the kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
> 
> [...]

Here is the summary with links:
  - [net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
    https://git.kernel.org/netdev/net-next/c/0a0643164da4

You are awesome, thank you!
Alexander Lobakin Aug. 21, 2023, 1:55 p.m. UTC | #8
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 15 Aug 2023 17:17:36 +0200

> Since the v6.5-rc1 MM-tree merge, which brought in the new flag
> SLAB_NO_MERGE via commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache
> flag SLAB_NO_MERGE"), now is the time to use this flag for networking,
> as proposed earlier (see the Link below).
> 
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> The network stack uses the kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
> 
> For the bulk API to perform efficiently, slab fragmentation needs to be
> low. Especially for the SLUB allocator, the efficiency of the bulk free
> API depends on objects belonging to the same slab (page).
> 
> When running different network performance microbenchmarks, I started
> to notice that performance was (slightly) reduced when machines had
> longer uptimes. I believe the cause was that 'skbuff_head_cache' got
> aliased/merged into the general slab cache for 256-byte sized objects
> (with my kernel config, without CONFIG_HARDENED_USERCOPY).
> 
> For the SKB kmem_cache, the network stack has various other reasons
> for not merging, but they vary depending on the kernel config (e.g.
> CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE for
> this kmem_cache to get the most out of the kmem_cache_{alloc,free}_bulk
> APIs.
> 
> When CONFIG_SLUB_TINY is configured, the bulk APIs are essentially
> disabled. Thus, for this case, drop the SLAB_NO_MERGE flag.
> 
> Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  net/core/skbuff.c |   13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index a298992060e6..92aee3e0376a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4750,12 +4750,23 @@ static void skb_extensions_init(void)
>  static void skb_extensions_init(void) {}
>  #endif
>  
> +/* The SKB kmem_cache slab is critical for network performance.  Never
> + * merge/alias the slab with similar sized objects.  This avoids fragmentation
> + * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
> + */
> +#ifndef CONFIG_SLUB_TINY
> +#define FLAG_SKB_NO_MERGE	SLAB_NO_MERGE
> +#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
> +#define FLAG_SKB_NO_MERGE	0
> +#endif
> +
>  void __init skb_init(void)
>  {
>  	skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
>  					      sizeof(struct sk_buff),
>  					      0,
> -					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> +					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
> +						FLAG_SKB_NO_MERGE,

That alignment tho xD

>  					      offsetof(struct sk_buff, cb),
>  					      sizeof_field(struct sk_buff, cb),
>  					      NULL);

Thanks,
Olek

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a298992060e6..92aee3e0376a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4750,12 +4750,23 @@  static void skb_extensions_init(void)
 static void skb_extensions_init(void) {}
 #endif
 
+/* The SKB kmem_cache slab is critical for network performance.  Never
+ * merge/alias the slab with similar sized objects.  This avoids fragmentation
+ * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
+ */
+#ifndef CONFIG_SLUB_TINY
+#define FLAG_SKB_NO_MERGE	SLAB_NO_MERGE
+#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
+#define FLAG_SKB_NO_MERGE	0
+#endif
+
 void __init skb_init(void)
 {
 	skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
 					      sizeof(struct sk_buff),
 					      0,
-					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
+						FLAG_SKB_NO_MERGE,
 					      offsetof(struct sk_buff, cb),
 					      sizeof_field(struct sk_buff, cb),
 					      NULL);