| Message ID | 169211265663.1491038.8580163757548985946.stgit@firesoul (mailing list archive) |
|---|---|
| State | Accepted |
| Commit | 0a0643164da4a1976455aa12f0a96d08ee290752 |
| Delegated to: | Netdev Maintainers |
| Series | [net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache |
On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
> For the bulk API to perform efficiently the slub fragmentation need to
> be low. Especially for the SLUB allocator, the efficiency of bulk free
> API depend on objects belonging to the same slab (page).

Hey Jesper,

You probably haven't seen this patch series from Vlastimil:

https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/

I wonder if you'd like to give it a try? It should provide some immunity
to this problem, and might even be faster than the current approach.
If it isn't, it'd be good to understand why, and if it could be improved.

No objection to this patch going in for now, of course.
On 15/08/2023 17.53, Matthew Wilcox wrote:
> On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
>> For the bulk API to perform efficiently the slub fragmentation need to
>> be low. Especially for the SLUB allocator, the efficiency of bulk free
>> API depend on objects belonging to the same slab (page).
>
> Hey Jesper,
>
> You probably haven't seen this patch series from Vlastimil:
>
> https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/
>
> I wonder if you'd like to give it a try? It should provide some immunity
> to this problem, and might even be faster than the current approach.
> If it isn't, it'd be good to understand why, and if it could be improved.

I took a quick look at:
- https://lore.kernel.org/linux-mm/20230810163627.6206-11-vbabka@suse.cz/#Z31mm:slub.c

To Vlastimil, sorry but I don't think this approach with spin_lock will
be faster than SLUB's normal fast-path using this_cpu_cmpxchg.

My experience is that SLUB this_cpu_cmpxchg trick is faster than spin_lock.

On my testlab CPU E5-1650 v4 @ 3.60GHz:
 - spin_lock+unlock : 34 cycles(tsc)  9.485 ns
 - this_cpu_cmpxchg :  5 cycles(tsc)  1.585 ns
 - locked cmpxchg   : 18 cycles(tsc)  5.006 ns

SLUB does use a cmpxchg_double which I don't have a microbench for.

> No objection to this patch going in for now, of course.
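[Editor's note] For context on where numbers like these come from, below is a minimal sketch of such a micro-benchmark, assuming a kernel-module context; it is not the actual tool used for the measurements above, and the loop count and reporting are illustrative only. Cycle counts from get_cycles() over a tight loop are a rough estimate, since interrupts and preemption add noise.

/*
 * Minimal micro-benchmark sketch (kernel-module context assumed). NOT the
 * tool used for the numbers quoted above; loop count and reporting are
 * illustrative. Loading the module prints approximate cycles per operation
 * to the kernel log.
 */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/percpu.h>
#include <linux/timex.h>	/* get_cycles() */

static DEFINE_SPINLOCK(bench_lock);
static DEFINE_PER_CPU(unsigned long, bench_var);

static int __init lock_bench_init(void)
{
	const unsigned long loops = 1000 * 1000;
	unsigned long long start, stop;
	unsigned long i;

	/* Cost of an uncontended spin_lock + spin_unlock pair */
	start = get_cycles();
	for (i = 0; i < loops; i++) {
		spin_lock(&bench_lock);
		spin_unlock(&bench_lock);
	}
	stop = get_cycles();
	pr_info("spin_lock+unlock : %llu cycles(tsc) per op\n",
		(stop - start) / loops);

	/* Cost of a per-CPU cmpxchg, the primitive compared against above */
	start = get_cycles();
	for (i = 0; i < loops; i++)
		this_cpu_cmpxchg(bench_var, i, i + 1);
	stop = get_cycles();
	pr_info("this_cpu_cmpxchg : %llu cycles(tsc) per op\n",
		(stop - start) / loops);

	return 0;
}

static void __exit lock_bench_exit(void) { }

module_init(lock_bench_init);
module_exit(lock_bench_exit);
MODULE_LICENSE("GPL");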
On 8/15/23 17:17, Jesper Dangaard Brouer wrote:
> Since v6.5-rc1 MM-tree is merged and contains a new flag SLAB_NO_MERGE
> in commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache flag SLAB_NO_MERGE")
> now is the time to use this flag for networking as proposed
> earlier see link.
>
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
>
> For the bulk API to perform efficiently the slub fragmentation need to
> be low. Especially for the SLUB allocator, the efficiency of bulk free
> API depend on objects belonging to the same slab (page).
>
> When running different network performance microbenchmarks, I started
> to notice that performance was reduced (slightly) when machines had
> longer uptimes. I believe the cause was 'skbuff_head_cache' got
> aliased/merged into the general slub for 256 bytes sized objects (with
> my kernel config, without CONFIG_HARDENED_USERCOPY).
>
> For SKB kmem_cache network stack have other various reasons for
> not merging, but it varies depending on kernel config (e.g.
> CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE
> for this kmem_cache to get most out of kmem_cache_{alloc,free}_bulk APIs.
>
> When CONFIG_SLUB_TINY is configured the bulk APIs are essentially
> disabled. Thus, for this case drop the SLAB_NO_MERGE flag.
>
> Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  net/core/skbuff.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index a298992060e6..92aee3e0376a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4750,12 +4750,23 @@ static void skb_extensions_init(void)
>  static void skb_extensions_init(void) {}
>  #endif
>
> +/* The SKB kmem_cache slab is critical for network performance. Never
> + * merge/alias the slab with similar sized objects. This avoids fragmentation
> + * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
> + */
> +#ifndef CONFIG_SLUB_TINY
> +#define FLAG_SKB_NO_MERGE	SLAB_NO_MERGE
> +#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
> +#define FLAG_SKB_NO_MERGE	0
> +#endif
> +
>  void __init skb_init(void)
>  {
>  	skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
>  					      sizeof(struct sk_buff),
>  					      0,
> -					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> +					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
> +						FLAG_SKB_NO_MERGE,
>  					      offsetof(struct sk_buff, cb),
>  					      sizeof_field(struct sk_buff, cb),
>  					      NULL);
On 8/18/23 14:32, Jesper Dangaard Brouer wrote:
> On 15/08/2023 17.53, Matthew Wilcox wrote:
>> On Tue, Aug 15, 2023 at 05:17:36PM +0200, Jesper Dangaard Brouer wrote:
>>> For the bulk API to perform efficiently the slub fragmentation need to
>>> be low. Especially for the SLUB allocator, the efficiency of bulk free
>>> API depend on objects belonging to the same slab (page).
>>
>> Hey Jesper,
>>
>> You probably haven't seen this patch series from Vlastimil:
>>
>> https://lore.kernel.org/linux-mm/20230810163627.6206-9-vbabka@suse.cz/
>>
>> I wonder if you'd like to give it a try? It should provide some immunity
>> to this problem, and might even be faster than the current approach.
>> If it isn't, it'd be good to understand why, and if it could be improved.

I didn't Cc Jesper on that yet, as the initial attempt was focused on the
maple tree nodes use case. But you'll notice using the percpu array
requires the cache to be created with SLAB_NO_MERGE anyway, so this patch
would still be necessary :)

> I took a quick look at:
> - https://lore.kernel.org/linux-mm/20230810163627.6206-11-vbabka@suse.cz/#Z31mm:slub.c
>
> To Vlastimil, sorry but I don't think this approach with spin_lock will
> be faster than SLUB's normal fast-path using this_cpu_cmpxchg.
>
> My experience is that SLUB this_cpu_cmpxchg trick is faster than spin_lock.
>
> On my testlab CPU E5-1650 v4 @ 3.60GHz:
>  - spin_lock+unlock : 34 cycles(tsc)  9.485 ns
>  - this_cpu_cmpxchg :  5 cycles(tsc)  1.585 ns
>  - locked cmpxchg   : 18 cycles(tsc)  5.006 ns

Hm, that's an unexpected difference between spin_lock+unlock and the rest,
as AFAIK spin_lock is basically a locked cmpxchg and unlock a simple write.
I assume these measurements are on an uncontended lock?

> SLUB does use a cmpxchg_double which I don't have a microbench for.

Yeah, it's possible the _double will be slower. The locking will have to be
considered more thoroughly for the percpu array.

>> No objection to this patch going in for now, of course.
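[Editor's note] A user-space sketch (C11 atomics) of the expectation Vlastimil states above: the uncontended acquire of a simple test-and-set lock is a single compare-and-swap and the release a plain release-ordered store. The toy lock below is hypothetical and is not the kernel's qspinlock implementation.

/*
 * Toy test-and-set spinlock, for illustration only. Uncontended lock is
 * one CAS; unlock is a plain store with release ordering.
 */
#include <stdatomic.h>

typedef struct { atomic_int locked; } toy_spinlock_t;

static inline void toy_spin_lock(toy_spinlock_t *l)
{
	int expected = 0;

	/* Uncontended case: succeeds on the first CAS */
	while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
						      memory_order_acquire,
						      memory_order_relaxed))
		expected = 0;	/* contended: reset and spin */
}

static inline void toy_spin_unlock(toy_spinlock_t *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}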
On Tue, 15 Aug 2023 17:17:36 +0200 Jesper Dangaard Brouer wrote:
> Subject: [PATCH net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
Has this been a problem "forever"?
We're getting late in the release cycle for non-regression & non-crash
changes, even if small regression potential..
On 18/08/2023 18.26, Jakub Kicinski wrote:
> On Tue, 15 Aug 2023 17:17:36 +0200 Jesper Dangaard Brouer wrote:
>> Subject: [PATCH net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
>
> Has this been a problem "forever"?
> We're getting late in the release cycle for non-regression & non-crash
> changes, even if small regression potential..

This patch is not dangerous ;-)

The SKB slab/kmem_cache is already almost always "no_merge" (due to the
flags it uses), but certain kernel .config combinations can cause it to be
merged (with other 256-byte slabs). I happened to hit this config
combination in my testlab and noticed the slight performance impact.
If anything, this patch makes it more consistent.

--Jesper
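[Editor's note] A simplified, illustrative sketch of the merge behaviour Jesper describes; this is not the actual mm/slab_common.c code, and cache_may_merge() is a hypothetical helper. It captures why merging is config dependent: without SLAB_NO_MERGE, the SKB cache stays separate only when some other property, such as the hardened-usercopy whitelist region, happens to forbid aliasing with the generic 256-byte cache.

/*
 * Illustrative only -- NOT the real slab merge-eligibility code.
 */
#include <linux/slab.h>
#include <linux/kconfig.h>

static bool cache_may_merge(slab_flags_t flags, size_t usersize)
{
	if (flags & SLAB_NO_MERGE)	/* explicit opt-out added by this patch */
		return false;
	if (IS_ENABLED(CONFIG_HARDENED_USERCOPY) && usersize)
		return false;		/* usercopy whitelist keeps it separate */
	return true;			/* otherwise eligible for merging/aliasing */
}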
Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Tue, 15 Aug 2023 17:17:36 +0200 you wrote:
> Since v6.5-rc1 MM-tree is merged and contains a new flag SLAB_NO_MERGE
> in commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache flag SLAB_NO_MERGE")
> now is the time to use this flag for networking as proposed
> earlier see link.
>
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
>
> [...]

Here is the summary with links:
  - [net] net: use SLAB_NO_MERGE for kmem_cache skbuff_head_cache
    https://git.kernel.org/netdev/net-next/c/0a0643164da4

You are awesome, thank you!
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 15 Aug 2023 17:17:36 +0200

> Since v6.5-rc1 MM-tree is merged and contains a new flag SLAB_NO_MERGE
> in commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache flag SLAB_NO_MERGE")
> now is the time to use this flag for networking as proposed
> earlier see link.
>
> The SKB (sk_buff) kmem_cache slab is critical for network performance.
> Network stack uses kmem_cache_{alloc,free}_bulk APIs to gain
> performance by amortising the alloc/free cost.
>
> For the bulk API to perform efficiently the slub fragmentation need to
> be low. Especially for the SLUB allocator, the efficiency of bulk free
> API depend on objects belonging to the same slab (page).
>
> When running different network performance microbenchmarks, I started
> to notice that performance was reduced (slightly) when machines had
> longer uptimes. I believe the cause was 'skbuff_head_cache' got
> aliased/merged into the general slub for 256 bytes sized objects (with
> my kernel config, without CONFIG_HARDENED_USERCOPY).
>
> For SKB kmem_cache network stack have other various reasons for
> not merging, but it varies depending on kernel config (e.g.
> CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE
> for this kmem_cache to get most out of kmem_cache_{alloc,free}_bulk APIs.
>
> When CONFIG_SLUB_TINY is configured the bulk APIs are essentially
> disabled. Thus, for this case drop the SLAB_NO_MERGE flag.
>
> Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
> Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
> ---
>  net/core/skbuff.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index a298992060e6..92aee3e0376a 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4750,12 +4750,23 @@ static void skb_extensions_init(void)
>  static void skb_extensions_init(void) {}
>  #endif
>
> +/* The SKB kmem_cache slab is critical for network performance. Never
> + * merge/alias the slab with similar sized objects. This avoids fragmentation
> + * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
> + */
> +#ifndef CONFIG_SLUB_TINY
> +#define FLAG_SKB_NO_MERGE	SLAB_NO_MERGE
> +#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
> +#define FLAG_SKB_NO_MERGE	0
> +#endif
> +
>  void __init skb_init(void)
>  {
>  	skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
>  					      sizeof(struct sk_buff),
>  					      0,
> -					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
> +					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
> +						FLAG_SKB_NO_MERGE,

That alignment tho xD

>  					      offsetof(struct sk_buff, cb),
>  					      sizeof_field(struct sk_buff, cb),
>  					      NULL);

Thanks,
Olek
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a298992060e6..92aee3e0376a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4750,12 +4750,23 @@ static void skb_extensions_init(void)
 static void skb_extensions_init(void) {}
 #endif
 
+/* The SKB kmem_cache slab is critical for network performance. Never
+ * merge/alias the slab with similar sized objects. This avoids fragmentation
+ * that hurts performance of kmem_cache_{alloc,free}_bulk APIs.
+ */
+#ifndef CONFIG_SLUB_TINY
+#define FLAG_SKB_NO_MERGE	SLAB_NO_MERGE
+#else /* CONFIG_SLUB_TINY - simple loop in kmem_cache_alloc_bulk */
+#define FLAG_SKB_NO_MERGE	0
+#endif
+
 void __init skb_init(void)
 {
 	skbuff_cache = kmem_cache_create_usercopy("skbuff_head_cache",
 					      sizeof(struct sk_buff),
 					      0,
-					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+					      SLAB_HWCACHE_ALIGN|SLAB_PANIC|
+						FLAG_SKB_NO_MERGE,
 					      offsetof(struct sk_buff, cb),
 					      sizeof_field(struct sk_buff, cb),
 					      NULL);
Since v6.5-rc1 the MM tree is merged and contains the new flag SLAB_NO_MERGE,
added in commit d0bf7d5759c1 ("mm/slab: introduce kmem_cache flag
SLAB_NO_MERGE"). Now is the time to use this flag for networking, as proposed
earlier (see link).

The SKB (sk_buff) kmem_cache slab is critical for network performance. The
network stack uses the kmem_cache_{alloc,free}_bulk APIs to gain performance
by amortising the alloc/free cost.

For the bulk API to perform efficiently, slab fragmentation needs to be low.
Especially for the SLUB allocator, the efficiency of the bulk free API depends
on objects belonging to the same slab (page).

When running different network performance microbenchmarks, I started to
notice that performance was reduced (slightly) when machines had longer
uptimes. I believe the cause was that 'skbuff_head_cache' got aliased/merged
into the general slab for 256-byte sized objects (with my kernel config,
without CONFIG_HARDENED_USERCOPY).

For the SKB kmem_cache, the network stack has various other reasons for not
merging, but they vary depending on kernel config (e.g.
CONFIG_HARDENED_USERCOPY). We want to explicitly set SLAB_NO_MERGE for this
kmem_cache to get the most out of the kmem_cache_{alloc,free}_bulk APIs.

When CONFIG_SLUB_TINY is configured the bulk APIs are essentially disabled.
Thus, for this case, drop the SLAB_NO_MERGE flag.

Link: https://lore.kernel.org/all/167396280045.539803.7540459812377220500.stgit@firesoul/
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
 net/core/skbuff.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)
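[Editor's note] As background on why same-slab objects matter for the bulk APIs named above, here is an illustrative sketch of the deferred bulk-free pattern; the helper names and batch size are hypothetical, and this is not the actual net/core/skbuff.c code.

/*
 * Illustrative sketch: collect freed objects into a small batch and release
 * them with one kmem_cache_free_bulk() call. The bulk free is cheapest when
 * the batched objects come from the same slab page, which is the efficiency
 * SLAB_NO_MERGE protects against fragmentation from a merged cache.
 */
#include <linux/slab.h>
#include <linux/skbuff.h>

#define SKB_FREE_BATCH	32	/* hypothetical batch size */

struct skb_free_batch {
	unsigned int count;
	void *objs[SKB_FREE_BATCH];
};

static void skb_batch_flush(struct kmem_cache *cache, struct skb_free_batch *b)
{
	if (b->count) {
		kmem_cache_free_bulk(cache, b->count, b->objs);
		b->count = 0;
	}
}

static void skb_batch_free(struct kmem_cache *cache, struct skb_free_batch *b,
			   struct sk_buff *skb)
{
	b->objs[b->count++] = skb;
	if (b->count == SKB_FREE_BATCH)
		skb_batch_flush(cache, b);
}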