Message ID | 20210920154816.31832-1-42.hyeyoo@gmail.com (mailing list archive)
---|---
State | RFC
Series | [RFC,v2] mm, sl[au]b: Introduce lockless cache
On Mon, Sep 20, 2021 at 03:48:16PM +0000, Hyeonggon Yoo wrote:
> +#define KMEM_LOCKLESS_CACHE_QUEUE_SIZE 64

I would suggest that, to be nice to the percpu allocator, this be
one less than 2^n.

> +struct kmem_lockless_cache {
> +	void *queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE];
> +	unsigned int size;
> +};

I would also suggest that 'size' be first as it is going to be accessed
every time, and then there's a reasonable chance that queue[size - 1]
will be in the same cacheline. CPUs will tend to handle that better.

> +/**
> + * kmem_cache_alloc_cached - try to allocate from cache without lock
> + * @s: slab cache
> + * @flags: SLAB flags
> + *
> + * Try to allocate from cache without lock. If fails, fill the lockless cache
> + * using bulk alloc API
> + *
> + * Be sure that there's no race condition.
> + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> + *
> + * Return: a pointer to free object on allocation success, NULL on failure.
> + */
> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> +
> +	if (cache->size) /* fastpath without lock */
> +		return cache->queue[--cache->size];
> +
> +	/* slowpath */
> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);

Go back to the Bonwick paper and look at the magazine section again.
You have to allocate _half_ the size of the queue, otherwise you get
into pathological situations where you start to free and allocate
every time.

> +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> +
> +	/* Is there better way to do this? */
> +	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE)
> +		kmem_cache_free(s, cache->queue[--cache->size]);

Yes.
	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
		kmem_cache_free_bulk(s, KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2,
				&cache->queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2]);
		cache->size = KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2;
	}

(check the maths on that; it might have some off-by-one)
Hello Matthew,

I love your suggestion to make it more cache-friendly (and to batch the
allocation), and it sounds good to me. I appreciate it very much. But
give me some time (from one day to a few days) to read the relevant
papers, and then I'll reply to your suggestions and send v3 with them in
mind.

Cheers,
Hyeonggon
> @@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> }
> EXPORT_SYMBOL(kmem_cache_create);
>
> +/**
> + * kmem_cache_alloc_cached - try to allocate from cache without lock
> + * @s: slab cache
> + * @flags: SLAB flags
> + *
> + * Try to allocate from cache without lock. If fails, fill the lockless cache
> + * using bulk alloc API
> + *
> + * Be sure that there's no race condition.
> + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> + *
> + * Return: a pointer to free object on allocation success, NULL on failure.
> + */
> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> +
> +	if (cache->size) /* fastpath without lock */
> +		return cache->queue[--cache->size];
> +
> +	/* slowpath */
> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> +	if (cache->size)
> +		return cache->queue[--cache->size];
> +	else
> +		return NULL;
> +}
> +EXPORT_SYMBOL(kmem_cache_alloc_cached);

How does this work for preempt? You seem to assume that the function is
invoked with preempt disabled, but then it could only be used with
GFP_ATOMIC.

There are basically two types of use cases for this:

1) Freeing can happen from interrupts
2) Freeing cannot happen from interrupts

What I implemented for IOPOLL doesn't need to care about interrupts,
hence preemption disable is enough. But we do need that, at least.

And if you don't care about users that free from irq/softirq, then that
should be mentioned. Locking context should be mentioned, too. The above
may be just fine IFF both alloc and free are protected by a lock higher
up. If not, both need preemption disabled and GFP_ATOMIC. I'd suggest
making the get/put cpu part of the API internally.
> +/**
> + * kmem_cache_free_cached - return object to cache
> + * @s: slab cache
> + * @p: pointer to free
> + */
> +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));

Don't use BUG_ON, just do:

	if (WARN_ON_ONCE(!(s->flags & SLAB_LOCKLESS_CACHE))) {
		kmem_cache_free(s, p);
		return;
	}
On Mon, Sep 20, 2021 at 11:01:16PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 03:48:16PM +0000, Hyeonggon Yoo wrote:
> > +#define KMEM_LOCKLESS_CACHE_QUEUE_SIZE 64
>
> I would suggest that, to be nice to the percpu allocator, this be
> one less than 2^n.

I think first we need to discuss whether KMEM_LOCKLESS_CACHE_QUEUE_SIZE
should be a constant, a per-kmem_cache value, or a global boot parameter.

But even before that, we need to discuss how to manage the magazine,
because that affects what KMEM_LOCKLESS_CACHE_QUEUE_SIZE will be.

> > +struct kmem_lockless_cache {
> > +	void *queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE];
> > +	unsigned int size;
> > +};
>
> I would also suggest that 'size' be first as it is going to be accessed
> every time, and then there's a reasonable chance that queue[size - 1] will
> be in the same cacheline. CPUs will tend to handle that better.

That looks good to me; as you said, there's a chance that 'size' and
some of the queue's elements (hopefully queue[size - 1]) are in the same
cacheline.

Plus, in v2 I didn't consider the cache line when determining the
position of kmem_lockless_cache in kmem_cache. I think we need to
reconsider this too.

> > +/**
> > + * kmem_cache_alloc_cached - try to allocate from cache without lock
> > + * @s: slab cache
> > + * @flags: SLAB flags
> > + *
> > + * Try to allocate from cache without lock. If fails, fill the lockless cache
> > + * using bulk alloc API
> > + *
> > + * Be sure that there's no race condition.
> > + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> > + *
> > + * Return: a pointer to free object on allocation success, NULL on failure.
> > + */
> > +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> > +
> > +	if (cache->size) /* fastpath without lock */
> > +		return cache->queue[--cache->size];
> > +
> > +	/* slowpath */
> > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
>
> Go back to the Bonwick paper and look at the magazine section again.
> You have to allocate _half_ the size of the queue, otherwise you get
> into pathological situations where you start to free and allocate
> every time.

I want to ask you where the idea of allocating 'half' the size of the
queue came from. The paper you sent does not work with a single queue
(magazine). Instead, it manages a pool of magazines.

And after reading the paper, I see that managing a pool of magazines
(where M is a boot parameter) is a valid approach to reduce hitting the
slowpath.

Thanks,
Hyeonggon Yoo

> > +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> > +
> > +	/* Is there better way to do this? */
> > +	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE)
> > +		kmem_cache_free(s, cache->queue[--cache->size]);
>
> Yes.
>
> 	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
> 		kmem_cache_free_bulk(s, KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2,
> 				&cache->queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2]);
> 		cache->size = KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2;
> 	}
>
> (check the maths on that; it might have some off-by-one)
On Tue, Sep 21, 2021 at 03:42:39PM +0000, Hyeonggon Yoo wrote:
> > > +	/* slowpath */
> > > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> >
> > Go back to the Bonwick paper and look at the magazine section again.
> > You have to allocate _half_ the size of the queue, otherwise you get
> > into pathological situations where you start to free and allocate
> > every time.
>
> I want to ask you where the idea of allocating 'half' the size of the
> queue came from. The paper you sent does not work with a single queue
> (magazine). Instead, it manages a pool of magazines.
>
> And after reading the paper, I see that managing a pool of magazines
> (where M is a boot parameter) is a valid approach to reduce hitting the
> slowpath.

Bonwick uses two magazines per cpu; if both are empty, one is replaced
with a full one. If both are full, one is replaced with an empty one.
Our slab implementation doesn't provide magazine allocation, but it does
provide bulk allocation. So translating the Bonwick implementation to
our implementation, we need to bulk-allocate or bulk-free half of the
array at any time.
On Tue, Sep 21, 2021 at 09:37:40AM -0600, Jens Axboe wrote:
> > @@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> > }
> > EXPORT_SYMBOL(kmem_cache_create);
> >
> > +/**
> > + * kmem_cache_alloc_cached - try to allocate from cache without lock
> > + * @s: slab cache
> > + * @flags: SLAB flags
> > + *
> > + * Try to allocate from cache without lock. If fails, fill the lockless cache
> > + * using bulk alloc API
> > + *
> > + * Be sure that there's no race condition.
> > + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> > + *
> > + * Return: a pointer to free object on allocation success, NULL on failure.
> > + */
> > +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> > +
> > +	if (cache->size) /* fastpath without lock */
> > +		return cache->queue[--cache->size];
> > +
> > +	/* slowpath */
> > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> > +	if (cache->size)
> > +		return cache->queue[--cache->size];
> > +	else
> > +		return NULL;
> > +}
> > +EXPORT_SYMBOL(kmem_cache_alloc_cached);

Hello Jens, I'm so happy that you gave a comment.

> What I implemented for IOPOLL doesn't need to care about interrupts,
> hence preemption disable is enough. But we do need that, at least.

To be honest, that was my mistake. I was using the percpu API
incorrectly. It's a shame :> Thank you for pointing that out.

Fixed it in v3 (work in progress now).

> There are basically two types of use cases for this:
>
> 1) Freeing can happen from interrupts
> 2) Freeing cannot happen from interrupts

I considered only case 2) when writing the code. Well, to support 1),
I think there are two ways:

a) internally call kmem_cache_free when in_interrupt() is true
b) the caller must disable interrupts when freeing

I think a) is okay; what do you think?

Note that b) can be problematic with kmem_cache_free_bulk, as its
documentation says interrupts must be enabled.

> How does this work for preempt? You seem to assume that the function is
> invoked with preempt disabled, but then it could only be used with
> GFP_ATOMIC.

I wrote it with just the same prototype as kmem_cache_alloc, and the
gfpflags parameter is unnecessary as you said. Okay, let's remove it in
v3.

> And if you don't care about users that free from irq/softirq, then that
> should be mentioned. Locking context should be mentioned, too. The above
> may be just fine IFF both alloc and free are protected by a lock higher
> up. If not, both need preemption disabled and GFP_ATOMIC. I'd suggest
> making the get/put cpu part of the API internally.

Actually, I didn't put much effort into the documentation (especially on
what context is expected before calling these functions). The comments
will be updated in v3, with your comment in mind.

> > +/**
> > + * kmem_cache_free_cached - return object to cache
> > + * @s: slab cache
> > + * @p: pointer to free
> > + */
> > +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
>
> Don't use BUG_ON, just do:
>
> 	if (WARN_ON_ONCE(!(s->flags & SLAB_LOCKLESS_CACHE))) {
> 		kmem_cache_free(s, p);
> 		return;
> 	}

Ok. I agree WARN is better than BUG.

Thanks,
Hyeonggon Yoo

> --
> Jens Axboe
Hello Matthew. There's good news. In v3 (work in progress now), I fixed
some bugs (I hate kernel panics!), and for testing, I made NAPI use it.
It works pretty well.

On Tue, Sep 21, 2021 at 05:17:02PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 21, 2021 at 03:42:39PM +0000, Hyeonggon Yoo wrote:
> > > > +	/* slowpath */
> > > > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > > > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> > >
> > > Go back to the Bonwick paper and look at the magazine section again.
> > > You have to allocate _half_ the size of the queue, otherwise you get
> > > into pathological situations where you start to free and allocate
> > > every time.
> >
> > I want to ask you where the idea of allocating 'half' the size of the
> > queue came from. The paper you sent does not work with a single queue
> > (magazine). Instead, it manages a pool of magazines.
> >
> > And after reading the paper, I see that managing a pool of magazines
> > (where M is a boot parameter) is a valid approach to reduce hitting
> > the slowpath.
>
> Bonwick uses two magazines per cpu; if both are empty, one is replaced
> with a full one. If both are full, one is replaced with an empty one.
> Our slab implementation doesn't provide magazine allocation, but it does
> provide bulk allocation. So translating the Bonwick implementation to
> our implementation, we need to bulk-allocate or bulk-free half of the
> array at any time.

Is there a reason that the number should be 'half'?

What about something like this:

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 884d3311cd8e..f32736302d53 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -455,12 +455,13 @@ void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
 	}
 
 	cache = get_cpu_ptr(s->cache);
-	if (cache->size) /* fastpath without lock */
+	if (cache->size) /* fastpath without lock */
 		p = cache->queue[--cache->size];
 	else {
 		/* slowpath */
-		cache->size = kmem_cache_alloc_bulk(s, gfpflags,
-				KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
+		cache->size += kmem_cache_alloc_bulk(s, gfpflags,
+				KMEM_LOCKLESS_CACHE_BATCHCOUNT,
+				cache->queue);
 		if (cache->size)
 			p = cache->queue[--cache->size];
 		else
@@ -491,13 +492,13 @@ void kmem_cache_free_cached(struct kmem_cache *s, void *p)
 	cache = get_cpu_ptr(s->cache);
 	if (cache->size < KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
 		cache->queue[cache->size++] = p;
-		put_cpu_ptr(s->cache);
-		return ;
+	} else {
+		kmem_cache_free_bulk(s,
+				KMEM_LOCKLESS_CACHE_BATCHCOUNT,
+				cache->queue - KMEM_LOCKLESS_CACHE_BATCHCOUNT);
+		cache->size -= KMEM_LOCKLESS_CACHE_BATCHCOUNT;
 	}
 	put_cpu_ptr(s->cache);
-
-	/* Is there better way to do this? */
-	kmem_cache_free(s, p);
 }
 EXPORT_SYMBOL(kmem_cache_free_cached);
> @@ -491,13 +492,13 @@ void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> 	cache = get_cpu_ptr(s->cache);
> 	if (cache->size < KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
> 		cache->queue[cache->size++] = p;
> -		put_cpu_ptr(s->cache);
> -		return ;
> +	} else {
> +		kmem_cache_free_bulk(s,
> +				KMEM_LOCKLESS_CACHE_BATCHCOUNT,
> +				cache->queue - KMEM_LOCKLESS_CACHE_BATCHCOUNT);
> +		cache->size -= KMEM_LOCKLESS_CACHE_BATCHCOUNT;
> 	}
> 	put_cpu_ptr(s->cache);
> -
> -	/* Is there better way to do this? */
> -	kmem_cache_free(s, p);
> }
> EXPORT_SYMBOL(kmem_cache_free_cached);

I sent you the wrong code. The above was buggy code from some hours ago
because of cache->queue - KMEM_LOCKLESS_CACHE_BATCHCOUNT.

So that is now:

	cache = get_cpu_ptr(s->cache);
	if (cache->size < KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
		cache->queue[cache->size++] = p;
	} else {
		kmem_cache_free_bulk(s, KMEM_LOCKLESS_CACHE_BATCHCOUNT,
				cache->queue + KMEM_LOCKLESS_CACHE_QUEUE_SIZE
				- KMEM_LOCKLESS_CACHE_BATCHCOUNT);
		cache->size -= KMEM_LOCKLESS_CACHE_BATCHCOUNT;
	}
	put_cpu_ptr(s->cache);
On 9/22/21 2:19 AM, Hyeonggon Yoo wrote:
> On Tue, Sep 21, 2021 at 09:37:40AM -0600, Jens Axboe wrote:
>>> @@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
>>> }
>>> EXPORT_SYMBOL(kmem_cache_create);
>>>
>>> +/**
>>> + * kmem_cache_alloc_cached - try to allocate from cache without lock
>>> + * @s: slab cache
>>> + * @flags: SLAB flags
>>> + *
>>> + * Try to allocate from cache without lock. If fails, fill the lockless cache
>>> + * using bulk alloc API
>>> + *
>>> + * Be sure that there's no race condition.
>>> + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
>>> + *
>>> + * Return: a pointer to free object on allocation success, NULL on failure.
>>> + */
>>> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
>>> +{
>>> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
>>> +
>>> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
>>> +
>>> +	if (cache->size) /* fastpath without lock */
>>> +		return cache->queue[--cache->size];
>>> +
>>> +	/* slowpath */
>>> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
>>> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
>>> +	if (cache->size)
>>> +		return cache->queue[--cache->size];
>>> +	else
>>> +		return NULL;
>>> +}
>>> +EXPORT_SYMBOL(kmem_cache_alloc_cached);
>
> Hello Jens, I'm so happy that you gave a comment.
>
>> What I implemented for IOPOLL doesn't need to care about interrupts,
>> hence preemption disable is enough. But we do need that, at least.
>
> To be honest, that was my mistake. I was using the percpu API
> incorrectly. It's a shame :> Thank you for pointing that out.
>
> Fixed it in v3 (work in progress now)

Another thing to fix from there, just make it:

	if (cache->size)
		return cached item
	return NULL;

No need for an if/else with the return.

>> There are basically two types of use cases for this:
>>
>> 1) Freeing can happen from interrupts
>> 2) Freeing cannot happen from interrupts
>
> I considered only case 2) when writing the code. Well, to support 1),
> I think there are two ways:
>
> a) internally call kmem_cache_free when in_interrupt() is true
> b) the caller must disable interrupts when freeing
>
> I think a) is okay; what do you think?

If the API doesn't support freeing from interrupts, then I'd make that
the rule. The caller should know better if that can happen, and then
just use kmem_cache_free() if in a problematic context. That avoids
polluting the fast path with that check. I'd still make it a
WARN_ON_ONCE() as described, and it can hopefully get removed later.

But note it's not just the freeing side that would be problematic. If
you do support from-irq freeing, then the alloc side would need
local_irq_save/restore() instead of just basic preempt protection. That
would make it more expensive all around.

> note that b) can be problematic with kmem_cache_free_bulk
> as it says interrupts must be enabled.

Not sure that's actually true, apart from being bitrot.

>> How does this work for preempt? You seem to assume that the function is
>> invoked with preempt disabled, but then it could only be used with
>> GFP_ATOMIC.
>
> I wrote it with just the same prototype as kmem_cache_alloc, and the
> gfpflags parameter is unnecessary as you said. Okay, let's remove it
> in v3.

Please also run some actual comparative benchmarks on this, with a real
workload. You also need an internal user of this; a stand-alone feature
isn't really useful. It needs someone using it, and the benchmarks would
be based on that (or those) use cases.

Another consideration - is it going to be OK to ignore slab pruning for
this? Probably. But it needs to be thought about.

In terms of API, not sure why these are separate functions. Creating the
cache with SLAB_FOO should be enough, and then kmem_cache_alloc() tries
the cache first. If nothing there, fall back to the regular path.
Something like this would only be used for cases that have a high
alloc+free rate; it would seem like a shame to need two function calls
for the case where you just want a piece of memory and the cache is
empty.
On Wed, Sep 22, 2021 at 06:58:00AM -0600, Jens Axboe wrote:
> On 9/22/21 2:19 AM, Hyeonggon Yoo wrote:
> > On Tue, Sep 21, 2021 at 09:37:40AM -0600, Jens Axboe wrote:
> >>> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> >>> +{
> >>> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> >>> +
> >>> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> >>> +
> >>> +	if (cache->size) /* fastpath without lock */
> >>> +		return cache->queue[--cache->size];
> >>> +
> >>> +	/* slowpath */
> >>> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> >>> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> >>> +	if (cache->size)
> >>> +		return cache->queue[--cache->size];
> >>> +	else
> >>> +		return NULL;
> >>> +}
> >>> +EXPORT_SYMBOL(kmem_cache_alloc_cached);
> >
> > Hello Jens, I'm so happy that you gave a comment.
> >
> >> What I implemented for IOPOLL doesn't need to care about interrupts,
> >> hence preemption disable is enough. But we do need that, at least.
> >
> > To be honest, that was my mistake. I was using the percpu API
> > incorrectly. It's a shame :> Thank you for pointing that out.
> >
> > Fixed it in v3 (work in progress now)

Hello Jens, thank you so much for reviewing this. Your thoughtful review
helps a lot, and I feel we're going in the right direction.

Anyway, let's start the discussion again.

> Another thing to fix from there, just make it:
>
> 	if (cache->size)
> 		return cached item
> 	return NULL;
>
> No need for an if/else with the return.

That looks better. I'll consider that in the next version.

> Please also run some actual comparative benchmarks on this, with a real
> workload. You also need an internal user of this; a stand-alone feature
> isn't really useful. It needs someone using it, and the benchmarks would
> be based on that (or those) use cases.

Yeah, that's an important point too. As my project affects performance,
I should measure the impact. And as you said, to measure the impact, I
should find use cases.

For testing, I made NAPI (network layer IO polling) use this; it may
help with measuring. So, okay, I'll show its impact on performance as
the work progresses.

> Another consideration - is it going to be OK to ignore slab pruning for
> this? Probably. But it needs to be thought about.

I think I should consider slab pruning too. That's one of the benefits
of using the slab allocator. I'll add this to the TODOs.

> In terms of API, not sure why these are separate functions. Creating the
> cache with SLAB_FOO should be enough, and then kmem_cache_alloc() tries
> the cache first. If nothing there, fall back to the regular path.
> Something like this would only be used for cases that have a high
> alloc+free rate; it would seem like a shame to need two function calls
> for the case where you just want a piece of memory and the cache is
> empty.

It would be wonderful if we could do that, and I'm working on this. I'm
testing it in my private repository and looking at what happens. It
seems there are some issues to solve.

> If the API doesn't support freeing from interrupts, then I'd make that
> the rule. The caller should know better if that can happen, and then
> just use kmem_cache_free() if in a problematic context. That avoids
> polluting the fast path with that check. I'd still make it a
> WARN_ON_ONCE() as described, and it can hopefully get removed later.

Hmm.. as you mentioned above, to provide a unified API, we should check
the context in kmem_cache_alloc anyway. But yes, it is better not to
allow calling kmem_cache_{alloc,free}_cached from interrupts.

> But note it's not just the freeing side that would be problematic. If
> you do support from-irq freeing, then the alloc side would need
> local_irq_save/restore() instead of just basic preempt protection. That
> would make it more expensive all around.

I think so too. One reason for this work is to reduce the cost of
disabling interrupts. That's costly.

> > note that b) can be problematic with kmem_cache_free_bulk
> > as it says interrupts must be enabled.
>
> Not sure that's actually true, apart from being bitrot.

I don't get what you mean. Can you tell me in detail?
(Yeah, maybe I'm misunderstanding kmem_cache_{alloc,free}_bulk.)

Thanks,
Hyeonggon Yoo

> --
> Jens Axboe
Hello there!

In v1 and v2, I showed a simple proof of concept of the lockless cache.
After some discussions, it turned out that there are some issues to
solve, and it will take some time to solve them.

I made a git repository on GitHub to share its progress:
	https://github.com/hygoni/linux.git

This is based on v5.15-rc2, and the main branch will be
'lockless_cache'. There's *nothing* in the repo now; it will be updated
irregularly, hopefully every 1-3 days, or every week. (Note that I'm not
a full-time kernel developer; it's a hobby.)

I'll make use of the 'issues' and 'discussions' tabs (GitHub features to
track issues and discussions). You can join if you're interested.

Thank you for your interest in this project!

--
Hyeonggon
On Wed, 22 Sep 2021 06:58:00 -0600 Jens Axboe wrote:
> > I considered only case 2) when writing the code. Well, to support 1),
> > I think there are two ways:
> >
> > a) internally call kmem_cache_free when in_interrupt() is true
> > b) the caller must disable interrupts when freeing
> >
> > I think a) is okay; what do you think?
>
> If the API doesn't support freeing from interrupts, then I'd make that
> the rule. Caller should know better if that can happen, and then just
> use kmem_cache_free() if in a problematic context. That avoids polluting
> the fast path with that check. I'd still make it a WARN_ON_ONCE() as
> described and it can get removed later, hopefully.

Shooting from the hip a little, but if I'm getting the context right,
this is all very similar to the skb cache, so lockdep_assert_in_softirq()
may be useful:

/*
 * Acceptable for protecting per-CPU resources accessed from BH.
 * Much like in_softirq() - semantics are ambiguous, use carefully.
 */
#define lockdep_assert_in_softirq()					\
do {									\
	WARN_ON_ONCE(__lockdep_enabled			&&		\
		     (!in_softirq() || in_irq() || in_nmi()));		\
} while (0)
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 083f3ce550bc..091f514dc8e0 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -120,6 +120,9 @@
 /* Slab deactivation flag */
 #define SLAB_DEACTIVATED	((slab_flags_t __force)0x10000000U)
 
+/* use percpu lockless cache */
+#define SLAB_LOCKLESS_CACHE	((slab_flags_t __force)0x20000000U)
+
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
@@ -327,6 +330,13 @@ enum kmalloc_cache_type {
 	NR_KMALLOC_TYPES
 };
 
+#define KMEM_LOCKLESS_CACHE_QUEUE_SIZE 64
+
+struct kmem_lockless_cache {
+	void *queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE];
+	unsigned int size;
+};
+
 #ifndef CONFIG_SLOB
 extern struct kmem_cache *
 kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
@@ -429,6 +439,19 @@ void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags) __assume_slab_alignment __malloc;
 void kmem_cache_free(struct kmem_cache *, void *);
 
+#ifndef CONFIG_SLOB
+
+void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags);
+void kmem_cache_free_cached(struct kmem_cache *s, void *p);
+
+#else
+
+#define kmem_cache_alloc_cached kmem_cache_alloc
+#define kmem_cache_free_cached kmem_cache_free
+
+#endif /* CONFIG_SLOB */
+
+
 /*
  * Bulk allocation and freeing operations. These are accelerated in an
  * allocator specific way to avoid taking locks repeatedly or building
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 3aa5e1e73ab6..9f3161f38a8a 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -85,6 +85,8 @@ struct kmem_cache {
 	unsigned int usersize;		/* Usercopy region size */
 
 	struct kmem_cache_node *node[MAX_NUMNODES];
+
+	struct kmem_lockless_cache __percpu *cache; /* percpu lockless cache */
 };
 
 static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 85499f0586b0..1dc3527efba8 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -96,6 +96,7 @@ struct kmem_cache {
 	unsigned int object_size;/* The size of an object without metadata */
 	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
+	struct kmem_lockless_cache __percpu *cache; /* percpu lockless cache */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
 	unsigned int cpu_partial;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ec2bb0beed75..5b8e4d5a644d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -262,6 +262,13 @@ static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
+	if (flags & SLAB_LOCKLESS_CACHE) {
+		s->cache = alloc_percpu(struct kmem_lockless_cache);
+		if (!s->cache)
+			goto out_free_cache;
+		s->cache->size = 0;
+	}
+
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
@@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
+/**
+ * kmem_cache_alloc_cached - try to allocate from cache without lock
+ * @s: slab cache
+ * @flags: SLAB flags
+ *
+ * Try to allocate from cache without lock. If fails, fill the lockless cache
+ * using bulk alloc API
+ *
+ * Be sure that there's no race condition.
+ * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
+ *
+ * Return: a pointer to free object on allocation success, NULL on failure.
+ */
+void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
+{
+	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
+
+	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
+
+	if (cache->size) /* fastpath without lock */
+		return cache->queue[--cache->size];
+
+	/* slowpath */
+	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
+			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
+	if (cache->size)
+		return cache->queue[--cache->size];
+	else
+		return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_cached);
+
+/**
+ * kmem_cache_free_cached - return object to cache
+ * @s: slab cache
+ * @p: pointer to free
+ */
+void kmem_cache_free_cached(struct kmem_cache *s, void *p)
+{
+	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
+
+	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
+
+	/* Is there better way to do this? */
+	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE)
+		kmem_cache_free(s, cache->queue[--cache->size]);
+
+	cache->queue[cache->size++] = p;
+}
+EXPORT_SYMBOL(kmem_cache_free_cached);
+
 static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
 {
 	LIST_HEAD(to_destroy);
@@ -460,6 +518,8 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
 
 static int shutdown_cache(struct kmem_cache *s)
 {
+	struct kmem_lockless_cache *cache;
+
 	/* free asan quarantined objects */
 	kasan_cache_shutdown(s);
 
@@ -468,6 +528,12 @@ static int shutdown_cache(struct kmem_cache *s)
 
 	list_del(&s->list);
 
+	if (s->flags & SLAB_LOCKLESS_CACHE) {
+		cache = this_cpu_ptr(s->cache);
+		kmem_cache_free_bulk(s, cache->size, cache->queue);
+		free_percpu(s->cache);
+	}
+
 	if (s->flags & SLAB_TYPESAFE_BY_RCU) {
#ifdef SLAB_SUPPORTS_SYSFS
 		sysfs_slab_unlink(s);