
[RFC,v2] mm, sl[au]b: Introduce lockless cache

Message ID 20210920154816.31832-1-42.hyeyoo@gmail.com (mailing list archive)
State RFC


Commit Message

Hyeonggon Yoo Sept. 20, 2021, 3:48 p.m. UTC
This is RFC v2 of the lockless cache on slab, for situations like I/O polling.
It is untested and still just a simple proof of concept.

So there will be things to improve, and erroneous code (I'm sure of it).
Any opinions or suggestions will be appreciated a lot!

v1 is here:
        https://lore.kernel.org/linux-mm/20210919164239.49905-1-42.hyeyoo@gmail.com/

Changes since v1:
        - It was implemented as a separate layer from slab,
                but it is now part of slab.
        - Changed linked list to array

Things to think about, or things to work on:
        - Applying a limit and batchcount like SLAB
        - I doubt whether it makes sense to implement this in SLOB/SLAB.
        - Can we improve its mechanism depending on SL[AOU]B?
        - Test needed
        - Finding and fixing erroneous code :(
---
 include/linux/slab.h     | 23 ++++++++++++++
 include/linux/slab_def.h |  2 ++
 include/linux/slub_def.h |  1 +
 mm/slab_common.c         | 66 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 92 insertions(+)

Comments

Matthew Wilcox Sept. 20, 2021, 10:01 p.m. UTC | #1
On Mon, Sep 20, 2021 at 03:48:16PM +0000, Hyeonggon Yoo wrote:
> +#define KMEM_LOCKLESS_CACHE_QUEUE_SIZE 64

I would suggest that, to be nice to the percpu allocator, this be
one less than 2^n.

> +struct kmem_lockless_cache {
> +	void *queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE];
> +	unsigned int size;
> +};

I would also suggest that 'size' be first as it is going to be accessed
every time, and then there's a reasonable chance that queue[size - 1] will
be in the same cacheline.  CPUs will tend to handle that better.
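Matthew's two layout suggestions can be sketched in userspace (struct and macro names here are hypothetical, not from the patch) and sanity-checked with offsetof:

```c
#include <assert.h>
#include <stddef.h>

/* 63 = 2^6 - 1, one less than a power of two, as suggested above */
#define QUEUE_SIZE 63

/*
 * 'size' first: it is read on every alloc/free, and for small values of
 * 'size' the hot slots queue[0..6] share its cacheline.
 */
struct lockless_cache {
	unsigned int size;
	void *queue[QUEUE_SIZE];
};
```

With this ordering, popping via queue[--size] touches memory adjacent to the counter itself when the cache is nearly empty.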

> +/**
> + * kmem_cache_alloc_cached - try to allocate from cache without lock
> + * @s: slab cache
> + * @flags: SLAB flags
> + *
> + * Try to allocate from cache without lock. If fails, fill the lockless cache
> + * using bulk alloc API
> + *
> + * Be sure that there's no race condition.
> + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> + *
> + * Return: a pointer to free object on allocation success, NULL on failure.
> + */
> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> +
> +	if (cache->size) /* fastpath without lock */
> +		return cache->queue[--cache->size];
> +
> +	/* slowpath */
> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);

Go back to the Bonwick paper and look at the magazine section again.
You have to allocate _half_ the size of the queue, otherwise you get
into pathological situations where you start to free and allocate
every time.

> +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> +
> +	/* Is there better way to do this? */
> +	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE)
> +		kmem_cache_free(s, cache->queue[--cache->size]);

Yes.

	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
		kmem_cache_free_bulk(s, KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2,
			&cache->queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2]);
		cache->size = KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2;
	}
(check the maths on that; it might have some off-by-one)
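The maths can be checked with a quick userspace sketch (names are hypothetical and the bulk free is stubbed out): filling the queue and freeing one more object should leave QUEUE_SIZE / 2 + 1 entries with no out-of-bounds index.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SIZE 64

struct cache {
	unsigned int size;
	void *queue[QUEUE_SIZE];
};

static unsigned int bulk_freed;	/* counts objects handed back by the stub */

/* stand-in for kmem_cache_free_bulk(): clear the slots and count them */
static void free_bulk_stub(size_t nr, void **p)
{
	for (size_t i = 0; i < nr; i++)
		p[i] = NULL;
	bulk_freed += nr;
}

/* the suggested free path: flush the upper half when full, then push */
static void cache_free(struct cache *c, void *obj)
{
	if (c->size == QUEUE_SIZE) {
		free_bulk_stub(QUEUE_SIZE / 2, &c->queue[QUEUE_SIZE / 2]);
		c->size = QUEUE_SIZE / 2;
	}
	c->queue[c->size++] = obj;	/* always room after the flush */
}

static unsigned int demo(void)
{
	struct cache c = { .size = 0 };
	static int obj;

	while (c.size < QUEUE_SIZE)	/* fill the queue completely */
		c.queue[c.size++] = &obj;
	cache_free(&c, &obj);		/* one more free triggers the half flush */
	return c.size;
}
```

In this sketch the arithmetic comes out clean: 32 objects are bulk-freed, the lower half is kept, and the newly freed pointer lands at index 32.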
Hyeonggon Yoo Sept. 21, 2021, 10:56 a.m. UTC | #2
Hello Matthew,
I love your suggestion to make it more cache-friendly (and to batch
allocations), and it sounds good to me. I appreciate it a lot.

But give me some time (from one to a few days) to read the relevant
papers, and then I'll reply to your suggestions and send v3 with them
in mind.

Cheers,
Hyeonggon
Jens Axboe Sept. 21, 2021, 3:37 p.m. UTC | #3
> @@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
>  }
>  EXPORT_SYMBOL(kmem_cache_create);
>  
> +/**
> + * kmem_cache_alloc_cached - try to allocate from cache without lock
> + * @s: slab cache
> + * @flags: SLAB flags
> + *
> + * Try to allocate from cache without lock. If fails, fill the lockless cache
> + * using bulk alloc API
> + *
> + * Be sure that there's no race condition.
> + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> + *
> + * Return: a pointer to free object on allocation success, NULL on failure.
> + */
> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> +
> +	if (cache->size) /* fastpath without lock */
> +		return cache->queue[--cache->size];
> +
> +	/* slowpath */
> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> +	if (cache->size)
> +		return cache->queue[--cache->size];
> +	else
> +		return NULL;
> +}
> +EXPORT_SYMBOL(kmem_cache_alloc_cached);

How does this work for preempt? You seem to assume that the function is
invoked with preempt disabled, but then it could only be used with
GFP_ATOMIC.

There are basically two types of use cases for this:

1) Freeing can happen from interrupts
2) Freeing cannot happen from interrupts

What I implemented for IOPOLL doesn't need to care about interrupts,
hence preemption disable is enough. But we do need that, at least.

And if you don't care about users that free from irq/softirq, then that
should be mentioned. Locking context should be mentioned, too. The above
may be just fine IFF both alloc and free are protected by a lock higher
up. If not, both need preemption disabled and GFP_ATOMIC. I'd suggest
making the get/put cpu part of the API internally.

> +/**
> + * kmem_cache_free_cached - return object to cache
> + * @s: slab cache
> + * @p: pointer to free
> + */
> +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> +{
> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> +
> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));

Don't use BUG_ON, just do:

	if (WARN_ON_ONCE(!(s->flags & SLAB_LOCKLESS_CACHE))) {
		kmem_cache_free(s, p);
		return;
	}
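Jens's point about making the get/put cpu part of the API internal can be shaped roughly as below (a userspace sketch; pin_cache()/unpin_cache() are hypothetical stand-ins for get_cpu_ptr()/put_cpu_ptr(), and the struct is a plain object rather than a real percpu variable):

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SIZE 8

struct cache {
	unsigned int size;
	void *queue[QUEUE_SIZE];
};

static int pinned;	/* stands in for the preempt count */

static struct cache *pin_cache(struct cache *c) { pinned++; return c; }
static void unpin_cache(void)                   { pinned--; }

/* the preemption guard lives inside the helper, not at every call site */
static void *alloc_cached(struct cache *pcp)
{
	struct cache *c = pin_cache(pcp);
	void *p = c->size ? c->queue[--c->size] : NULL;

	unpin_cache();
	return p;
}

static void free_cached(struct cache *pcp, void *p)
{
	struct cache *c = pin_cache(pcp);

	if (c->size < QUEUE_SIZE)
		c->queue[c->size++] = p;
	unpin_cache();
}

/* free then alloc: the object should round-trip with the guard balanced */
static int demo(void)
{
	struct cache c = { .size = 0 };
	int obj;

	free_cached(&c, &obj);
	return alloc_cached(&c) == &obj && pinned == 0;
}
```

Callers then never touch the percpu pointer directly, so the locking context is a property of the API rather than a rule every user has to remember.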
Hyeonggon Yoo Sept. 21, 2021, 3:42 p.m. UTC | #4
On Mon, Sep 20, 2021 at 11:01:16PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 20, 2021 at 03:48:16PM +0000, Hyeonggon Yoo wrote:
> > +#define KMEM_LOCKLESS_CACHE_QUEUE_SIZE 64
> 
> I would suggest that, to be nice to the percpu allocator, this be
> one less than 2^n.

I think we first need to discuss whether KMEM_LOCKLESS_CACHE_QUEUE_SIZE
should be a constant, a per-kmem_cache value, or a global boot parameter.

But even before that, we need to discuss how to manage the magazine,
because that affects what KMEM_LOCKLESS_CACHE_QUEUE_SIZE should be.

> > +struct kmem_lockless_cache {
> > +	void *queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE];
> > +	unsigned int size;
> > +};
> 
> I would also suggest that 'size' be first as it is going to be accessed
> every time, and then there's a reasonable chance that queue[size - 1] will
> be in the same cacheline.  CPUs will tend to handle that better.

That looks good to me; as you said, there's a chance that 'size' and some
of the queue's elements (hopefully queue[size - 1]) end up in the same
cacheline.

Plus, in v2 I didn't consider cache lines when choosing the position of
kmem_lockless_cache within kmem_cache; I think we need to reconsider
that too.

> > +/**
> > + * kmem_cache_alloc_cached - try to allocate from cache without lock
> > + * @s: slab cache
> > + * @flags: SLAB flags
> > + *
> > + * Try to allocate from cache without lock. If fails, fill the lockless cache
> > + * using bulk alloc API
> > + *
> > + * Be sure that there's no race condition.
> > + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> > + *
> > + * Return: a pointer to free object on allocation success, NULL on failure.
> > + */
> > +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> > +
> > +	if (cache->size) /* fastpath without lock */
> > +		return cache->queue[--cache->size];
> > +
> > +	/* slowpath */
> > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> 
> Go back to the Bonwick paper and look at the magazine section again.
> You have to allocate _half_ the size of the queue, otherwise you get
> into pathological situations where you start to free and allocate
> every time.

I want to ask where the idea of allocating 'half' the size of the queue
came from. The paper you sent does not work with a single queue
(magazine); instead, it manages a pool of magazines.

And after reading the paper, I see that managing a pool of magazines
(where M is a boot parameter) is a valid approach to reduce hitting the
slowpath.

Thanks,
Hyeonggon Yoo

> > +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> > +
> > +	/* Is there better way to do this? */
> > +	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE)
> > +		kmem_cache_free(s, cache->queue[--cache->size]);
> 
> Yes.
> 
> 	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
> 		kmem_cache_free_bulk(s, KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2,
> 			&cache->queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2));
> 		cache->size = KMEM_LOCKLESS_CACHE_QUEUE_SIZE / 2;
> 
> (check the maths on that; it might have some off-by-one)
Matthew Wilcox Sept. 21, 2021, 4:17 p.m. UTC | #5
On Tue, Sep 21, 2021 at 03:42:39PM +0000, Hyeonggon Yoo wrote:
> > > +	/* slowpath */
> > > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> > 
> > Go back to the Bonwick paper and look at the magazine section again.
> > You have to allocate _half_ the size of the queue, otherwise you get
> > into pathological situations where you start to free and allocate
> > every time.
> 
> I want to ask you where idea of allocating 'half' the size of queue came from.
> the paper you sent does not work with single queue(magazine). Instead,
> it manages pool of magazines.
> 
> And after reading the paper, I see managing pool of magazine (where M is
> an boot parameter) is valid approach to reduce hitting slowpath.

Bonwick uses two magazines per cpu; if both are empty, one is replaced
with a full one.  If both are full, one is replaced with an empty one.
Our slab implementation doesn't provide magazine allocation, but it does
provide bulk allocation.  So translating the Bonwick implementation to
our implementation, we need to bulk-allocate or bulk-free half of the
array at any time.
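The pathological thrash Matthew describes can be illustrated with a toy simulation (hypothetical, queue of 8): the v2 policy refills the whole queue on empty and frees a single object on full, so a burst of frees right after a refill hits the slab on nearly every call, while moving half the array at a time amortises the slowpath.

```c
#include <assert.h>

#define N 8

struct sim {
	unsigned int size;
	int slowpath;	/* how many times we had to fall back to the slab */
};

/* v2 behaviour: refill the whole queue when empty, free one object when full */
static void v2_alloc(struct sim *s)
{
	if (s->size == 0) { s->size = N; s->slowpath++; }
	s->size--;
}
static void v2_free(struct sim *s)
{
	if (s->size == N) { s->size--; s->slowpath++; } /* one kmem_cache_free() */
	s->size++;
}

/* Bonwick-style: move half the queue at a time in either direction */
static void half_alloc(struct sim *s)
{
	if (s->size == 0) { s->size = N / 2; s->slowpath++; }
	s->size--;
}
static void half_free(struct sim *s)
{
	if (s->size == N) { s->size = N / 2; s->slowpath++; } /* bulk free N/2 */
	s->size++;
}

/* one alloc to warm up, then a burst of frees: the pattern that thrashes v2 */
static int run(void (*a)(struct sim *), void (*f)(struct sim *))
{
	struct sim s = { 0, 0 };

	a(&s);
	for (int i = 0; i < 100; i++)
		f(&s);
	return s.slowpath;
}
```

With the full refill, every free after the queue fills up is a slowpath call; with half-batching the queue always has N/2 slots of headroom in both directions after any slowpath trip.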
Hyeonggon Yoo Sept. 22, 2021, 8:19 a.m. UTC | #6
On Tue, Sep 21, 2021 at 09:37:40AM -0600, Jens Axboe wrote:
> > @@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> >  }
> >  EXPORT_SYMBOL(kmem_cache_create);
> >  
> > +/**
> > + * kmem_cache_alloc_cached - try to allocate from cache without lock
> > + * @s: slab cache
> > + * @flags: SLAB flags
> > + *
> > + * Try to allocate from cache without lock. If fails, fill the lockless cache
> > + * using bulk alloc API
> > + *
> > + * Be sure that there's no race condition.
> > + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> > + *
> > + * Return: a pointer to free object on allocation success, NULL on failure.
> > + */
> > +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> > +
> > +	if (cache->size) /* fastpath without lock */
> > +		return cache->queue[--cache->size];
> > +
> > +	/* slowpath */
> > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> > +	if (cache->size)
> > +		return cache->queue[--cache->size];
> > +	else
> > +		return NULL;
> > +}
> > +EXPORT_SYMBOL(kmem_cache_alloc_cached);

Hello Jens, I'm so happy that you commented.

> What I implemented for IOPOLL doesn't need to care about interrupts,
> hence preemption disable is enough. But we do need that, at least.

To be honest, that was my mistake. I was mistakenly using the percpu API.
It's a shame :> Thank you for pointing that out.

Fixed it in v3 (work in progress now).

> There are basically two types of use cases for this:
> 
> 1) Freeing can happen from interrupts
> 2) Freeing cannot happen from interrupts
>

I considered only case 2) when writing the code. Well, to support 1),
I think there are two ways:

 a) internally call kmem_cache_free when in_interrupt() is true
 b) the caller must disable interrupts when freeing

I think a) is okay; what do you think?

Note that b) can be problematic with kmem_cache_free_bulk, as its
documentation says interrupts must be enabled.

> How does this work for preempt? You seem to assume that the function is
> invoked with preempt disabled, but then it could only be used with
> GFP_ATOMIC.

I wrote it with the same prototype as kmem_cache_alloc, and the gfpflags
parameter is unnecessary as you said. Okay, I'll remove it in v3.

> And if you don't care about users that free from irq/softirq, then that
> should be mentioned. Locking context should be mentioned, too. The above
> may be just fine IFF both alloc and free are protected by a lock higher
> up. If not, both need preemption disabled and GFP_ATOMIC. I'd suggest
> making the get/put cpu part of the API internally.

Actually I didn't put much effort into documentation (especially on what
context is expected before calling them).

The comments will be updated in v3, with your comments in mind.

> > +/**
> > + * kmem_cache_free_cached - return object to cache
> > + * @s: slab cache
> > + * @p: pointer to free
> > + */
> > +void kmem_cache_free_cached(struct kmem_cache *s, void *p)
> > +{
> > +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> > +
> > +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> 
> Don't use BUG_ON, just do:
> 
> 	if (WARN_ON_ONCE(!(s->flags & SLAB_LOCKLESS_CACHE))) {
> 		kmem_cache_free(s, p);
> 		return;
> 	}
>

Ok. I agree WARN is better than BUG.

Thanks,
Hyeonggon Yoo

> -- 
> Jens Axboe
>
Hyeonggon Yoo Sept. 22, 2021, 8:32 a.m. UTC | #7
Hello Matthew.
There's good news.

In v3 (work in progress now) I fixed some bugs (I hate kernel panics!)
and, for testing, made NAPI use it. It works pretty well.

On Tue, Sep 21, 2021 at 05:17:02PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 21, 2021 at 03:42:39PM +0000, Hyeonggon Yoo wrote:
> > > > +	/* slowpath */
> > > > +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> > > > +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> > > 
> > > Go back to the Bonwick paper and look at the magazine section again.
> > > You have to allocate _half_ the size of the queue, otherwise you get
> > > into pathological situations where you start to free and allocate
> > > every time.
> > 
> > I want to ask you where idea of allocating 'half' the size of queue came from.
> > the paper you sent does not work with single queue(magazine). Instead,
> > it manages pool of magazines.
> > 
> > And after reading the paper, I see managing pool of magazine (where M is
> > an boot parameter) is valid approach to reduce hitting slowpath.
> 
> Bonwick uses two magazines per cpu; if both are empty, one is replaced
> with a full one.  If both are full, one is replaced with an empty one.
> Our slab implementation doesn't provide magazine allocation, but it does
> provide bulk allocation.
> So translating the Bonwick implementation to
> our implementation, we need to bulk-allocate or bulk-free half of the
> array at any time.

Is there a reason that the number should be 'half'?

what about something like this:

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 884d3311cd8e..f32736302d53 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -455,12 +455,13 @@ void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
        }

        cache = get_cpu_ptr(s->cache);
-       if (cache->size) /* fastpath without lock */
+       if (cache->size) /* fastpath without lock */
                p = cache->queue[--cache->size];
        else {
                /* slowpath */
-               cache->size = kmem_cache_alloc_bulk(s, gfpflags,
-                               KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
+               cache->size += kmem_cache_alloc_bulk(s, gfpflags,
+                               KMEM_LOCKLESS_CACHE_BATCHCOUNT,
+                               cache->queue);
                if (cache->size)
                        p = cache->queue[--cache->size];
                else
@@ -491,13 +492,13 @@ void kmem_cache_free_cached(struct kmem_cache *s, void *p)
        cache = get_cpu_ptr(s->cache);
        if (cache->size < KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
                cache->queue[cache->size++] = p;
-               put_cpu_ptr(s->cache);
-               return ;
+       } else {
+               kmem_cache_free_bulk(s,
+                               KMEM_LOCKLESS_CACHE_BATCHCOUNT,
+                               cache->queue - KMEM_LOCKLESS_CACHE_BATCHCOUNT);
+               cache->size -= KMEM_LOCKLESS_CACHE_BATCHCOUNT;
        }
        put_cpu_ptr(s->cache);
-
-       /* Is there better way to do this? */
-       kmem_cache_free(s, p);
 }
 EXPORT_SYMBOL(kmem_cache_free_cached);
Hyeonggon Yoo Sept. 22, 2021, 9:11 a.m. UTC | #8
> @@ -491,13 +492,13 @@ void kmem_cache_free_cached(struct kmem_cache *s, void *p)
>         cache = get_cpu_ptr(s->cache);
>         if (cache->size < KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
>                 cache->queue[cache->size++] = p;
> -               put_cpu_ptr(s->cache);
> -               return ;
> +       } else {
> +               kmem_cache_free_bulk(s,
> +                               KMEM_LOCKLESS_CACHE_BATCHCOUNT,
> +                               cache->queue - KMEM_LOCKLESS_CACHE_BATCHCOUNT);
> +               cache->size -= KMEM_LOCKLESS_CACHE_BATCHCOUNT;
>         }
>         put_cpu_ptr(s->cache);
> -
> -       /* Is there better way to do this? */
> -       kmem_cache_free(s, p);
>  }
>  EXPORT_SYMBOL(kmem_cache_free_cached);

Sorry, I sent you the wrong code.

The above was buggy code from some hours ago,
because of cache->queue - KMEM_LOCKLESS_CACHE_BATCHCOUNT.

It should now be:

	cache = get_cpu_ptr(s->cache);
	if (cache->size < KMEM_LOCKLESS_CACHE_QUEUE_SIZE) {
		cache->queue[cache->size++] = p;
	} else {
		kmem_cache_free_bulk(s,
				KMEM_LOCKLESS_CACHE_BATCHCOUNT,
				cache->queue + KMEM_LOCKLESS_CACHE_QUEUE_SIZE
				- KMEM_LOCKLESS_CACHE_BATCHCOUNT);
		cache->size -= KMEM_LOCKLESS_CACHE_BATCHCOUNT;
	}
	put_cpu_ptr(s->cache);
Jens Axboe Sept. 22, 2021, 12:58 p.m. UTC | #9
On 9/22/21 2:19 AM, Hyeonggon Yoo wrote:
> On Tue, Sep 21, 2021 at 09:37:40AM -0600, Jens Axboe wrote:
>>> @@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
>>>  }
>>>  EXPORT_SYMBOL(kmem_cache_create);
>>>  
>>> +/**
>>> + * kmem_cache_alloc_cached - try to allocate from cache without lock
>>> + * @s: slab cache
>>> + * @flags: SLAB flags
>>> + *
>>> + * Try to allocate from cache without lock. If fails, fill the lockless cache
>>> + * using bulk alloc API
>>> + *
>>> + * Be sure that there's no race condition.
>>> + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
>>> + *
>>> + * Return: a pointer to free object on allocation success, NULL on failure.
>>> + */
>>> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
>>> +{
>>> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
>>> +
>>> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
>>> +
>>> +	if (cache->size) /* fastpath without lock */
>>> +		return cache->queue[--cache->size];
>>> +
>>> +	/* slowpath */
>>> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
>>> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
>>> +	if (cache->size)
>>> +		return cache->queue[--cache->size];
>>> +	else
>>> +		return NULL;
>>> +}
>>> +EXPORT_SYMBOL(kmem_cache_alloc_cached);
> 
> Hello Jens, I'm so happy that you gave comment.
> 
>> What I implemented for IOPOLL doesn't need to care about interrupts,
>> hence preemption disable is enough. But we do need that, at least.
> 
> To be honest, that was my mistake. I was mistakenly using percpu API.
> it's a shame :> Thank you for pointing that.
> 
> Fixed it in v3 (work in progress now)

Another thing to fix from there, just make it:

if (cache->size)
	return cached item
return NULL;

No need for an if/else with the return.

> 
>> There are basically two types of use cases for this:
>>
>> 1) Freeing can happen from interrupts
>> 2) Freeing cannot happen from interrupts
>>
> 
> I considered only case 2) when writing code. Well, To support 1),
> I think there are two ways:
> 
>  a) internally call kmem_cache_free when in_interrupt() is true
>  b) caller must disable interrupt when freeing
> 
> I think a) is okay, how do you think?

If the API doesn't support freeing from interrupts, then I'd make that
the rule. Caller should know better if that can happen, and then just
use kmem_cache_free() if in a problematic context. That avoids polluting
the fast path with that check. I'd still make it a WARN_ON_ONCE() as
described and it can get removed later, hopefully.

But note it's not just the freeing side that would be problematic. If
you do support from-irq freeing, then the alloc side would need
local_irq_save/restore() instead of just basic preempt protection. That
would make it more expensive all around.

> note that b) can be problematic with kmem_cache_free_bulk
> as it says interrupts must be enabled.

Not sure that's actually true, apart from being bitrot.

>> How does this work for preempt? You seem to assume that the function is
>> invoked with preempt disabled, but then it could only be used with
>> GFP_ATOMIC.
> 
> I wrote it just same prototype with kmem_cache_alloc, and the gfpflags
> parameter is unnecessary as you said. Okay, let's remove it in v3.

Please also run some actual comparative benchmarks on this, with a
real workload. You also need an internal user of this; a stand-alone
feature isn't really useful. It needs someone using it, and the
benchmarks would be based on that (or those) use cases.

Another consideration - is it going to be OK to ignore slab pruning for
this? Probably. But it needs to be thought about.

In terms of API, not sure why these are separate functions. Creating the
cache with SLAB_FOO should be enough, and then kmem_cache_alloc() tries
the cache first. If nothing there, fallback to the regular path.
Something like this would only be used for cases that have a high
alloc+free rate, would seem like a shame to need two function calls for
the case where you just want a piece of memory and the cache is empty.
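Jens's single-entry-point idea (create the cache with the flag, try the percpu array first, silently fall back to the regular path) might look like this userspace sketch; all names are hypothetical and the slab path is stubbed with malloc/free:

```c
#include <assert.h>
#include <stdlib.h>

#define QUEUE_SIZE 8
#define FLAG_LOCKLESS 0x1	/* stand-in for a SLAB_* creation flag */

struct cache {
	unsigned int flags;
	unsigned int size;
	void *queue[QUEUE_SIZE];
};

static int slowpath_hits;

/* stand-in for the regular kmem_cache_alloc() slab path */
static void *slab_alloc_stub(void)
{
	slowpath_hits++;
	return malloc(16);
}

/* one entry point: try the percpu array first, fall back transparently */
static void *cache_alloc(struct cache *c)
{
	if ((c->flags & FLAG_LOCKLESS) && c->size)
		return c->queue[--c->size];
	return slab_alloc_stub();
}

static void cache_free(struct cache *c, void *p)
{
	if ((c->flags & FLAG_LOCKLESS) && c->size < QUEUE_SIZE) {
		c->queue[c->size++] = p;
		return;
	}
	free(p);	/* stand-in for kmem_cache_free() */
}

/* alloc, free, alloc again: the second alloc is served from the array */
static int demo(void)
{
	struct cache c = { .flags = FLAG_LOCKLESS, .size = 0 };
	void *p = cache_alloc(&c);	/* empty array -> slowpath */
	void *q;

	cache_free(&c, p);		/* cached, not returned to slab */
	q = cache_alloc(&c);		/* same pointer comes straight back */
	free(q);
	return p == q && slowpath_hits == 1;
}
```

Callers that just want memory then pay for exactly one function call, and the empty-cache case degrades to today's behaviour instead of requiring a second API.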
Hyeonggon Yoo Sept. 23, 2021, 3:34 a.m. UTC | #10
On Wed, Sep 22, 2021 at 06:58:00AM -0600, Jens Axboe wrote:
> On 9/22/21 2:19 AM, Hyeonggon Yoo wrote:
> > On Tue, Sep 21, 2021 at 09:37:40AM -0600, Jens Axboe wrote:
> >>> @@ -424,6 +431,57 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
> >>>  }
> >>>  EXPORT_SYMBOL(kmem_cache_create);
> >>>  
> >>> +/**
> >>> + * kmem_cache_alloc_cached - try to allocate from cache without lock
> >>> + * @s: slab cache
> >>> + * @flags: SLAB flags
> >>> + *
> >>> + * Try to allocate from cache without lock. If fails, fill the lockless cache
> >>> + * using bulk alloc API
> >>> + *
> >>> + * Be sure that there's no race condition.
> >>> + * Must create slab cache with SLAB_LOCKLESS_CACHE flag to use this function.
> >>> + *
> >>> + * Return: a pointer to free object on allocation success, NULL on failure.
> >>> + */
> >>> +void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
> >>> +{
> >>> +	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
> >>> +
> >>> +	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
> >>> +
> >>> +	if (cache->size) /* fastpath without lock */
> >>> +		return cache->queue[--cache->size];
> >>> +
> >>> +	/* slowpath */
> >>> +	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
> >>> +			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
> >>> +	if (cache->size)
> >>> +		return cache->queue[--cache->size];
> >>> +	else
> >>> +		return NULL;
> >>> +}
> >>> +EXPORT_SYMBOL(kmem_cache_alloc_cached);
> > 
> > Hello Jens, I'm so happy that you gave comment.
> > 
> >> What I implemented for IOPOLL doesn't need to care about interrupts,
> >> hence preemption disable is enough. But we do need that, at least.
> > 
> > To be honest, that was my mistake. I was mistakenly using percpu API.
> > it's a shame :> Thank you for pointing that.
> > 
> > Fixed it in v3 (work in progress now)

Hello Jens, thank you so much for reviewing this. Your thoughtful review
helps a lot, and I feel we're going in the right direction.

Anyway, let's start the discussion again.

> 
> Another thing to fix from there, just make it:
> 
> if (cache->size)
> 	return cached item
> return NULL;
> 
> No need for an if/else with the return.

That looks better. I'll consider that in next version.

> >> How does this work for preempt? You seem to assume that the function is
> >> invoked with preempt disabled, but then it could only be used with
> >> GFP_ATOMIC.
> > 
> > I wrote it just same prototype with kmem_cache_alloc, and the gfpflags
> > parameter is unnecessary as you said. Okay, let's remove it in v3.
> 
> Please also run some actual comparitative benchmarks on this, with a
> real workload. You also need an internal user of this, a stand-alone
> feature isn't really useful. It needs someone using it, and the
> benchmarks would be based on that (or those) use cases.
>

Yeah, that's an important point too. As my project affects performance,
I should measure the impact. And as you said, to measure the impact, I
need use cases. For testing, I made NAPI (network-layer I/O polling)
use this; it may help with measuring.

So, okay, I'll show its impact on performance as the work progresses.

> Another consideration - is it going to be OK to ignore slab pruning for
> this? Probably. But it needs to be thought about.
>

I think I should consider slab pruning too. That's one of the benefits
of using the slab allocator. I'll add this to the TODOs.

> In terms of API, not sure why these are separate functions. Creating the
> cache with SLAB_FOO should be enough, and then kmem_cache_alloc() tries
> the cache first. If nothing there, fallback to the regular path.
> Something like this would only be used for cases that have a high
> alloc+free rate, would seem like a shame to need two function calls for
> the case where you just want a piece of memory and the cache is empty.
> 

It would be wonderful if we could do that, and I'm working on it.
I'm testing this in my private repository and looking at what happens;
it seems there are some issues to solve.

> >> There are basically two types of use cases for this:
> >>
> >> 1) Freeing can happen from interrupts
> >> 2) Freeing cannot happen from interrupts
> >>
> > 
> > I considered only case 2) when writing code. Well, To support 1),
> > I think there are two ways:
> > 
> >  a) internally call kmem_cache_free when in_interrupt() is true
> >  b) caller must disable interrupt when freeing
> > 
> > I think a) is okay, how do you think?
> 
> If the API doesn't support freeing from interrupts, then I'd make that
> the rule. Caller should know better if that can happen, and then just
> use kmem_cache_free() if in a problematic context. That avoids polluting
> the fast path with that check. I'd still make it a WARN_ON_ONCE() as
> described and it can get removed later, hopefully.

Hmm.. As you mentioned above, to provide a unified API we would have to
check the context in kmem_cache_alloc anyway. But yes, it is better not
to allow calling kmem_cache_{alloc,free}_cached from interrupts.

> But note it's not just the freeing side that would be problematic. If
> you do support from-irq freeing, then the alloc side would need
> local_irq_save/restore() instead of just basic preempt protection. That
> would make it more expensive all around

I think so too. One reason for this work is to reduce the cost of
disabling interrupts; that's costly.

> > note that b) can be problematic with kmem_cache_free_bulk
> > as it says interrupts must be enabled.
> 
> Not sure that's actually true, apart from being bitrot.

I don't get what you mean.
Can you explain it in more detail?

(Yeah, maybe I'm misunderstanding kmem_cache_{alloc,free}_bulk.)

Thanks,
Hyeonggon Yoo

> -- 
> Jens Axboe
>
Hyeonggon Yoo Sept. 23, 2021, 3:55 a.m. UTC | #11
Hello there!

In v1 and v2, I showed a simple proof of concept of the lockless cache.
After some discussions, it turned out that there are some issues to
solve, and it will take some time to solve them.

I made git repository on github to share its progress:
	https://github.com/hygoni/linux.git

This is based on v5.15-rc2, and the main branch will be
'lockless_cache'. There's *nothing* in the repo now; it will be
updated irregularly, hopefully every 1-3 days or every week.
(Note that I'm not a full-time kernel developer; it's a hobby.)

I'll make use of 'issues' and 'discussions' tab (github feature to track
issues and discussions).

You can join if you're interested.
Thank you for your interest in this project!

--
Hyeonggon
Jakub Kicinski Sept. 23, 2021, 1:28 p.m. UTC | #12
On Wed, 22 Sep 2021 06:58:00 -0600 Jens Axboe wrote:
> > I considered only case 2) when writing code. Well, To support 1),
> > I think there are two ways:
> > 
> >  a) internally call kmem_cache_free when in_interrupt() is true
> >  b) caller must disable interrupt when freeing
> > 
> > I think a) is okay, how do you think?  
> 
> If the API doesn't support freeing from interrupts, then I'd make that
> the rule. Caller should know better if that can happen, and then just
> use kmem_cache_free() if in a problematic context. That avoids polluting
> the fast path with that check. I'd still make it a WARN_ON_ONCE() as
> described and it can get removed later, hopefully.

Shooting from the hip a little but if I'm getting the context right
this is all very similar to the skb cache so lockdep_assert_in_softirq()
may be useful:

/*
 * Acceptable for protecting per-CPU resources accessed from BH.
 * Much like in_softirq() - semantics are ambiguous, use carefully.
 */
#define lockdep_assert_in_softirq()					\
do {									\
	WARN_ON_ONCE(__lockdep_enabled			&&		\
		     (!in_softirq() || in_irq() || in_nmi()));		\
} while (0)

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 083f3ce550bc..091f514dc8e0 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -120,6 +120,9 @@ 
 /* Slab deactivation flag */
 #define SLAB_DEACTIVATED	((slab_flags_t __force)0x10000000U)
 
+/* use percpu lockless cache */
+#define SLAB_LOCKLESS_CACHE	((slab_flags_t __force)0x20000000U)
+
 /*
  * ZERO_SIZE_PTR will be returned for zero sized kmalloc requests.
  *
@@ -327,6 +330,13 @@  enum kmalloc_cache_type {
 	NR_KMALLOC_TYPES
 };
 
+#define KMEM_LOCKLESS_CACHE_QUEUE_SIZE 64
+
+struct kmem_lockless_cache {
+	void *queue[KMEM_LOCKLESS_CACHE_QUEUE_SIZE];
+	unsigned int size;
+};
+
 #ifndef CONFIG_SLOB
 extern struct kmem_cache *
 kmalloc_caches[NR_KMALLOC_TYPES][KMALLOC_SHIFT_HIGH + 1];
@@ -429,6 +439,19 @@  void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
 void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags) __assume_slab_alignment __malloc;
 void kmem_cache_free(struct kmem_cache *, void *);
 
+#ifndef CONFIG_SLOB
+
+void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags);
+void kmem_cache_free_cached(struct kmem_cache *s, void *p);
+
+#else
+
+#define kmem_cache_alloc_cached kmem_cache_alloc
+#define kmem_cache_free_cached kmem_cache_free
+
+#endif /* CONFIG_SLOB */
+
+
 /*
  * Bulk allocation and freeing operations. These are accelerated in an
  * allocator specific way to avoid taking locks repeatedly or building
diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 3aa5e1e73ab6..9f3161f38a8a 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -85,6 +85,8 @@  struct kmem_cache {
 	unsigned int usersize;		/* Usercopy region size */
 
 	struct kmem_cache_node *node[MAX_NUMNODES];
+
+	struct kmem_lockless_cache __percpu *cache; /* percpu lockless cache */
 };
 
 static inline void *nearest_obj(struct kmem_cache *cache, struct page *page,
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 85499f0586b0..1dc3527efba8 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -96,6 +96,7 @@  struct kmem_cache {
 	unsigned int object_size;/* The size of an object without metadata */
 	struct reciprocal_value reciprocal_size;
 	unsigned int offset;	/* Free pointer offset */
+	struct kmem_lockless_cache __percpu *cache; /* percpu lockless cache */
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
 	unsigned int cpu_partial;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ec2bb0beed75..5b8e4d5a644d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -262,6 +262,13 @@  static struct kmem_cache *create_cache(const char *name,
 	s->useroffset = useroffset;
 	s->usersize = usersize;
 
+	if (flags & SLAB_LOCKLESS_CACHE) {
+		s->cache = alloc_percpu(struct kmem_lockless_cache);
+		if (!s->cache)
+			goto out_free_cache;
+		s->cache->size = 0;
+	}
+
 	err = __kmem_cache_create(s, flags);
 	if (err)
 		goto out_free_cache;
@@ -424,6 +431,57 @@  kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
+/**
+ * kmem_cache_alloc_cached - try to allocate from cache without lock
+ * @s: slab cache
+ * @gfpflags: GFP flags
+ *
+ * Try to allocate from the cache without taking a lock. If the cache is
+ * empty, refill it using the bulk alloc API.
+ *
+ * The caller must ensure there is no concurrent access to the per-cpu queue.
+ * The slab cache must be created with the SLAB_LOCKLESS_CACHE flag.
+ *
+ * Return: a pointer to free object on allocation success, NULL on failure.
+ */
+void *kmem_cache_alloc_cached(struct kmem_cache *s, gfp_t gfpflags)
+{
+	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
+
+	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
+
+	if (cache->size) /* fastpath without lock */
+		return cache->queue[--cache->size];
+
+	/* slowpath */
+	cache->size = kmem_cache_alloc_bulk(s, gfpflags,
+			KMEM_LOCKLESS_CACHE_QUEUE_SIZE, cache->queue);
+	if (cache->size)
+		return cache->queue[--cache->size];
+	else
+		return NULL;
+}
+EXPORT_SYMBOL(kmem_cache_alloc_cached);
+
+/**
+ * kmem_cache_free_cached - return object to cache
+ * @s: slab cache
+ * @p: pointer to free
+ */
+void kmem_cache_free_cached(struct kmem_cache *s, void *p)
+{
+	struct kmem_lockless_cache *cache = this_cpu_ptr(s->cache);
+
+	BUG_ON(!(s->flags & SLAB_LOCKLESS_CACHE));
+
+	/* Is there a better way to do this? */
+	if (cache->size == KMEM_LOCKLESS_CACHE_QUEUE_SIZE)
+		kmem_cache_free(s, cache->queue[--cache->size]);
+
+	cache->queue[cache->size++] = p;
+}
+EXPORT_SYMBOL(kmem_cache_free_cached);
+
 static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
 {
 	LIST_HEAD(to_destroy);
@@ -460,6 +518,8 @@  static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work)
 
 static int shutdown_cache(struct kmem_cache *s)
 {
+	struct kmem_lockless_cache *cache;
+
 	/* free asan quarantined objects */
 	kasan_cache_shutdown(s);
 
@@ -468,6 +528,12 @@  static int shutdown_cache(struct kmem_cache *s)
 
 	list_del(&s->list);
 
+	if (s->flags & SLAB_LOCKLESS_CACHE) {
+		cache = this_cpu_ptr(s->cache);
+		kmem_cache_free_bulk(s, cache->size, cache->queue);
+		free_percpu(s->cache);
+	}
+
 	if (s->flags & SLAB_TYPESAFE_BY_RCU) {
 #ifdef SLAB_SUPPORTS_SYSFS
 		sysfs_slab_unlink(s);