diff mbox series

[v5,4/6] mm/slab: Introduce kmem_buckets_create() and family

Message ID 20240619193357.1333772-4-kees@kernel.org (mailing list archive)
State Superseded
Headers show
Series slab: Introduce dedicated bucket allocator | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Kees Cook June 19, 2024, 7:33 p.m. UTC
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.

This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in same cache.

While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations that are directly user
controlled. Addressing these cases is limited in scope, so isolating these
kinds of interfaces will not become an unbounded game of whack-a-mole. For
example, many pass through memdup_user(), making isolation there very
effective.

In order to isolate user-controllable dynamically-sized
allocations from the common system kmalloc allocations, introduce
kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
kmem_buckets_alloc_track_caller() for where caller tracking is
needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
is needed.

This can also be used in the future to extend allocation profiling's use
of code tagging to implement per-caller allocation cache isolation[1]
even for dynamic allocations.

Memory allocation pinning[2] is still needed to plug the Use-After-Free
cross-allocator weakness, but that is an existing and separate issue
which is complementary to this improvement. Development continues for
that feature via the SLAB_VIRTUAL[3] series (which could also provide
guard pages -- another complementary improvement).

Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
Signed-off-by: Kees Cook <kees@kernel.org>
---
 include/linux/slab.h | 13 ++++++++
 mm/slab_common.c     | 78 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

Comments

Vlastimil Babka June 20, 2024, 1:56 p.m. UTC | #1
On 6/19/24 9:33 PM, Kees Cook wrote:
> Dedicated caches are available for fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
> 
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in same cache.
> 
> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> against these kinds of "type confusion" attacks, including for fixed
> same-size heap objects, we can create a complementary deterministic
> defense for dynamically sized allocations that are directly user
> controlled. Addressing these cases is limited in scope, so isolating these
> kinds of interfaces will not become an unbounded game of whack-a-mole. For
> example, many pass through memdup_user(), making isolation there very
> effective.
> 
> In order to isolate user-controllable dynamically-sized
> allocations from the common system kmalloc allocations, introduce
> kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
> kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
> kmem_buckets_alloc_track_caller() for where caller tracking is
> needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
> is needed.
> 
> This can also be used in the future to extend allocation profiling's use
> of code tagging to implement per-caller allocation cache isolation[1]
> even for dynamic allocations.
> 
> Memory allocation pinning[2] is still needed to plug the Use-After-Free
> cross-allocator weakness, but that is an existing and separate issue
> which is complementary to this improvement. Development continues for
> that feature via the SLAB_VIRTUAL[3] series (which could also provide
> guard pages -- another complementary improvement).
> 
> Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
> Signed-off-by: Kees Cook <kees@kernel.org>
> ---
>  include/linux/slab.h | 13 ++++++++
>  mm/slab_common.c     | 78 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 91 insertions(+)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 8d0800c7579a..3698b15b6138 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -549,6 +549,11 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>  
>  void kmem_cache_free(struct kmem_cache *s, void *objp);
>  
> +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> +				  slab_flags_t flags,
> +				  unsigned int useroffset, unsigned int usersize,
> +				  void (*ctor)(void *));

I'd drop the ctor, I can't imagine how it would be used with variable-sized
allocations. Probably also "align" doesn't make much sense since we're just
copying the kmalloc cache sizes and its implicit alignment of any
power-of-two allocations. I don't think any current kmalloc user would
suddenly need either of those as you convert it to buckets, and definitely
not any user converted automatically by the code tagging.
Kees Cook June 20, 2024, 6:54 p.m. UTC | #2
On Thu, Jun 20, 2024 at 03:56:27PM +0200, Vlastimil Babka wrote:
> On 6/19/24 9:33 PM, Kees Cook wrote:
> > Dedicated caches are available for fixed size allocations via
> > kmem_cache_alloc(), but for dynamically sized allocations there is only
> > the global kmalloc API's set of buckets available. This means it isn't
> > possible to separate specific sets of dynamically sized allocations into
> > a separate collection of caches.
> > 
> > This leads to a use-after-free exploitation weakness in the Linux
> > kernel since many heap memory spraying/grooming attacks depend on using
> > userspace-controllable dynamically sized allocations to collide with
> > fixed size allocations that end up in same cache.
> > 
> > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> > against these kinds of "type confusion" attacks, including for fixed
> > same-size heap objects, we can create a complementary deterministic
> > defense for dynamically sized allocations that are directly user
> > controlled. Addressing these cases is limited in scope, so isolating these
> > kinds of interfaces will not become an unbounded game of whack-a-mole. For
> > example, many pass through memdup_user(), making isolation there very
> > effective.
> > 
> > In order to isolate user-controllable dynamically-sized
> > allocations from the common system kmalloc allocations, introduce
> > kmem_buckets_create(), which behaves like kmem_cache_create(). Introduce
> > kmem_buckets_alloc(), which behaves like kmem_cache_alloc(). Introduce
> > kmem_buckets_alloc_track_caller() for where caller tracking is
> > needed. Introduce kmem_buckets_valloc() for cases where vmalloc fallback
> > is needed.
> > 
> > This can also be used in the future to extend allocation profiling's use
> > of code tagging to implement per-caller allocation cache isolation[1]
> > even for dynamic allocations.
> > 
> > Memory allocation pinning[2] is still needed to plug the Use-After-Free
> > cross-allocator weakness, but that is an existing and separate issue
> > which is complementary to this improvement. Development continues for
> > that feature via the SLAB_VIRTUAL[3] series (which could also provide
> > guard pages -- another complementary improvement).
> > 
> > Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
> > Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> > Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]
> > Signed-off-by: Kees Cook <kees@kernel.org>
> > ---
> >  include/linux/slab.h | 13 ++++++++
> >  mm/slab_common.c     | 78 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 91 insertions(+)
> > 
> > diff --git a/include/linux/slab.h b/include/linux/slab.h
> > index 8d0800c7579a..3698b15b6138 100644
> > --- a/include/linux/slab.h
> > +++ b/include/linux/slab.h
> > @@ -549,6 +549,11 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
> >  
> >  void kmem_cache_free(struct kmem_cache *s, void *objp);
> >  
> > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> > +				  slab_flags_t flags,
> > +				  unsigned int useroffset, unsigned int usersize,
> > +				  void (*ctor)(void *));
> 
> I'd drop the ctor, I can't imagine how it would be used with variable-sized
> allocations.

I've kept it because for "kmalloc wrapper" APIs, e.g. devm_kmalloc(),
there is some "housekeeping" that gets done explicitly right now that I
think would be better served by using a ctor in the future. These APIs
are variable-sized, but have a fixed size header, so they have a
"minimum size" that the ctor can still operate on, etc.

> Probably also "align" doesn't make much sense since we're just
> copying the kmalloc cache sizes and its implicit alignment of any
> power-of-two allocations.

Yeah, that's probably true. I kept it since I wanted to mirror
kmem_cache_create() to make this API more familiar looking.

> I don't think any current kmalloc user would
> suddenly need either of those as you convert it to buckets, and definitely
> not any user converted automatically by the code tagging.

Right, it's not needed for either the explicit users nor the future
automatic users. But since these arguments are available internally,
there seems to be future utility,  it's not fast path, and it made things
feel like the existing API, I'd kind of like to keep it.

But all that said, if you really don't want it, then sure I can drop
those arguments. Adding them back in the future shouldn't be too
much churn.
Vlastimil Babka June 20, 2024, 8:43 p.m. UTC | #3
On 6/20/24 8:54 PM, Kees Cook wrote:
> On Thu, Jun 20, 2024 at 03:56:27PM +0200, Vlastimil Babka wrote:
>> > @@ -549,6 +549,11 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>> >  
>> >  void kmem_cache_free(struct kmem_cache *s, void *objp);
>> >  
>> > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
>> > +				  slab_flags_t flags,
>> > +				  unsigned int useroffset, unsigned int usersize,
>> > +				  void (*ctor)(void *));
>> 
>> I'd drop the ctor, I can't imagine how it would be used with variable-sized
>> allocations.
> 
> I've kept it because for "kmalloc wrapper" APIs, e.g. devm_kmalloc(),
> there is some "housekeeping" that gets done explicitly right now that I
> think would be better served by using a ctor in the future. These APIs
> are variable-sized, but have a fixed size header, so they have a
> "minimum size" that the ctor can still operate on, etc.
> 
>> Probably also "align" doesn't make much sense since we're just
>> copying the kmalloc cache sizes and its implicit alignment of any
>> power-of-two allocations.
> 
> Yeah, that's probably true. I kept it since I wanted to mirror
> kmem_cache_create() to make this API more familiar looking.

Rust people were asking about kmalloc alignment (but I forgot the details)
so maybe this could be useful for them? CC rust-for-linux.

>> I don't think any current kmalloc user would
>> suddenly need either of those as you convert it to buckets, and definitely
>> not any user converted automatically by the code tagging.
> 
> Right, it's not needed for either the explicit users nor the future
> automatic users. But since these arguments are available internally,
> there seems to be future utility,  it's not fast path, and it made things
> feel like the existing API, I'd kind of like to keep it.
> 
> But all that said, if you really don't want it, then sure I can drop
> those arguments. Adding them back in the future shouldn't be too
> much churn.

I guess we can keep it then.
Andi Kleen June 20, 2024, 10:48 p.m. UTC | #4
Kees Cook <kees@kernel.org> writes:

> Dedicated caches are available for fixed size allocations via
> kmem_cache_alloc(), but for dynamically sized allocations there is only
> the global kmalloc API's set of buckets available. This means it isn't
> possible to separate specific sets of dynamically sized allocations into
> a separate collection of caches.
>
> This leads to a use-after-free exploitation weakness in the Linux
> kernel since many heap memory spraying/grooming attacks depend on using
> userspace-controllable dynamically sized allocations to collide with
> fixed size allocations that end up in same cache.
>
> While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> against these kinds of "type confusion" attacks, including for fixed
> same-size heap objects, we can create a complementary deterministic
> defense for dynamically sized allocations that are directly user
> controlled. Addressing these cases is limited in scope, so isolating these
> kinds of interfaces will not become an unbounded game of whack-a-mole. For
> example, many pass through memdup_user(), making isolation there very
> effective.

Isn't the attack still possible if the attacker can free the slab page
during the use-after-free period with enough memory pressure?

Someone else might grab the page that was in the bucket for another slab
and the type confusion could hurt again.

Or is there some other defense against that, other than
CONFIG_DEBUG_PAGEALLOC or full slab poisoning? And how expensive
does it get when any of those are enabled?

I remember reading some paper about a apple allocator trying similar
techniques and it tried very hard to never reuse memory (probably
not a good idea for Linux though)

I assume you thought about this, but it would be good to discuss such
limitations and interactions in the commit log.

-Andi
Kees Cook June 20, 2024, 11:29 p.m. UTC | #5
On Thu, Jun 20, 2024 at 03:48:24PM -0700, Andi Kleen wrote:
> Kees Cook <kees@kernel.org> writes:
> 
> > Dedicated caches are available for fixed size allocations via
> > kmem_cache_alloc(), but for dynamically sized allocations there is only
> > the global kmalloc API's set of buckets available. This means it isn't
> > possible to separate specific sets of dynamically sized allocations into
> > a separate collection of caches.
> >
> > This leads to a use-after-free exploitation weakness in the Linux
> > kernel since many heap memory spraying/grooming attacks depend on using
> > userspace-controllable dynamically sized allocations to collide with
> > fixed size allocations that end up in same cache.
> >
> > While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
> > against these kinds of "type confusion" attacks, including for fixed
> > same-size heap objects, we can create a complementary deterministic
> > defense for dynamically sized allocations that are directly user
> > controlled. Addressing these cases is limited in scope, so isolating these
> > kinds of interfaces will not become an unbounded game of whack-a-mole. For
> > example, many pass through memdup_user(), making isolation there very
> > effective.
> 
> Isn't the attack still possible if the attacker can free the slab page
> during the use-after-free period with enough memory pressure?
> 
> Someone else might grab the page that was in the bucket for another slab
> and the type confusion could hurt again.
> 
> Or is there some other defense against that, other than
> CONFIG_DEBUG_PAGEALLOC or full slab poisoning? And how expensive
> does it get when any of those are enabled?
> 
> I remember reading some paper about a apple allocator trying similar
> techniques and it tried very hard to never reuse memory (probably
> not a good idea for Linux though)
> 
> I assume you thought about this, but it would be good to discuss such
> limitations and interactions in the commit log.

Yup! It's in there; it's just after what you quoted above. Here it is:

> > Memory allocation pinning[2] is still needed to plug the Use-After-Free
> > cross-allocator weakness, but that is an existing and separate issue
> > which is complementary to this improvement. Development continues for
> > that feature via the SLAB_VIRTUAL[3] series (which could also provide
> > guard pages -- another complementary improvement).
> > [...]
> > Link: https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html [2]
> > Link: https://lore.kernel.org/lkml/20230915105933.495735-1-matteorizzo@google.com/ [3]

Let me know if you think this description needs to be improved...

-Kees
Boqun Feng June 28, 2024, 5:35 a.m. UTC | #6
On Thu, Jun 20, 2024 at 10:43:39PM +0200, Vlastimil Babka wrote:
> On 6/20/24 8:54 PM, Kees Cook wrote:
> > On Thu, Jun 20, 2024 at 03:56:27PM +0200, Vlastimil Babka wrote:
> >> > @@ -549,6 +549,11 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
> >> >  
> >> >  void kmem_cache_free(struct kmem_cache *s, void *objp);
> >> >  
> >> > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> >> > +				  slab_flags_t flags,
> >> > +				  unsigned int useroffset, unsigned int usersize,
> >> > +				  void (*ctor)(void *));
> >> 
> >> I'd drop the ctor, I can't imagine how it would be used with variable-sized
> >> allocations.
> > 
> > I've kept it because for "kmalloc wrapper" APIs, e.g. devm_kmalloc(),
> > there is some "housekeeping" that gets done explicitly right now that I
> > think would be better served by using a ctor in the future. These APIs
> > are variable-sized, but have a fixed size header, so they have a
> > "minimum size" that the ctor can still operate on, etc.
> > 
> >> Probably also "align" doesn't make much sense since we're just
> >> copying the kmalloc cache sizes and its implicit alignment of any
> >> power-of-two allocations.
> > 
> > Yeah, that's probably true. I kept it since I wanted to mirror
> > kmem_cache_create() to make this API more familiar looking.
> 
> Rust people were asking about kmalloc alignment (but I forgot the details)

It was me! The ask is whether we can specify the alignment for the
allocation API, for example, requesting a size=96 and align=32 memory,
or the allocation API could do a "best alignment", for example,
allocating a size=96 will give a align=32 memory. As far as I
understand, kmalloc() doesn't support this.

> so maybe this could be useful for them? CC rust-for-linux.
> 

I took a quick look as what kmem_buckets is, and seems to me that align
doesn't make sense here (and probably not useful in Rust as well)
because a kmem_buckets is a set of kmem_caches, each has its own object
size, making them share the same alignment is probably not what you
want. But I could be missing something.

Regards,
Boqun

> >> I don't think any current kmalloc user would
> >> suddenly need either of those as you convert it to buckets, and definitely
> >> not any user converted automatically by the code tagging.
> > 
> > Right, it's not needed for either the explicit users nor the future
> > automatic users. But since these arguments are available internally,
> > there seems to be future utility,  it's not fast path, and it made things
> > feel like the existing API, I'd kind of like to keep it.
> > 
> > But all that said, if you really don't want it, then sure I can drop
> > those arguments. Adding them back in the future shouldn't be too
> > much churn.
> 
> I guess we can keep it then.
>
Vlastimil Babka June 28, 2024, 8:40 a.m. UTC | #7
On 6/28/24 7:35 AM, Boqun Feng wrote:
> On Thu, Jun 20, 2024 at 10:43:39PM +0200, Vlastimil Babka wrote:
>> On 6/20/24 8:54 PM, Kees Cook wrote:
>> > On Thu, Jun 20, 2024 at 03:56:27PM +0200, Vlastimil Babka wrote:
>> >> > @@ -549,6 +549,11 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
>> >> >  
>> >> >  void kmem_cache_free(struct kmem_cache *s, void *objp);
>> >> >  
>> >> > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
>> >> > +				  slab_flags_t flags,
>> >> > +				  unsigned int useroffset, unsigned int usersize,
>> >> > +				  void (*ctor)(void *));
>> >> 
>> >> I'd drop the ctor, I can't imagine how it would be used with variable-sized
>> >> allocations.
>> > 
>> > I've kept it because for "kmalloc wrapper" APIs, e.g. devm_kmalloc(),
>> > there is some "housekeeping" that gets done explicitly right now that I
>> > think would be better served by using a ctor in the future. These APIs
>> > are variable-sized, but have a fixed size header, so they have a
>> > "minimum size" that the ctor can still operate on, etc.
>> > 
>> >> Probably also "align" doesn't make much sense since we're just
>> >> copying the kmalloc cache sizes and its implicit alignment of any
>> >> power-of-two allocations.
>> > 
>> > Yeah, that's probably true. I kept it since I wanted to mirror
>> > kmem_cache_create() to make this API more familiar looking.
>> 
>> Rust people were asking about kmalloc alignment (but I forgot the details)
> 
> It was me! The ask is whether we can specify the alignment for the
> allocation API, for example, requesting a size=96 and align=32 memory,
> or the allocation API could do a "best alignment", for example,
> allocating a size=96 will give a align=32 memory. As far as I
> understand, kmalloc() doesn't support this.

Hm yeah we only have guarantees for power-or-2 allocations.

>> so maybe this could be useful for them? CC rust-for-linux.
>> 
> 
> I took a quick look as what kmem_buckets is, and seems to me that align
> doesn't make sense here (and probably not useful in Rust as well)
> because a kmem_buckets is a set of kmem_caches, each has its own object
> size, making them share the same alignment is probably not what you
> want. But I could be missing something.

How flexible do you need those alignments to be? Besides the power-of-two
guarantees, we currently have only two odd sizes with 96 and 192. If those
were guaranteed to be aligned 32 bytes, would that be sufficient? Also do
you ever allocate anything smaller than 32 bytes then?

To summarize, if Rust's requirements can be summarized by some rules and
it's not completely ad-hoc per-allocation alignment requirement (or if it
is, does it have an upper bound?) we could perhaps figure out the creation
of rust-specific kmem_buckets to give it what's needed?

> Regards,
> Boqun
> 
>> >> I don't think any current kmalloc user would
>> >> suddenly need either of those as you convert it to buckets, and definitely
>> >> not any user converted automatically by the code tagging.
>> > 
>> > Right, it's not needed for either the explicit users nor the future
>> > automatic users. But since these arguments are available internally,
>> > there seems to be future utility,  it's not fast path, and it made things
>> > feel like the existing API, I'd kind of like to keep it.
>> > 
>> > But all that said, if you really don't want it, then sure I can drop
>> > those arguments. Adding them back in the future shouldn't be too
>> > much churn.
>> 
>> I guess we can keep it then.
>>
Alice Ryhl June 28, 2024, 9:06 a.m. UTC | #8
On Fri, Jun 28, 2024 at 10:40 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 6/28/24 7:35 AM, Boqun Feng wrote:
> > On Thu, Jun 20, 2024 at 10:43:39PM +0200, Vlastimil Babka wrote:
> >> On 6/20/24 8:54 PM, Kees Cook wrote:
> >> > On Thu, Jun 20, 2024 at 03:56:27PM +0200, Vlastimil Babka wrote:
> >> >> > @@ -549,6 +549,11 @@ void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
> >> >> >
> >> >> >  void kmem_cache_free(struct kmem_cache *s, void *objp);
> >> >> >
> >> >> > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> >> >> > +                                 slab_flags_t flags,
> >> >> > +                                 unsigned int useroffset, unsigned int usersize,
> >> >> > +                                 void (*ctor)(void *));
> >> >>
> >> >> I'd drop the ctor, I can't imagine how it would be used with variable-sized
> >> >> allocations.
> >> >
> >> > I've kept it because for "kmalloc wrapper" APIs, e.g. devm_kmalloc(),
> >> > there is some "housekeeping" that gets done explicitly right now that I
> >> > think would be better served by using a ctor in the future. These APIs
> >> > are variable-sized, but have a fixed size header, so they have a
> >> > "minimum size" that the ctor can still operate on, etc.
> >> >
> >> >> Probably also "align" doesn't make much sense since we're just
> >> >> copying the kmalloc cache sizes and its implicit alignment of any
> >> >> power-of-two allocations.
> >> >
> >> > Yeah, that's probably true. I kept it since I wanted to mirror
> >> > kmem_cache_create() to make this API more familiar looking.
> >>
> >> Rust people were asking about kmalloc alignment (but I forgot the details)
> >
> > It was me! The ask is whether we can specify the alignment for the
> > allocation API, for example, requesting a size=96 and align=32 memory,
> > or the allocation API could do a "best alignment", for example,
> > allocating a size=96 will give a align=32 memory. As far as I
> > understand, kmalloc() doesn't support this.
>
> Hm yeah we only have guarantees for power-or-2 allocations.
>
> >> so maybe this could be useful for them? CC rust-for-linux.
> >>
> >
> > I took a quick look as what kmem_buckets is, and seems to me that align
> > doesn't make sense here (and probably not useful in Rust as well)
> > because a kmem_buckets is a set of kmem_caches, each has its own object
> > size, making them share the same alignment is probably not what you
> > want. But I could be missing something.
>
> How flexible do you need those alignments to be? Besides the power-of-two
> guarantees, we currently have only two odd sizes with 96 and 192. If those
> were guaranteed to be aligned 32 bytes, would that be sufficient? Also do
> you ever allocate anything smaller than 32 bytes then?
>
> To summarize, if Rust's requirements can be summarized by some rules and
> it's not completely ad-hoc per-allocation alignment requirement (or if it
> is, does it have an upper bound?) we could perhaps figure out the creation
> of rust-specific kmem_buckets to give it what's needed?

Rust's allocator API can take any size and alignment as long as:

1. The alignment is a power of two.
2. The size is non-zero.
3. When you round up the size to the next multiple of the alignment,
then it must not overflow the signed type isize / ssize_t.

What happens right now is that when Rust wants an allocation with a
higher alignment than ARCH_SLAB_MINALIGN, then it will increase size
until it becomes a power of two so that the power-of-two guarantee
gives a properly aligned allocation.

Alice
Vlastimil Babka June 28, 2024, 9:17 a.m. UTC | #9
On 6/28/24 11:06 AM, Alice Ryhl wrote:
>> >>
>> >
>> > I took a quick look as what kmem_buckets is, and seems to me that align
>> > doesn't make sense here (and probably not useful in Rust as well)
>> > because a kmem_buckets is a set of kmem_caches, each has its own object
>> > size, making them share the same alignment is probably not what you
>> > want. But I could be missing something.
>>
>> How flexible do you need those alignments to be? Besides the power-of-two
>> guarantees, we currently have only two odd sizes with 96 and 192. If those
>> were guaranteed to be aligned 32 bytes, would that be sufficient? Also do
>> you ever allocate anything smaller than 32 bytes then?
>>
>> To summarize, if Rust's requirements can be summarized by some rules and
>> it's not completely ad-hoc per-allocation alignment requirement (or if it
>> is, does it have an upper bound?) we could perhaps figure out the creation
>> of rust-specific kmem_buckets to give it what's needed?
> 
> Rust's allocator API can take any size and alignment as long as:
> 
> 1. The alignment is a power of two.
> 2. The size is non-zero.
> 3. When you round up the size to the next multiple of the alignment,
> then it must not overflow the signed type isize / ssize_t.
> 
> What happens right now is that when Rust wants an allocation with a
> higher alignment than ARCH_SLAB_MINALIGN, then it will increase size
> until it becomes a power of two so that the power-of-two guarantee
> gives a properly aligned allocation.

So am I correct thinking that, if the cache of size 96 bytes guaranteed a
32byte alignment, and 192 bytes guaranteed 64byte alignment, and the rest of
sizes with the already guaranteed power-of-two alignment, then on rust side
you would only have to round up sizes to the next multiples of the alignemnt
(rule 3 above) and that would be sufficient?
 Abstracting from the specific sizes of 96 and 192, the guarantee on kmalloc
side would have to be - guarantee alignment to the largest power-of-two
divisor of the size. Does that sound right?

Then I think we could have some flag for kmem_buckets creation that would do
the right thing.

> Alice
Alice Ryhl June 28, 2024, 9:34 a.m. UTC | #10
On Fri, Jun 28, 2024 at 11:17 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 6/28/24 11:06 AM, Alice Ryhl wrote:
> >> >>
> >> >
> >> > I took a quick look as what kmem_buckets is, and seems to me that align
> >> > doesn't make sense here (and probably not useful in Rust as well)
> >> > because a kmem_buckets is a set of kmem_caches, each has its own object
> >> > size, making them share the same alignment is probably not what you
> >> > want. But I could be missing something.
> >>
> >> How flexible do you need those alignments to be? Besides the power-of-two
> >> guarantees, we currently have only two odd sizes with 96 and 192. If those
> >> were guaranteed to be aligned 32 bytes, would that be sufficient? Also do
> >> you ever allocate anything smaller than 32 bytes then?
> >>
> >> To summarize, if Rust's requirements can be summarized by some rules and
> >> it's not completely ad-hoc per-allocation alignment requirement (or if it
> >> is, does it have an upper bound?) we could perhaps figure out the creation
> >> of rust-specific kmem_buckets to give it what's needed?
> >
> > Rust's allocator API can take any size and alignment as long as:
> >
> > 1. The alignment is a power of two.
> > 2. The size is non-zero.
> > 3. When you round up the size to the next multiple of the alignment,
> > then it must not overflow the signed type isize / ssize_t.
> >
> > What happens right now is that when Rust wants an allocation with a
> > higher alignment than ARCH_SLAB_MINALIGN, then it will increase size
> > until it becomes a power of two so that the power-of-two guarantee
> > gives a properly aligned allocation.
>
> So am I correct thinking that, if the cache of size 96 bytes guaranteed a
> 32byte alignment, and 192 bytes guaranteed 64byte alignment, and the rest of
> sizes with the already guaranteed power-of-two alignment, then on rust side
> you would only have to round up sizes to the next multiples of the alignemnt
> (rule 3 above) and that would be sufficient?
>  Abstracting from the specific sizes of 96 and 192, the guarantee on kmalloc
> side would have to be - guarantee alignment to the largest power-of-two
> divisor of the size. Does that sound right?
>
> Then I think we could have some flag for kmem_buckets creation that would do
> the right thing.

If kmalloc/krealloc guarantee that an allocation is aligned according
to the largest power-of-two divisor of the size, then the Rust
allocator would definitely be simplified as we would not longer need
this part:

if layout.align() > bindings::ARCH_SLAB_MINALIGN {
    // The alignment requirement exceeds the slab guarantee, thus try
to enlarge the size
    // to use the "power-of-two" size/alignment guarantee (see
comments in `kmalloc()` for
    // more information).
    //
    // Note that `layout.size()` (after padding) is guaranteed to be a
multiple of
    // `layout.align()`, so `next_power_of_two` gives enough alignment
guarantee.
    size = size.next_power_of_two();
}

We would only need to keep the part that rounds up the size to a
multiple of the alignment.

Alice
Kees Cook June 28, 2024, 3:47 p.m. UTC | #11
On Thu, Jun 27, 2024 at 10:35:36PM -0700, Boqun Feng wrote:
> On Thu, Jun 20, 2024 at 10:43:39PM +0200, Vlastimil Babka wrote:
> > Rust people were asking about kmalloc alignment (but I forgot the details)
> 
> It was me! The ask is whether we can specify the alignment for the
> allocation API, for example, requesting a size=96 and align=32 memory,
> or the allocation API could do a "best alignment", for example,
> allocating a size=96 will give a align=32 memory. As far as I
> understand, kmalloc() doesn't support this.

I can drop the "align" argument. Do we want to hard-code a
per-cache-size alignment for the caches in a kmem_buckets collection?
Vlastimil Babka June 28, 2024, 4:53 p.m. UTC | #12
On 6/28/24 5:47 PM, Kees Cook wrote:
> On Thu, Jun 27, 2024 at 10:35:36PM -0700, Boqun Feng wrote:
>> On Thu, Jun 20, 2024 at 10:43:39PM +0200, Vlastimil Babka wrote:
>> > Rust people were asking about kmalloc alignment (but I forgot the details)
>> 
>> It was me! The ask is whether we can specify the alignment for the
>> allocation API, for example, requesting a size=96 and align=32 memory,
>> or the allocation API could do a "best alignment", for example,
>> allocating a size=96 will give a align=32 memory. As far as I
>> understand, kmalloc() doesn't support this.
> 
> I can drop the "align" argument. Do we want to hard-code a
> per-cache-size alignment for the caches in a kmem_buckets collection?

I think you can drop it as a single value is really ill suited for a
collection of different sizes.

As for Rust's requirements we could consider whether to add a special flag
if they use own bucket, or just implement the rules for non-power-of-two
size globally. It should be feasible as I think in the non-debug caches it
shouldn't in fact change the existing layout, which is the same situation as
when we codified the power-of-two caches alignment as guaranteed.
diff mbox series

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8d0800c7579a..3698b15b6138 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -549,6 +549,11 @@  void *kmem_cache_alloc_lru_noprof(struct kmem_cache *s, struct list_lru *lru,
 
 void kmem_cache_free(struct kmem_cache *s, void *objp);
 
+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+				  slab_flags_t flags,
+				  unsigned int useroffset, unsigned int usersize,
+				  void (*ctor)(void *));
+
 /*
  * Bulk allocation and freeing operations. These are accelerated in an
  * allocator specific way to avoid taking locks repeatedly or building
@@ -681,6 +686,12 @@  static __always_inline __alloc_size(1) void *kmalloc_noprof(size_t size, gfp_t f
 }
 #define kmalloc(...)				alloc_hooks(kmalloc_noprof(__VA_ARGS__))
 
+#define kmem_buckets_alloc(_b, _size, _flags)	\
+	alloc_hooks(__kmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE))
+
+#define kmem_buckets_alloc_track_caller(_b, _size, _flags)	\
+	alloc_hooks(__kmalloc_node_track_caller_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE, _RET_IP_))
+
 static __always_inline __alloc_size(1) void *kmalloc_node_noprof(size_t size, gfp_t flags, int node)
 {
 	if (__builtin_constant_p(size) && size) {
@@ -808,6 +819,8 @@  void *__kvmalloc_node_noprof(DECL_BUCKET_PARAMS(size, b), gfp_t flags, int node)
 #define kvzalloc(_size, _flags)			kvmalloc(_size, (_flags)|__GFP_ZERO)
 
 #define kvzalloc_node(_size, _flags, _node)	kvmalloc_node(_size, (_flags)|__GFP_ZERO, _node)
+#define kmem_buckets_valloc(_b, _size, _flags)	\
+	alloc_hooks(__kvmalloc_node_noprof(PASS_BUCKET_PARAMS(_size, _b), _flags, NUMA_NO_NODE))
 
 static inline __alloc_size(1, 2) void *
 kvmalloc_array_node_noprof(size_t n, size_t size, gfp_t flags, int node)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9b0f2ef951f1..453bc4ec8b57 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -392,6 +392,80 @@  kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
+static struct kmem_cache *kmem_buckets_cache __ro_after_init;
+
+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+				  slab_flags_t flags,
+				  unsigned int useroffset,
+				  unsigned int usersize,
+				  void (*ctor)(void *))
+{
+	kmem_buckets *b;
+	int idx;
+
+	/*
+	 * When the separate buckets API is not built in, just return
+	 * a non-NULL value for the kmem_buckets pointer, which will be
+	 * unused when performing allocations.
+	 */
+	if (!IS_ENABLED(CONFIG_SLAB_BUCKETS))
+		return ZERO_SIZE_PTR;
+
+	if (WARN_ON(!kmem_buckets_cache))
+		return NULL;
+
+	b = kmem_cache_alloc(kmem_buckets_cache, GFP_KERNEL|__GFP_ZERO);
+	if (WARN_ON(!b))
+		return NULL;
+
+	flags |= SLAB_NO_MERGE;
+
+	for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
+		char *short_size, *cache_name;
+		unsigned int cache_useroffset, cache_usersize;
+		unsigned int size;
+
+		if (!kmalloc_caches[KMALLOC_NORMAL][idx])
+			continue;
+
+		size = kmalloc_caches[KMALLOC_NORMAL][idx]->object_size;
+		if (!size)
+			continue;
+
+		short_size = strchr(kmalloc_caches[KMALLOC_NORMAL][idx]->name, '-');
+		if (WARN_ON(!short_size))
+			goto fail;
+
+		cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1);
+		if (WARN_ON(!cache_name))
+			goto fail;
+
+		if (useroffset >= size) {
+			cache_useroffset = 0;
+			cache_usersize = 0;
+		} else {
+			cache_useroffset = useroffset;
+			cache_usersize = min(size - cache_useroffset, usersize);
+		}
+		(*b)[idx] = kmem_cache_create_usercopy(cache_name, size,
+					align, flags, cache_useroffset,
+					cache_usersize, ctor);
+		kfree(cache_name);
+		if (WARN_ON(!(*b)[idx]))
+			goto fail;
+	}
+
+	return b;
+
+fail:
+	for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++)
+		kmem_cache_destroy((*b)[idx]);
+	kfree(b);
+
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_buckets_create);
+
 #ifdef SLAB_SUPPORTS_SYSFS
 /*
  * For a given kmem_cache, kmem_cache_destroy() should only be called
@@ -931,6 +1005,10 @@  void __init create_kmalloc_caches(void)
 
 	/* Kmalloc array is now usable */
 	slab_state = UP;
+
+	kmem_buckets_cache = kmem_cache_create("kmalloc_buckets",
+					       sizeof(kmem_buckets),
+					       0, SLAB_NO_MERGE, NULL);
 }
 
 /**