
[v2,4/9] slab: Introduce kmem_buckets_create()

Message ID: 20240305101026.694758-4-keescook@chromium.org
State: Superseded
Series: slab: Introduce dedicated bucket allocator

Commit Message

Kees Cook March 5, 2024, 10:10 a.m. UTC
Dedicated caches are available for fixed size allocations via
kmem_cache_alloc(), but for dynamically sized allocations there is only
the global kmalloc API's set of buckets available. This means it isn't
possible to separate specific sets of dynamically sized allocations into
a separate collection of caches.

This leads to a use-after-free exploitation weakness in the Linux
kernel since many heap memory spraying/grooming attacks depend on using
userspace-controllable dynamically sized allocations to collide with
fixed size allocations that end up in the same cache.

While CONFIG_RANDOM_KMALLOC_CACHES provides a probabilistic defense
against these kinds of "type confusion" attacks, including for fixed
same-size heap objects, we can create a complementary deterministic
defense for dynamically sized allocations.

In order to isolate user-controllable sized allocations from system
allocations, introduce kmem_buckets_create(), which behaves like
kmem_cache_create(). (The next patch will introduce kmem_buckets_alloc(),
which behaves like kmem_cache_alloc().)

This allows for confining allocations to a dedicated set of sized caches
(which have the same layout as the kmalloc caches).

Once codetag allocation annotations exist, this can also be used to
implement per-caller allocation cache isolation[1] even for dynamic
allocations.

Link: https://lore.kernel.org/lkml/202402211449.401382D2AF@keescook [1]
Signed-off-by: Kees Cook <keescook@chromium.org>
---
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-mm@kvack.org
---
 include/linux/slab.h |  5 +++
 mm/slab_common.c     | 72 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+)
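
For illustration, a minimal usage sketch of the API this patch adds. It
assumes the kmem_buckets type introduced earlier in the series, the
kmem_buckets_alloc() helper that only arrives in the next patch, and a
hypothetical "foo" subsystem; a 0/UINT_MAX useroffset/usersize relies on
the per-bucket clamping in this patch to grant the full usercopy window:

	static kmem_buckets *foo_buckets __ro_after_init;

	static int __init foo_init(void)
	{
		/* One dedicated set of kmalloc-style caches ("foo-8",
		 * "foo-16", ...) for this subsystem's allocations. */
		foo_buckets = kmem_buckets_create("foo", 0, SLAB_ACCOUNT,
						  0, UINT_MAX, NULL);
		return foo_buckets ? 0 : -ENOMEM;
	}

	static void *foo_alloc(size_t len)
	{
		/* Dynamically sized allocations land in the "foo-*"
		 * caches, isolated from the global kmalloc-* caches. */
		return kmem_buckets_alloc(foo_buckets, len, GFP_KERNEL);
	}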

Comments

Kent Overstreet March 25, 2024, 7:40 p.m. UTC | #1
On Tue, Mar 05, 2024 at 02:10:20AM -0800, Kees Cook wrote:
> [...]
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index f26ac9a6ef9f..058d0e3cd181 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -493,6 +493,11 @@ void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
>  			   gfp_t gfpflags) __assume_slab_alignment __malloc;
>  void kmem_cache_free(struct kmem_cache *s, void *objp);
>  
> +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> +				  slab_flags_t flags,
> +				  unsigned int useroffset, unsigned int usersize,
> +				  void (*ctor)(void *));

I'd prefer an API that initializes an object over one that allocates it
- that is, prefer

kmem_buckets_init(kmem_buckets *buckets, ...)

By forcing it to be separately allocated, you're adding a pointer deref
to every access.

That would also allow for kmem_buckets to be lazily initialized, which
would play nicely with declaring the kmem_buckets in the alloc_hooks()
macro.
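
For illustration, the init-style shape being suggested might look like
this (a hypothetical signature, mirroring the create() arguments above):

	/* Hypothetical: the caller embeds the kmem_buckets, so there is
	 * no separate allocation and no extra pointer to chase per access. */
	int kmem_buckets_init(kmem_buckets *b, const char *name,
			      unsigned int align, slab_flags_t flags,
			      unsigned int useroffset, unsigned int usersize,
			      void (*ctor)(void *));

	static kmem_buckets foo_buckets;	/* embedded, can be set up lazily */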

I'm curious what all the arguments to kmem_buckets_create() are needed
for, if this is supposed to be a replacement for kmalloc() users.
Kees Cook March 25, 2024, 8:40 p.m. UTC | #2
On Mon, Mar 25, 2024 at 03:40:51PM -0400, Kent Overstreet wrote:
> On Tue, Mar 05, 2024 at 02:10:20AM -0800, Kees Cook wrote:
> > [...]
> > +kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
> > +				  slab_flags_t flags,
> > +				  unsigned int useroffset, unsigned int usersize,
> > +				  void (*ctor)(void *));
> 
> I'd prefer an API that initializes an object over one that allocates it
> - that is, prefer
> 
> kmem_buckets_init(kmem_buckets *buckets, ...)

Sure, that can work. kmem_cache_init() would need to exist for the same
reason though.

> 
> By forcing it to be separately allocated, you're adding a pointer deref
> to every access.

I don't understand what you mean here. "every access"? I take a guess
below...

> That would also allow for kmem_buckets to be lazily initialized, which
> would play nicely with declaring the kmem_buckets in the alloc_hooks() macro.

Sure, I think it'll depend on how the per-site allocations got wired up.
I think you're meaning to include a full copy of the kmem cache/bucket
struct with the codetag instead of just a pointer? I don't think that'll
work well to make it runtime selectable, and I don't see it using an
extra deref -- allocations already get the struct from somewhere and
deref it. The only change is where to find the struct.

> I'm curious what all the arguments to kmem_buckets_create() are needed
> for, if this is supposed to be a replacement for kmalloc() users.

Are you confusing kmem_buckets_create() with kmem_buckets_alloc()? These
args are needed to initialize the per-bucket caches, just like is
already done for the global kmalloc per-bucket caches. This mirrors
kmem_cache_create(). (Or more specifically, calls kmem_cache_create()
for each bucket size, so the args need to be passed through.)
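
(Concretely, per the kasprintf() in the patch, a set created with the
name "foo" yields caches named "foo-8", "foo-16", ..., "foo-8k" -- one
per kmalloc bucket size, with the exact set of sizes depending on the
kernel config.)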

If you mean "why expose these arguments because they can just use the
existing defaults already used by the global kmalloc caches" then I
would say, it's to gain the benefit here of narrowing the scope of the
usercopy offsets. Right now kmalloc is forced to allow the full usercopy
window into an allocation, but we don't have to do this any more. For
example, see patch 8, where struct msg_msg doesn't need to expose the
header to userspace:

	msg_buckets = kmem_buckets_create("msg_msg", 0, SLAB_ACCOUNT,
					  sizeof(struct msg_msg),
					  DATALEN_MSG, NULL);

Only DATALEN_MSG bytes, starting at offset sizeof(struct msg_msg), will
be allowed to be copied in/out of userspace. Before, the window was
unbounded.
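
(To make the clamping concrete: each sized cache gets cache_usersize =
min(size - useroffset, usersize). Assuming sizeof(struct msg_msg) is 48
bytes on a 64-bit build, the "msg_msg-64" cache would get a 16-byte
window starting at offset 48, and buckets smaller than 48 bytes would
get no usercopy window at all.)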

-Kees
Kent Overstreet March 25, 2024, 9:49 p.m. UTC | #3
On Mon, Mar 25, 2024 at 01:40:34PM -0700, Kees Cook wrote:
> On Mon, Mar 25, 2024 at 03:40:51PM -0400, Kent Overstreet wrote:
> > [...]
> > 
> > I'd prefer an API that initializes an object over one that allocates it
> > - that is, prefer
> > 
> > kmem_buckets_init(kmem_buckets *buckets, ...)
> 
> Sure, that can work. kmem_cache_init() would need to exist for the same
> reason though.

That'll be a very worthwhile addition too; IPC when running kernel code
is always poor, and dependent loads are a big part of that.

I did mempool_init() and bioset_init() awhile back, so it's someone
else's turn for this one :)

> Sure, I think it'll depend on how the per-site allocations got wired up.
> I think you're meaning to include a full copy of the kmem cache/bucket
> struct with the codetag instead of just a pointer? I don't think that'll
> work well to make it runtime selectable, and I don't see it using an
> extra deref -- allocations already get the struct from somewhere and
> deref it. The only change is where to find the struct.

The codetags are in their own dedicated elf sections already, so if you
put the kmem_buckets in the codetag the entire elf section can be
discarded if it's not in use.

Also, the issue isn't derefs - it's dependent loads and locality. Taking
the address of the kmem_buckets to pass it is fine; the data referred to
will still get pulled into cache when we touch the codetag. If it's
behind a pointer, we have to pull the codetag into cache, wait for that
so we can get the kmem_buckets pointer - and only then start to pull in
the kmem_buckets itself.

If it's a cache miss you just slowed the entire allocation down by
around 30 ns.
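
A sketch of the two layouts under discussion (the struct and field names
are illustrative, not the real codetag definitions; the array length
matches ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]), which is
config-dependent -- 14 here is just a placeholder):

	/* Array of kmalloc-style caches, as this series defines it. */
	typedef struct kmem_cache *kmem_buckets[14];

	struct codetag_with_ptr {
		/* Load the codetag, then a dependent load for the buckets:
		 * two potential cache misses in series. */
		kmem_buckets *buckets;
	};

	struct codetag_embedded {
		/* Buckets sit in the codetag's own cache lines; touching
		 * the codetag pulls them in without a second dependent load. */
		kmem_buckets buckets;
	};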

> > I'm curious what all the arguments to kmem_buckets_create() are needed
> > for, if this is supposed to be a replacement for kmalloc() users.
> 
> Are you confusing kmem_buckets_create() with kmem_buckets_alloc()? These
> args are needed to initialize the per-bucket caches, just like is
> already done for the global kmalloc per-bucket caches. This mirrors
> kmem_cache_create(). (Or more specifically, calls kmem_cache_create()
> for each bucket size, so the args need to be passed through.)
> 
> If you mean "why expose these arguments because they can just use the
> existing defaults already used by the global kmalloc caches" then I
> would say, it's to gain the benefit here of narrowing the scope of the
> usercopy offsets. Right now kmalloc is forced to allow the full usercopy
> window into an allocation, but we don't have to do this any more. For
> example, see patch 8, where struct msg_msg doesn't need to expose the
> header to userspace:

"usercopy window"? You're now annotating which data can be copied to
userspace?

I'm skeptical, this looks like defensive programming gone amuck to me.
 
> 	msg_buckets = kmem_buckets_create("msg_msg", 0, SLAB_ACCOUNT,
> 					  sizeof(struct msg_msg),
> 					  DATALEN_MSG, NULL);
Kees Cook March 25, 2024, 11:13 p.m. UTC | #4
On Mon, Mar 25, 2024 at 05:49:49PM -0400, Kent Overstreet wrote:
> The codetags are in their own dedicated elf sections already, so if you
> put the kmem_buckets in the codetag the entire elf section can be
> discarded if it's not in use.

Gotcha. Yeah, sounds good. Once codetags and this series land, I can
start working on the per-site series.

> "usercopy window"? You're now annotating which data can be copied to
> userspace?

Hm? Yes. That's been there for over 7 years. :) It's just that it was only
meaningful for kmem_cache_create() users, since the proposed GFP_USERCOPY
for kmalloc() never landed[1].

-Kees

[1] https://lore.kernel.org/lkml/1497915397-93805-23-git-send-email-keescook@chromium.org/

Patch

diff --git a/include/linux/slab.h b/include/linux/slab.h
index f26ac9a6ef9f..058d0e3cd181 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -493,6 +493,11 @@ void *kmem_cache_alloc_lru(struct kmem_cache *s, struct list_lru *lru,
 			   gfp_t gfpflags) __assume_slab_alignment __malloc;
 void kmem_cache_free(struct kmem_cache *s, void *objp);
 
+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+				  slab_flags_t flags,
+				  unsigned int useroffset, unsigned int usersize,
+				  void (*ctor)(void *));
+
 /*
  * Bulk allocation and freeing operations. These are accelerated in an
  * allocator specific way to avoid taking locks repeatedly or building
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 1d0f25b6ae91..03ba9aac96b6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -392,6 +392,74 @@ kmem_cache_create(const char *name, unsigned int size, unsigned int align,
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
+static struct kmem_cache *kmem_buckets_cache __ro_after_init;
+
+kmem_buckets *kmem_buckets_create(const char *name, unsigned int align,
+				  slab_flags_t flags,
+				  unsigned int useroffset,
+				  unsigned int usersize,
+				  void (*ctor)(void *))
+{
+	kmem_buckets *b;
+	int idx;
+
+	if (WARN_ON(!kmem_buckets_cache))
+		return NULL;
+
+	b = kmem_cache_alloc(kmem_buckets_cache, GFP_KERNEL|__GFP_ZERO);
+	if (WARN_ON(!b))
+		return NULL;
+
+	flags |= SLAB_NO_MERGE;
+
+	for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
+		char *short_size, *cache_name;
+		unsigned int cache_useroffset, cache_usersize;
+		unsigned int size;
+
+		if (!kmalloc_caches[KMALLOC_NORMAL][idx])
+			continue;
+
+		size = kmalloc_caches[KMALLOC_NORMAL][idx]->object_size;
+		if (!size)
+			continue;
+
+		short_size = strchr(kmalloc_caches[KMALLOC_NORMAL][idx]->name, '-');
+		if (WARN_ON(!short_size))
+			goto fail;
+
+		cache_name = kasprintf(GFP_KERNEL, "%s-%s", name, short_size + 1);
+		if (WARN_ON(!cache_name))
+			goto fail;
+
+		if (useroffset >= size) {
+			cache_useroffset = 0;
+			cache_usersize = 0;
+		} else {
+			cache_useroffset = useroffset;
+			cache_usersize = min(size - cache_useroffset, usersize);
+		}
+		(*b)[idx] = kmem_cache_create_usercopy(cache_name, size,
+					align, flags, cache_useroffset,
+					cache_usersize, ctor);
+		kfree(cache_name);
+		if (WARN_ON(!(*b)[idx]))
+			goto fail;
+	}
+
+	return b;
+
+fail:
+	for (idx = 0; idx < ARRAY_SIZE(kmalloc_caches[KMALLOC_NORMAL]); idx++) {
+		if ((*b)[idx])
+			kmem_cache_destroy((*b)[idx]);
+	}
+	kfree(b);
+
+	return NULL;
+}
+EXPORT_SYMBOL(kmem_buckets_create);
+
 #ifdef SLAB_SUPPORTS_SYSFS
 /*
  * For a given kmem_cache, kmem_cache_destroy() should only be called
@@ -933,6 +1001,10 @@ void __init create_kmalloc_caches(slab_flags_t flags)
 
 	/* Kmalloc array is now usable */
 	slab_state = UP;
+
+	kmem_buckets_cache = kmem_cache_create("kmalloc_buckets",
+					       sizeof(kmem_buckets),
+					       0, 0, NULL);
 }
 
 /**