
[RFC,4/5] sbm: fix up calls to dynamic memory allocators

Message ID 20240222131230.635-5-petrtesarik@huaweicloud.com (mailing list archive)
State RFC
Series PoC: convert AppArmor parser to SandBox Mode

Commit Message

Petr Tesarik Feb. 22, 2024, 1:12 p.m. UTC
From: Petr Tesarik <petr.tesarik1@huawei-partners.com>

Add fixup functions to call kmalloc(), vmalloc() and friends on behalf of
the sandbox code. Each proxy forwards the call to the real allocator and,
on success, maps the returned object into the sandbox; the proxy_alloc1/2/3
variants differ only in which argument carries the allocation size, and
proxy_free forwards kfree()/vfree() (unmapping the freed range from the
sandbox is still a TODO).
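
For illustration, the dispatcher is expected to walk the fixup table
roughly as sketched below (the run_fixup() name, the "target" member name
and the lookup loop are assumptions of this sketch; the real dispatch code
already exists elsewhere in the series):

static unsigned long run_fixup(struct x86_sbm_state *state,
			       unsigned long func, struct pt_regs *regs)
{
	const struct sbm_fixup *fixup;

	/* Scan until the empty terminator entry. */
	for (fixup = fixups; fixup->target; ++fixup)
		if ((unsigned long)fixup->target == func)
			/* Run the proxy in kernel mode on behalf of the sandbox. */
			return fixup->proxy(state, func, regs);

	return 0;
}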

Signed-off-by: Petr Tesarik <petr.tesarik1@huawei-partners.com>
---
 arch/x86/kernel/sbm/core.c | 81 ++++++++++++++++++++++++++++++++++++++
 mm/slab_common.c           |  3 +-
 mm/slub.c                  | 17 ++++----
 mm/vmalloc.c               | 11 +++---
 4 files changed, 98 insertions(+), 14 deletions(-)

Comments

Dave Hansen Feb. 22, 2024, 3:51 p.m. UTC | #1
On 2/22/24 05:12, Petr Tesarik wrote:
>  static const struct sbm_fixup fixups[] =
>  {
> +	/* kmalloc() and friends */
> +	{ kmalloc_trace, proxy_alloc3 },
> +	{ __kmalloc, proxy_alloc1 },
> +	{ __kmalloc_node, proxy_alloc1 },
> +	{ __kmalloc_node_track_caller, proxy_alloc1 },
> +	{ kmalloc_large, proxy_alloc1 },
> +	{ kmalloc_large_node, proxy_alloc1 },
> +	{ krealloc, proxy_alloc2 },
> +	{ kfree, proxy_free },
> +
> +	/* vmalloc() and friends */
> +	{ vmalloc, proxy_alloc1 },
> +	{ __vmalloc, proxy_alloc1 },
> +	{ __vmalloc_node, proxy_alloc1 },
> +	{ vzalloc, proxy_alloc1 },
> +	{ vfree, proxy_free },
> +
>  	{ }
>  };

Petr, thanks for sending this.  This _is_ a pretty concise example of
what it means to convert kernel code to run in your sandbox mode.  But,
from me, it's still "no thanks".

Establishing and maintaining this proxy list will be painful.  Folks
will change the code to call something new and break this *constantly*.

That goes for infrastructure like the allocators and for individual
sandbox instances like apparmor.

It's also telling that sandboxing a bit of apparmor took four fixups.
That tells me we're probably still only looking at the tip of the iceberg
if we were to convert a bunch more sites.

That's on top of everything I was concerned about before.
Petr Tesařík Feb. 22, 2024, 5:57 p.m. UTC | #2
On Thu, 22 Feb 2024 07:51:00 -0800
Dave Hansen <dave.hansen@intel.com> wrote:

> On 2/22/24 05:12, Petr Tesarik wrote:
> >  static const struct sbm_fixup fixups[] =
> >  {
> > +	/* kmalloc() and friends */
> > +	{ kmalloc_trace, proxy_alloc3 },
> > +	{ __kmalloc, proxy_alloc1 },
> > +	{ __kmalloc_node, proxy_alloc1 },
> > +	{ __kmalloc_node_track_caller, proxy_alloc1 },
> > +	{ kmalloc_large, proxy_alloc1 },
> > +	{ kmalloc_large_node, proxy_alloc1 },
> > +	{ krealloc, proxy_alloc2 },
> > +	{ kfree, proxy_free },
> > +
> > +	/* vmalloc() and friends */
> > +	{ vmalloc, proxy_alloc1 },
> > +	{ __vmalloc, proxy_alloc1 },
> > +	{ __vmalloc_node, proxy_alloc1 },
> > +	{ vzalloc, proxy_alloc1 },
> > +	{ vfree, proxy_free },
> > +
> >  	{ }
> >  };  
> 
> Petr, thanks for sending this.  This _is_ a pretty concise example of
> what it means to convert kernel code to run in your sandbox mode.  But,
> from me, it's still "no thanks".
> 
> Establishing and maintaining this proxy list will be painful.  Folks
> will change the code to call something new and break this *constantly*.
> 
> That goes for infrastructure like the allocators and for individual
> sandbox instances like apparmor.

Understood.

OTOH the proxy list is here only for the PoC, so I could send something
that builds and runs without making the patch series overly big. As
explained in patch 5/5, the goal is not to have a global list. Instead,
each instance should define what it needs and thereby define its
specific policy for interfacing with the rest of the kernel.

To give an example, these AppArmor fixups would be added only to the
sandbox which runs aa_unpack(), but not to the one which runs
unpack_to_rootfs(), which is another PoC I did (though it required
porting more patches).
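
A minimal sketch of what such a per-instance list could look like,
reusing struct sbm_fixup and the proxy handlers from this patch (only
the idea of attaching the table to a single sandbox instance is assumed
here; that API does not exist yet):

static const struct sbm_fixup aa_unpack_fixups[] =
{
	/* e.g. only the allocator calls this instance actually needs */
	{ __kmalloc, proxy_alloc1 },
	{ krealloc, proxy_alloc2 },
	{ kfree, proxy_free },
	{ vmalloc, proxy_alloc1 },
	{ vfree, proxy_free },

	{ }
};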

If more fixups are needed after you change your code, you know you've
just added a new dependency. It's then up to you to decide if it was
intentional.

> It's also telling that sandboxing a bit of apparmor took four fixups.
> That tells me we're probably still only looking at the tip of the iceberg
> if we were to convert a bunch more sites.

Yes, it is the cost paid for getting code and data flows under control.

In your opinion this kind of memory safety is not worth the effort of
explicitly defining the interface between a sandboxed component and the
rest of the kernel, because it increases maintenance costs. Correct?

> That's on top of everything I was concerned about before.

Good, I think I can understand the new concern, but regarding
everything you were concerned about before, that part is still not
quite clear to me. Let me try to summarize the points:

* Running code in ring-0 is inherently safer than running code in
  ring-3.

  Since what I'm trying to do is protect kernel data structures
  from memory safety bugs in another part of the kernel, it roughly
  translates to: "Kernel data structures are better protected from
  rogue kernel modules than from userspace applications." This cannot
  possibly be what you are trying to say.

* SMAP, SMEP and/or LASS can somehow protect one part of the kernel
  from memory safety bugs in another part of the kernel.

  I somehow can't see how that is the case. I have always thought that:

  * SMEP prevents the kernel from executing code in user pages.
  * SMAP prevents the kernel from reading from or writing to user pages.
  * LASS does pretty much the same job as SMEP+SMAP, but instead of
    using page table protection bits, it uses the highest bit of the
    virtual address because that's much faster.

* Hardware designers are adding (other) hardware security defenses to
  ring-0 that are not applied to ring-3.

  Could you give an example of these other security defenses, please?

* Ring-3 is more exposed to attacks.

  This statement sounds a bit too vague on its own. What attack vectors
  are we talking about? The primary attack vector that SBM is trying to
  address is the exploitation of kernel code vulnerabilities triggered
  by data from sources outside the kernel (boot loader, userspace, etc.).

H. Peter Anvin added a few other points:

* SBM has all the downsides of a microkernel without the upsides.

  I can only guess what the downsides and upsides would be...

  One notorious downside is performance. Agreed, there is some overhead.
  I'm not promoting SBM for time-critical operations. But compared to
  user-mode helpers (which were suggested as an alternative for one of
  the proposed scenarios), the overhead of SBM is at least an order of
  magnitude lower.

  IPC and the need to define how servers interact with each other are
  further downsides I can think of. Yes, there is a bit of that in SBM,
  as you have correctly noted above.

* SBM introduces architectural changes that are most definitely *very*
  harmful both to maintainers and users.

  It is very difficult to learn something from this statement. Could
  you give some examples of how SBM harms either group, please?

* SBM feels like paravirtualization all over again.

  All right, hpa, you've had lots of pain with paravirtualization. I
  feel for you; I've had my share of it too. Can you imagine how much
  trouble I could have spared myself on the libkdumpfile project if I
  didn't have to deal with the difference between "physical addresses"
  and "machine addresses"?

  However, this is hardly a relevant point. The Linux kernel community
  is respected for making decisions based on facts, not feelings.

* SBM exposes kernel memory to user space.

  This is a misunderstanding. Sandbox mode does not share anything at
  all with user mode. It does share some CPU state with kernel mode,
  but not with user mode. And if "user space" was intended to mean
  "ring-3", that still doesn't explain why running there is a really
  bad idea.

* SBM is not needed, because there is already eBPF.

  Well, yes, but I believe they work on different levels. For example,
  eBPF needs a verifier to ensure memory safety. If you run eBPF code
  itself in a sandbox instead, that verifier is not needed, because
  memory safety is enforced by the CPU hardware.

When hpa says that SandBox Mode is "an enormous step in the wrong
direction", I want to understand why this direction is wrong, so I can
take a step in the right direction next time.

So far there has been only one objective concern: the need to track code
(and data) dependencies explicitly. AFAICS this is an inherent drawback
of any kind of program decomposition.

Is decomposition considered harmful?

Petr T
Dave Hansen Feb. 22, 2024, 6:03 p.m. UTC | #3
On 2/22/24 09:57, Petr Tesařík wrote:
> * Hardware designers are adding (other) hardware security defenses to
>   ring-0 that are not applied to ring-3.
> 
>   Could you give an example of these other security defenses, please?

Here's one example:

> https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/data-dependent-prefetcher.html

"DDP is neither trained by nor triggered by supervisor-mode accesses."

But seriously, this is going to be my last message on this topic.  I
appreciate your enthusiasm, but I don't see any viable way forward for
this approach.

Patch

diff --git a/arch/x86/kernel/sbm/core.c b/arch/x86/kernel/sbm/core.c
index c8ac7ecb08cc..3cf3842292b9 100644
--- a/arch/x86/kernel/sbm/core.c
+++ b/arch/x86/kernel/sbm/core.c
@@ -20,6 +20,12 @@ 
 #include <linux/sbm.h>
 #include <linux/sched/task_stack.h>
 
+/*
+ * FIXME: Remove these includes when there is proper API for defining
+ * which functions can be called from sandbox mode.
+ */
+#include <linux/vmalloc.h>
+
 #define GFP_SBM_PGTABLE	(GFP_KERNEL | __GFP_ZERO)
 #define PGD_ORDER	get_order(sizeof(pgd_t) * PTRS_PER_PGD)
 
@@ -52,8 +58,83 @@  struct sbm_fixup {
 	sbm_proxy_call_fn proxy;
 };
 
+static int map_range(struct x86_sbm_state *state, unsigned long start,
+		     unsigned long end, pgprot_t prot);
+
+/* Map the newly allocated dynamic memory region. */
+static unsigned long post_alloc(struct x86_sbm_state *state,
+				unsigned long objp, size_t size)
+{
+	int err;
+
+	if (!objp)
+		return objp;
+
+	err = map_range(state, objp, objp + size, PAGE_SHARED);
+	if (err) {
+		kfree((void*)objp);
+		return 0UL;
+	}
+	return objp;
+}
+
+/* Allocation proxy handler if size is the 1st parameter. */
+static unsigned long proxy_alloc1(struct x86_sbm_state *state,
+				    unsigned long func, struct pt_regs *regs)
+{
+	unsigned long objp;
+
+	objp = x86_sbm_proxy_call(state, func, regs);
+	return post_alloc(state, objp, regs->di);
+}
+
+/* Allocation proxy handler if size is the 2nd parameter. */
+static unsigned long proxy_alloc2(struct x86_sbm_state *state,
+				    unsigned long func, struct pt_regs *regs)
+{
+	unsigned long objp;
+
+	objp = x86_sbm_proxy_call(state, func, regs);
+	return post_alloc(state, objp, regs->si);
+}
+
+/* Allocation proxy handler if size is the 3rd parameter. */
+static unsigned long proxy_alloc3(struct x86_sbm_state *state,
+				    unsigned long func, struct pt_regs *regs)
+{
+	unsigned long objp;
+
+	objp = x86_sbm_proxy_call(state, func, regs);
+	return post_alloc(state, objp, regs->dx);
+}
+
+/* Proxy handler to free previously allocated memory. */
+static unsigned long proxy_free(struct x86_sbm_state *state,
+				unsigned long func, struct pt_regs *regs)
+{
+	/* TODO: unmap allocated addresses from sandbox! */
+	return x86_sbm_proxy_call(state, func, regs);
+}
+
 static const struct sbm_fixup fixups[] =
 {
+	/* kmalloc() and friends */
+	{ kmalloc_trace, proxy_alloc3 },
+	{ __kmalloc, proxy_alloc1 },
+	{ __kmalloc_node, proxy_alloc1 },
+	{ __kmalloc_node_track_caller, proxy_alloc1 },
+	{ kmalloc_large, proxy_alloc1 },
+	{ kmalloc_large_node, proxy_alloc1 },
+	{ krealloc, proxy_alloc2 },
+	{ kfree, proxy_free },
+
+	/* vmalloc() and friends */
+	{ vmalloc, proxy_alloc1 },
+	{ __vmalloc, proxy_alloc1 },
+	{ __vmalloc_node, proxy_alloc1 },
+	{ vzalloc, proxy_alloc1 },
+	{ vfree, proxy_free },
+
 	{ }
 };
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 238293b1dbe1..2b72118d9bfa 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -28,6 +28,7 @@ 
 #include <asm/page.h>
 #include <linux/memcontrol.h>
 #include <linux/stackdepot.h>
+#include <linux/sbm.h>
 
 #include "internal.h"
 #include "slab.h"
@@ -1208,7 +1209,7 @@  __do_krealloc(const void *p, size_t new_size, gfp_t flags)
  *
  * Return: pointer to the allocated memory or %NULL in case of error
  */
-void *krealloc(const void *p, size_t new_size, gfp_t flags)
+void * __nosbm krealloc(const void *p, size_t new_size, gfp_t flags)
 {
 	void *ret;
 
diff --git a/mm/slub.c b/mm/slub.c
index 2ef88bbf56a3..5f2290fe4df0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -42,6 +42,7 @@ 
 #include <kunit/test.h>
 #include <kunit/test-bug.h>
 #include <linux/sort.h>
+#include <linux/sbm.h>
 
 #include <linux/debugfs.h>
 #include <trace/events/kmem.h>
@@ -3913,7 +3914,7 @@  EXPORT_SYMBOL(kmem_cache_alloc_node);
  * directly to the page allocator. We use __GFP_COMP, because we will need to
  * know the allocation order to free the pages properly in kfree.
  */
-static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
+static void * __nosbm __kmalloc_large_node(size_t size, gfp_t flags, int node)
 {
 	struct folio *folio;
 	void *ptr = NULL;
@@ -3938,7 +3939,7 @@  static void *__kmalloc_large_node(size_t size, gfp_t flags, int node)
 	return ptr;
 }
 
-void *kmalloc_large(size_t size, gfp_t flags)
+void * __nosbm kmalloc_large(size_t size, gfp_t flags)
 {
 	void *ret = __kmalloc_large_node(size, flags, NUMA_NO_NODE);
 
@@ -3983,26 +3984,26 @@  void *__do_kmalloc_node(size_t size, gfp_t flags, int node,
 	return ret;
 }
 
-void *__kmalloc_node(size_t size, gfp_t flags, int node)
+void * __nosbm __kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	return __do_kmalloc_node(size, flags, node, _RET_IP_);
 }
 EXPORT_SYMBOL(__kmalloc_node);
 
-void *__kmalloc(size_t size, gfp_t flags)
+void * __nosbm __kmalloc(size_t size, gfp_t flags)
 {
 	return __do_kmalloc_node(size, flags, NUMA_NO_NODE, _RET_IP_);
 }
 EXPORT_SYMBOL(__kmalloc);
 
-void *__kmalloc_node_track_caller(size_t size, gfp_t flags,
-				  int node, unsigned long caller)
+void * __nosbm __kmalloc_node_track_caller(size_t size, gfp_t flags,
+					   int node, unsigned long caller)
 {
 	return __do_kmalloc_node(size, flags, node, caller);
 }
 EXPORT_SYMBOL(__kmalloc_node_track_caller);
 
-void *kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
+void * __nosbm kmalloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
 {
 	void *ret = slab_alloc_node(s, NULL, gfpflags, NUMA_NO_NODE,
 					    _RET_IP_, size);
@@ -4386,7 +4387,7 @@  static void free_large_kmalloc(struct folio *folio, void *object)
  *
  * If @object is NULL, no operation is performed.
  */
-void kfree(const void *object)
+void __nosbm kfree(const void *object)
 {
 	struct folio *folio;
 	struct slab *slab;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d12a17fc0c17..d7a5b715ac03 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -40,6 +40,7 @@ 
 #include <linux/pgtable.h>
 #include <linux/hugetlb.h>
 #include <linux/sched/mm.h>
+#include <linux/sbm.h>
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
@@ -2804,7 +2805,7 @@  void vfree_atomic(const void *addr)
  * if we have CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG, but making the calling
  * conventions for vfree() arch-dependent would be a really bad idea).
  */
-void vfree(const void *addr)
+void __nosbm vfree(const void *addr)
 {
 	struct vm_struct *vm;
 	int i;
@@ -3379,7 +3380,7 @@  void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *__vmalloc_node(unsigned long size, unsigned long align,
+void * __nosbm __vmalloc_node(unsigned long size, unsigned long align,
 			    gfp_t gfp_mask, int node, const void *caller)
 {
 	return __vmalloc_node_range(size, align, VMALLOC_START, VMALLOC_END,
@@ -3394,7 +3395,7 @@  void *__vmalloc_node(unsigned long size, unsigned long align,
 EXPORT_SYMBOL_GPL(__vmalloc_node);
 #endif
 
-void *__vmalloc(unsigned long size, gfp_t gfp_mask)
+void * __nosbm __vmalloc(unsigned long size, gfp_t gfp_mask)
 {
 	return __vmalloc_node(size, 1, gfp_mask, NUMA_NO_NODE,
 				__builtin_return_address(0));
@@ -3413,7 +3414,7 @@  EXPORT_SYMBOL(__vmalloc);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vmalloc(unsigned long size)
+void * __nosbm vmalloc(unsigned long size)
 {
 	return __vmalloc_node(size, 1, GFP_KERNEL, NUMA_NO_NODE,
 				__builtin_return_address(0));
@@ -3453,7 +3454,7 @@  EXPORT_SYMBOL_GPL(vmalloc_huge);
  *
  * Return: pointer to the allocated memory or %NULL on error
  */
-void *vzalloc(unsigned long size)
+void * __nosbm vzalloc(unsigned long size)
 {
 	return __vmalloc_node(size, 1, GFP_KERNEL | __GFP_ZERO, NUMA_NO_NODE,
 				__builtin_return_address(0));