
[POC,RFC,1/2] mm/x86: Add wildcard * option as memmap=nn*align:name

Message ID 20240409211351.075320273@goodmis.org (mailing list archive)
State Superseded
Series pstore/mm/x86: Add wildcard memmap to map pstore consistently

Commit Message

Steven Rostedt April 9, 2024, 9:02 p.m. UTC
From: "Steven Rostedt (Google)" <rostedt@goodmis.org>

In order to allow for requesting a memory region that can be used for
things like pstore on multiple machines where the memory is not the same,
add a new option to the memmap=nn$ kernel command line.

The memmap=nn$addr will reserve nn amount of memory at the physical
address addr. To use this, one must know the physical memory layout and
know where usable memory exists in the physical layout.

Add a '*' option that will assign memory by looking for a range that can
fit the given size and alignment. It will start at the high addresses, and
then work its way down.

The format is:  memmap=nn*align:name

Where it will find nn amount of memory at the given alignment of align.
The name field is to allow another subsystem to retrieve where the memory
was found. For example:

  memmap=12M*4096:oops ramoops.mem_name=oops

Where ramoops.mem_name will tell ramoops that memory was reserved for it
via the wildcard '*' option and it can find it by calling:

  if (memmap_named("oops", &start, &size)) {
	// start holds the start address and size holds the size given
  }

Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
---
 arch/x86/kernel/e820.c | 91 ++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h     |  2 +
 mm/memory.c            |  7 ++++
 3 files changed, 100 insertions(+)

Comments

Kees Cook April 9, 2024, 10:23 p.m. UTC | #1
On Tue, Apr 09, 2024 at 05:02:55PM -0400, Steven Rostedt wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> In order to allow for requesting a memory region that can be used for
> things like pstore on multiple machines where the memory is not the same,
> add a new option to the memmap=nn$ kernel command line.
> 
> The memmap=nn$addr will reserve nn amount of memory at the physical
> address addr. To use this, one must know the physical memory layout and
> know where usable memory exists in the physical layout.
> 
> Add a '*' option that will assign memory by looking for a range that can
> fit the given size and alignment. It will start at the high addresses, and
> then work its way down.
> 
> The format is:  memmap=nn*align:name
> 
> Where it will find nn amount of memory at the given alignment of align.
> The name field is to allow another subsystem to retrieve where the memory
> was found. For example:
> 
>   memmap=12M*4096:oops ramoops.mem_name=oops
> 
> Where ramoops.mem_name will tell ramoops that memory was reserved for it
> via the wildcard '*' option and it can find it by calling:
> 
>   if (memmap_named("oops", &start, &size)) {
> 	// start holds the start address and size holds the size given
>   }
> 
> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> ---
>  arch/x86/kernel/e820.c | 91 ++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h     |  2 +
>  mm/memory.c            |  7 ++++
>  3 files changed, 100 insertions(+)

Do we need to involve e820 at all? I think it might be possible to just
have pstore call request_mem_region() very early? Or does KASLR make
that unstable?
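
For reference, a minimal sketch of the request_mem_region() route suggested here (illustrative only; the "ramoops" label and the helper are assumptions, not part of the patch):

  #include <linux/init.h>
  #include <linux/ioport.h>
  #include <linux/types.h>

  /* Claim the physical range in the iomem resource tree so nothing
   * else hands it out later; fails with -EBUSY if already taken. */
  static int __init pstore_claim_region(u64 start, u64 size)
  {
  	if (!request_mem_region(start, size, "ramoops"))
  		return -EBUSY;
  	return 0;
  }
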
Steven Rostedt April 9, 2024, 11:11 p.m. UTC | #2
On Tue, 9 Apr 2024 15:23:07 -0700
Kees Cook <keescook@chromium.org> wrote:

> Do we need to involve e820 at all? I think it might be possible to just
> have pstore call request_mem_region() very early? Or does KASLR make
> that unstable?

Yeah, would that give the same physical memory each boot, and can we
guarantee that KASLR will not map the kernel over the previous location?

-- Steve
Kees Cook April 9, 2024, 11:41 p.m. UTC | #3
On Tue, Apr 09, 2024 at 07:11:56PM -0400, Steven Rostedt wrote:
> On Tue, 9 Apr 2024 15:23:07 -0700
> Kees Cook <keescook@chromium.org> wrote:
> 
> > Do we need to involve e820 at all? I think it might be possible to just
> > have pstore call request_mem_region() very early? Or does KASLR make
> > that unstable?
> 
> Yeah, would that give the same physical memory each boot, and can we
> guarantee that KASLR will not map the kernel over the previous location?

Hm, no, for physical memory it needs to get excluded very early, which
means e820. So, yeah, your proposal makes sense. I'm not super excited
about this being x86-only though. What does arm64 do for memmap?
Mike Rapoport April 12, 2024, 8:59 p.m. UTC | #4
On Tue, Apr 09, 2024 at 04:41:24PM -0700, Kees Cook wrote:
> On Tue, Apr 09, 2024 at 07:11:56PM -0400, Steven Rostedt wrote:
> > On Tue, 9 Apr 2024 15:23:07 -0700
> > Kees Cook <keescook@chromium.org> wrote:
> > 
> > > Do we need to involve e820 at all? I think it might be possible to just
> > > have pstore call request_mem_region() very early? Or does KASLR make
> > > that unstable?
> > 
> > Yeah, would that give the same physical memory each boot, and can we
> > guarantee that KASLR will not map the kernel over the previous location?
> 
> Hm, no, for physical memory it needs to get excluded very early, which
> means e820.

Whatever memory is reserved in arch/x86/kernel/e820.c, that happens after
kaslr, so to begin with, a new memmap parameter should also be added to
parse_memmap in arch/x86/boot/compressed/kaslr.c to ensure the same
physical address will be available after KASLR.

More generally, memmap= is x86-specific and a bit of a hack.
Why not add a new kernel parameter that will be parsed in, say, 
mm/mm_init.c and will create the mmap_map (or whatever it will be named)
and reserve that memory in memblock rather than in e820?

This still will require an update to arch/x86/boot/compressed/kaslr.c of
course.

> So, yeah, your proposal makes sense. I'm not super excited
> about this being x86-only though. What does arm64 do for memmap?
> 
> -- 
> Kees Cook
>
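
A minimal sketch of the memblock route suggested above, under the assumption of a hypothetical early-boot helper; none of these names come from the patch:

  #include <linux/init.h>
  #include <linux/memblock.h>

  /* Find and reserve a suitably aligned range; memblock allocates
   * top-down by default, mirroring this patch's e820 search.
   * Returns 0 on failure. */
  static phys_addr_t __init reserve_named_region(phys_addr_t size,
  					       phys_addr_t align)
  {
  	return memblock_phys_alloc(size, align);
  }
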
Steven Rostedt April 12, 2024, 10:19 p.m. UTC | #5
On Fri, 12 Apr 2024 23:59:07 +0300
Mike Rapoport <rppt@kernel.org> wrote:

> On Tue, Apr 09, 2024 at 04:41:24PM -0700, Kees Cook wrote:
> > On Tue, Apr 09, 2024 at 07:11:56PM -0400, Steven Rostedt wrote:  
> > > On Tue, 9 Apr 2024 15:23:07 -0700
> > > Kees Cook <keescook@chromium.org> wrote:
> > >   
> > > > Do we need to involve e820 at all? I think it might be possible to just
> > > > have pstore call request_mem_region() very early? Or does KASLR make
> > > > that unstable?  
> > > 
> > > Yeah, would that give the same physical memory each boot, and can we
> > > guarantee that KASLR will not map the kernel over the previous location?  
> > 
> > Hm, no, for physical memory it needs to get excluded very early, which
> > means e820.  
> 
> Whatever memory is reserved in arch/x86/kernel/e820.c, that happens after
> kaslr, so to begin with, a new memmap parameter should also be added to
> parse_memmap in arch/x86/boot/compressed/kaslr.c to ensure the same
> physical address will be available after KASLR.

But doesn't KASLR only affect virtual memory not physical memory?

This just makes sure the physical memory it finds will not be used by the
system. Then ramoops does the mapping via vmap() I believe, to get a
virtual address to access the physical address.

> 
> More generally, memmap= is x86-specific and a bit of a hack.
> Why not add a new kernel parameter that will be parsed in, say, 
> mm/mm_init.c and will create the mmap_map (or whatever it will be named)
> and reserve that memory in memblock rather than in e820?

Hmm, I only took this approach because I'm familiar with the memmap hack and
extended it. But yeah, if I can do the same thing in mm_init.c it
could possibly work for all archs. Thanks for the suggestion, I'll play
with that.

> 
> This still will require an update to arch/x86/boot/compressed/kaslr.c of
> course.

Oh, is the issue that if KASLR maps the kernel over this location, then we lose
it? We need to tell KASLR not to touch this location?

-- Steve
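
As a rough illustration of the mapping step described above, with memremap() standing in (an assumption) for the vmap() path ramoops actually takes:

  #include <linux/io.h>
  #include <linux/types.h>

  /* Map the reserved RAM with a cacheable (write-back) mapping;
   * start/size would come from memmap_named(). Returns NULL on
   * failure. */
  static void *map_reserved_region(u64 start, u64 size)
  {
  	return memremap(start, size, MEMREMAP_WB);
  }
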
Kees Cook April 15, 2024, 5:22 p.m. UTC | #6
On Fri, Apr 12, 2024 at 06:19:40PM -0400, Steven Rostedt wrote:
> On Fri, 12 Apr 2024 23:59:07 +0300
> Mike Rapoport <rppt@kernel.org> wrote:
> 
> > On Tue, Apr 09, 2024 at 04:41:24PM -0700, Kees Cook wrote:
> > > On Tue, Apr 09, 2024 at 07:11:56PM -0400, Steven Rostedt wrote:  
> > > > On Tue, 9 Apr 2024 15:23:07 -0700
> > > > Kees Cook <keescook@chromium.org> wrote:
> > > >   
> > > > > Do we need to involve e820 at all? I think it might be possible to just
> > > > > have pstore call request_mem_region() very early? Or does KASLR make
> > > > > that unstable?  
> > > > 
> > > > Yeah, would that give the same physical memory each boot, and can we
> > > > guarantee that KASLR will not map the kernel over the previous location?  
> > > 
> > > Hm, no, for physical memory it needs to get excluded very early, which
> > > means e820.  
> > 
> > Whatever memory is reserved in arch/x86/kernel/e820.c, that happens after
> > kaslr, so to begin with, a new memmap parameter should also be added to
> > parse_memmap in arch/x86/boot/compressed/kaslr.c to ensure the same
> > physical address will be available after KASLR.
> 
> But doesn't KASLR only affect virtual memory not physical memory?

KASLR for x86 (and other archs, like arm64) does both physical and virtual
base randomization.

> This just makes sure the physical memory it finds will not be used by the
> system. Then ramoops does the mapping via vmap() I believe, to get a
> virtual address to access the physical address.

I was assuming, since you were in the e820 code, that it was
manipulating that before KASLR chose a location. But if not, yeah, Mike
is right -- you need to make sure this is getting done before
decompress_kernel().
Mike Rapoport May 1, 2024, 2:57 p.m. UTC | #7
On Mon, Apr 15, 2024 at 10:22:53AM -0700, Kees Cook wrote:
> On Fri, Apr 12, 2024 at 06:19:40PM -0400, Steven Rostedt wrote:
> > On Fri, 12 Apr 2024 23:59:07 +0300
> > Mike Rapoport <rppt@kernel.org> wrote:
> > 
> > > On Tue, Apr 09, 2024 at 04:41:24PM -0700, Kees Cook wrote:
> > > > On Tue, Apr 09, 2024 at 07:11:56PM -0400, Steven Rostedt wrote:  
> > > > > On Tue, 9 Apr 2024 15:23:07 -0700
> > > > > Kees Cook <keescook@chromium.org> wrote:
> > > > >   
> > > > > > Do we need to involve e820 at all? I think it might be possible to just
> > > > > > have pstore call request_mem_region() very early? Or does KASLR make
> > > > > > that unstable?  
> > > > > 
> > > > > Yeah, would that give the same physical memory each boot, and can we
> > > > > guarantee that KASLR will not map the kernel over the previous location?  
> > > > 
> > > > Hm, no, for physical memory it needs to get excluded very early, which
> > > > means e820.  
> > > 
> > > Whatever memory is reserved in arch/x86/kernel/e820.c, that happens after
> > > kaslr, so to begin with, a new memmap parameter should also be added to
> > > parse_memmap in arch/x86/boot/compressed/kaslr.c to ensure the same
> > > physical address will be available after KASLR.
> > 
> > But doesn't KASLR only affect virtual memory not physical memory?
> 
> KASLR for x86 (and other archs, like arm64) does both physical and virtual
> base randomization.
> 
> > This just makes sure the physical memory it finds will not be used by the
> > system. Then ramoops does the mapping via vmap() I believe, to get a
> > virtual address to access the physical address.
> 
> I was assuming, since you were in the e820 code, that it was
> manipulating that before KASLR chose a location. But if not, yeah, Mike
> is right -- you need to make sure this is getting done before
> decompress_kernel().

Right now kaslr can handle up to 4 memmap regions and parse_memmap() in
arch/x86/boot/compressed/kaslr.c should be updated for a new memmap type.

But I think it's better to add a new kernel parameter as I suggested in
another email and teach mem_avoid_memmap() in kaslr.c to deal with it, as
well as with crashkernel=size@offset, btw.
 
> -- 
> Kees Cook
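
A loose sketch of the decompressor bookkeeping being discussed, modeled on the mem_avoid[] handling in arch/x86/boot/compressed/kaslr.c; the identifiers and slot layout here are approximations, not the kernel's exact code:

  /* Regions the randomized physical base must not overlap. */
  struct mem_vector {
  	unsigned long long start;
  	unsigned long long size;
  };

  static struct mem_vector mem_avoid_named[4];

  /* Record a named region so slot selection skips it, much as
   * mem_avoid_memmap() records memmap=nn$ss regions today. */
  static void mem_avoid_named_region(unsigned long long start,
  				   unsigned long long size, int idx)
  {
  	mem_avoid_named[idx].start = start;
  	mem_avoid_named[idx].size = size;
  }
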
Ard Biesheuvel May 6, 2024, 10:38 a.m. UTC | #8
On Wed, 1 May 2024 at 16:59, Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Apr 15, 2024 at 10:22:53AM -0700, Kees Cook wrote:
> > On Fri, Apr 12, 2024 at 06:19:40PM -0400, Steven Rostedt wrote:
> > > On Fri, 12 Apr 2024 23:59:07 +0300
> > > Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > > On Tue, Apr 09, 2024 at 04:41:24PM -0700, Kees Cook wrote:
> > > > > On Tue, Apr 09, 2024 at 07:11:56PM -0400, Steven Rostedt wrote:
> > > > > > On Tue, 9 Apr 2024 15:23:07 -0700
> > > > > > Kees Cook <keescook@chromium.org> wrote:
> > > > > >
> > > > > > > Do we need to involve e820 at all? I think it might be possible to just
> > > > > > > have pstore call request_mem_region() very early? Or does KASLR make
> > > > > > > that unstable?
> > > > > >
> > > > > > Yeah, would that give the same physical memory each boot, and can we
> > > > > > guarantee that KASLR will not map the kernel over the previous location?
> > > > >
> > > > > Hm, no, for physical memory it needs to get excluded very early, which
> > > > > means e820.
> > > >
> > > > Whatever memory is reserved in arch/x86/kernel/e820.c, that happens after
> > > > kaslr, so to begin with, a new memmap parameter should also be added to
> > > > parse_memmap in arch/x86/boot/compressed/kaslr.c to ensure the same
> > > > physical address will be available after KASLR.
> > >
> > > But doesn't KASLR only affect virtual memory not physical memory?
> >
> > KASLR for x86 (and other archs, like arm64) does both physical and virtual
> > base randomization.
> >
> > > This just makes sure the physical memory it finds will not be used by the
> > > system. Then ramoops does the mapping via vmap() I believe, to get a
> > > virtual address to access the physical address.
> >
> > I was assuming, since you were in the e820 code, that it was
> > manipulating that before KASLR chose a location. But if not, yeah, Mike
> > is right -- you need to make sure this is getting done before
> > decompress_kernel().
>
> Right now kaslr can handle up to 4 memmap regions and parse_memmap() in
> arch/x86/boot/compressed/kaslr.c should be updated for a new memmap type.
>
> But I think it's better to add a new kernel parameter as I suggested in
> another email and teach mem_avoid_memmap() in kaslr.c to deal with it, as
> well as with crashkernel=size@offset, btw.
>

The logic in arch/x86/boot/compressed/kaslr.c is now only used by non-EFI boot.

In general, I am highly skeptical that hopes and prayers are enough to
prevent the firmware from stepping on such a region, unless this is
only a best effort thing, and failures are acceptable. For instance,
booting an EFI system with/without an external display attached, or
with a USB device inserted (without even using it during boot) will
impact the memory map, to the extent that the E820 table derived from
it may look different. (EFI tries to keep the runtime regions in the
same place but the boot-time regions are allocated/freed on demand)

So I would strongly urge you to address this properly, and work with
firmware folks to define some kind of protocol for this.
Steven Rostedt May 8, 2024, 11:23 p.m. UTC | #9
On Mon, 6 May 2024 12:38:32 +0200
Ard Biesheuvel <ardb@kernel.org> wrote:


> The logic in arch/x86/boot/compressed/kaslr.c is now only used by non-EFI boot.
> 
> In general, I am highly skeptical that hopes and prayers are enough to
> prevent the firmware from stepping on such a region, unless this is
> only a best effort thing, and failures are acceptable. For instance,

I would be very happy with just a "best effort" approach. I think
kexec/kdump has the same issue and it hasn't been a problem in practice.

> booting an EFI system with/without an external display attached, or
> with a USB device inserted (without even using it during boot) will
> impact the memory map, to the extent that the E820 table derived from
> it may look different. (EFI tries to keep the runtime regions in the
> same place but the boot-time regions are allocated/freed on demand)

Part of my requirement was that the system is exactly the same (no
changes to hardware or even the kernel).

> 
> So I would strongly urge you to address this properly, and work with
> firmware folks to define some kind of protocol for this.

We could possibly add that later, but honestly, that is something that I
doubt would ever happen. You would have to get buy-in from all firmware
stakeholders. I'm not sure if this is a big enough use case for them to
even take a look at it.

The main use case for this work is for pstore to have crash information
of what happened up to the crash. 99.99% of the time, the firmware
or kaslr will not use the memory that was needed, and you can get very
useful information from the crash info. If the pstore is moved, it
should be able to see that the memory is garbage and just reset it.

Note that we cannot use kexec/kdump in the field for various reasons,
and I need a way to reserve memory for several different devices (both
x86 and arm).

-- Steve
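
A sketch of the "memory is garbage, just reset it" behavior described above; the header layout and magic value are hypothetical, not ramoops' actual on-RAM format:

  #include <linux/init.h>
  #include <linux/types.h>

  #define REGION_MAGIC	0x4d454d4fU	/* hypothetical signature */

  struct region_header {
  	u32	magic;
  	u32	used;
  };

  /* If the region moved since the last boot (firmware or KASLR), the
   * signature is gone: discard the old contents and start fresh. */
  static void __init region_validate(struct region_header *hdr)
  {
  	if (hdr->magic != REGION_MAGIC) {
  		hdr->magic = REGION_MAGIC;
  		hdr->used = 0;
  	}
  }
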

Patch

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index 6f1b379e3b38..a8831ef30c73 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -64,6 +64,61 @@  struct e820_table *e820_table __refdata			= &e820_table_init;
 struct e820_table *e820_table_kexec __refdata		= &e820_table_kexec_init;
 struct e820_table *e820_table_firmware __refdata	= &e820_table_firmware_init;
 
+/* For wildcard memory requests, have a table to find them later */
+#define E820_MAX_MAPS		8
+#define E820_MAP_NAME_SIZE	16
+struct e820_mmap_map {
+	char			name[E820_MAP_NAME_SIZE];
+	u64			start;
+	u64			size;
+};
+static struct e820_mmap_map e820_mmap_list[E820_MAX_MAPS] __initdata;
+static int e820_mmap_size				__initdata;
+
+/* Add wildcard region with a lookup name */
+static int __init e820_add_mmap(u64 start, u64 size, const char *name)
+{
+	struct e820_mmap_map *map;
+
+	if (!name || !name[0] || strlen(name) >= E820_MAP_NAME_SIZE)
+		return -EINVAL;
+
+	if (e820_mmap_size >= E820_MAX_MAPS)
+		return -1;
+
+	map = &e820_mmap_list[e820_mmap_size++];
+	map->start = start;
+	map->size = size;
+	strcpy(map->name, name);
+	return 0;
+}
+
+/**
+ * memmap_named - Find a wildcard region with a given name
+ * @name: The name that is attached to a wildcard region
+ * @start: If found, holds the start address
+ * @size: If found, holds the size of the region.
+ *
+ * Returns: 1 if found or 0 if not found.
+ */
+int __init memmap_named(const char *name, u64 *start, u64 *size)
+{
+	struct e820_mmap_map *map;
+	int i;
+
+	for (i = 0; i < e820_mmap_size; i++) {
+		map = &e820_mmap_list[i];
+		if (!map->size)
+			continue;
+		if (strcmp(name, map->name) == 0) {
+			*start = map->start;
+			*size = map->size;
+			return 1;
+		}
+	}
+	return 0;
+}
+
 /* For PCI or other memory-mapped resources */
 unsigned long pci_mem_start = 0xaeedbabe;
 #ifdef CONFIG_PCI
@@ -200,6 +255,29 @@  static void __init e820_print_type(enum e820_type type)
 	}
 }
 
+/*
+ * Search for usable RAM that can be reserved for a wildcard.
+ * Start at the highest memory and work down to lower memory.
+ */
+static s64 e820__region(u64 size, u64 align)
+{
+	u64 start;
+	int i;
+
+	for (i = e820_table->nr_entries - 1; i >= 0; i--) {
+		if (e820_table->entries[i].type != E820_TYPE_RAM &&
+		    e820_table->entries[i].type != E820_TYPE_RESERVED_KERN)
+			continue;
+
+		start = e820_table->entries[i].addr + e820_table->entries[i].size;
+		start -= size;
+		start = ALIGN_DOWN(start, align);
+		if (start >= e820_table->entries[i].addr)
+			return start;
+	}
+	return -1;
+}
+
 void __init e820__print_table(char *who)
 {
 	int i;
@@ -944,6 +1022,19 @@  static int __init parse_memmap_one(char *p)
 	} else if (*p == '$') {
 		start_at = memparse(p+1, &p);
 		e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
+	} else if (*p == '*') {
+		u64 align;
+		/* Followed by alignment and ':' then the name */
+		align = memparse(p+1, &p);
+		start_at = e820__region(mem_size, align);
+		if ((s64)start_at < 0)
+			return -EINVAL;
+		if (*p != ':')
+			return -EINVAL;
+		p++;
+		e820_add_mmap(start_at, mem_size, p);
+		p += strlen(p);
+		e820__range_add(start_at, mem_size, E820_TYPE_RESERVED);
 	} else if (*p == '!') {
 		start_at = memparse(p+1, &p);
 		e820__range_add(start_at, mem_size, E820_TYPE_PRAM);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0436b919f1c7..cf9b34454c6f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4202,4 +4202,6 @@  static inline bool pfn_is_unaccepted_memory(unsigned long pfn)
 	return range_contains_unaccepted_memory(paddr, paddr + PAGE_SIZE);
 }
 
+int memmap_named(const char *name, u64 *start, u64 *size);
+
 #endif /* _LINUX_MM_H */
diff --git a/mm/memory.c b/mm/memory.c
index d2155ced45f8..7a29f17df7c1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -120,6 +120,13 @@  static bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
 	return pte_marker_uffd_wp(vmf->orig_pte);
 }
 
+int __init __weak memmap_named(const char *name, u64 *start, u64 *size)
+{
+	pr_info("Kernel command line: memmap=nn*align:name not supported on this kernel\n");
+	/* zero means not found */
+	return 0;
+}
+
 /*
  * A number of key systems in x86 including ioremap() rely on the assumption
  * that high_memory defines the upper bound on direct map memory, then end