diff mbox

[v6,0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

Message ID 20240613155506.811013916@goodmis.org (mailing list archive)
State Superseded
Headers show

Commit Message

Steven Rostedt June 13, 2024, 3:55 p.m. UTC
Reserve unspecified location of physical memory from kernel command line

Background:

In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
dmesg output and some other information when a crash happens in the field.
(This is only done when the user selects "Allow Google to collect data for
 improving the system"). But there are cases when there's a bug that
requires more data to be retrieved to figure out what is happening. We would
like to increase the pstore size, either temporarily, or maybe even
permanently. The pstore on these devices are at a fixed location in RAM (as
the RAM is not cleared on soft reboots nor crashes). The location is chosen
by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
There's a driver that queries for this to initialize the pstore for
ChromeOS:

  See drivers/platform/chrome/chromeos_pstore.c

Problem:

The problem is that, even though there's a process to change the kernel on
these systems, and is done regularly to install updates, the firmware is
updated much less frequently. Choosing the place in RAM also takes special
care, and may be in a different address for different boards. Updating the
size via firmware is a large effort and not something that many are willing
to do for a temporary pstore size change.

Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be
different for different machines.

Solution:

The solution I have come up with is to introduce a new "reserve_mem=" kernel
command line. This parameter takes the following format:

  reserve_mem=nn:align:label

Where nn is the size of memory to reserve, the align is the alignment of
that memory, and label is the way for other sub-systems to find that memory.
This way the kernel command line could have:

  reserve_mem=12M:4096:oops   ramoops.mem_name=oops

At boot up, the kernel will search for 12 megabytes in usable memory regions
with an alignment of 4096. It will start at the highest regions and work its
way down (for those old devices that want access to lower address DMA). When
it finds a region, it will save it off in a small table and mark it with the
"oops" label. Then the pstore ramoops sub-system could ask for that memory
and location, and it will map itself there.

This prototype allows for 8 different mappings (which may be overkill, 4 is
probably plenty) with 16 byte size to store the label.

I have tested this and it works for us to solve the above problem. We can
update the kernel and command line and increase the size of pstore without
needing to update the firmware, or knowing every memory layout of each
board. I only tested this locally, it has not been tested in the field.

Changes since v5: https://lore.kernel.org/all/20240613003435.401549779@goodmis.org/

[ patch at bottom showing differences ]

- Stressed more that this is a best effort use case

- Updated ramoops.rst to document this new feature

- Used a new variable "tmp" to use in reserve_mem_find_by_name() instead
  of using "size" and possibly corrupting it.

Changes since v4: https://lore.kernel.org/all/20240611215610.548854415@goodmis.org/

- Add all checks about reserve_mem before allocation.
  This means reserved_mem_add() is now a void function.

- Check for name duplications.

- Fix compare of align to SMP_CACHE_BYTES ("<" instead of "<=")

Changes since v3: https://lore.kernel.org/all/20240611144911.327227285@goodmis.org/

- Changed table type of start and size from unsigned long to phys_addr_t
  (as well as the parameters to the functions that use them)

- Changed old reference to "early_reserve_mem" to "reserve_mem"

- Check before reservering memory:
  o Size is non-zero
  o name has text in it

- If align is less than SMP_CACHE_BYTES, make it SMP_CACHE_BYTES

- Remove the silly check of testing *p == '\0' after a p += strlen(p)

Changes since v2: https://lore.kernel.org/all/20240606150143.876469296@goodmis.org/

- Fixed typo of "reserver"

- Added EXPORT_SYMBOL_GPL() for reserve_mem_find_by_name()

- Removed "built-in" from module description that was changed from v1.


Changes since v1: https://lore.kernel.org/all/20240603233330.801075898@goodmis.org/

- Updated the change log of the first patch as well as added an entry
  into kernel-parameters.txt about how reserve_mem is for soft reboots
  and may not be reliable. 


Steven Rostedt (Google) (2):
      mm/memblock: Add "reserve_mem" to reserved named memory at boot up
      pstore/ramoops: Add ramoops.mem_name= command line option

----
 Documentation/admin-guide/kernel-parameters.txt |  22 +++++
 Documentation/admin-guide/ramoops.rst           |  13 +++
 fs/pstore/ram.c                                 |  14 +++
 include/linux/mm.h                              |   2 +
 mm/memblock.c                                   | 117 ++++++++++++++++++++++++
 5 files changed, 168 insertions(+)

Comments

Alexander Graf June 13, 2024, 4:54 p.m. UTC | #1
Hey Steve,

On 13.06.24 17:55, Steven Rostedt wrote:
> Reserve unspecified location of physical memory from kernel command line
>
> Background:
>
> In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
> dmesg output and some other information when a crash happens in the field.
> (This is only done when the user selects "Allow Google to collect data for
>   improving the system"). But there are cases when there's a bug that
> requires more data to be retrieved to figure out what is happening. We would
> like to increase the pstore size, either temporarily, or maybe even
> permanently. The pstore on these devices are at a fixed location in RAM (as
> the RAM is not cleared on soft reboots nor crashes). The location is chosen
> by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
> There's a driver that queries for this to initialize the pstore for
> ChromeOS:
>
>    See drivers/platform/chrome/chromeos_pstore.c
>
> Problem:
>
> The problem is that, even though there's a process to change the kernel on
> these systems, and is done regularly to install updates, the firmware is
> updated much less frequently. Choosing the place in RAM also takes special
> care, and may be in a different address for different boards. Updating the
> size via firmware is a large effort and not something that many are willing
> to do for a temporary pstore size change.


(sorry for not commenting on earlier versions, I didn't see v1-v5 in my 
inbox)

Do you have a "real" pstore on these systems that you could store 
non-volatile variables in, such as persistent UEFI variables? If so, you 
could create an actually persistent mapping for your trace pstore even 
across kernel version updates as a general mechanism to create reserved 
memblocks at fixed offsets.


> Requirement:
>
> Need a way to reserve memory that will be at a consistent location for
> every boot, if the kernel and system are the same. Does not need to work
> if rebooting to a different kernel, or if the system can change the
> memory layout between boots.
>
> The reserved memory can not be an hard coded address, as the same kernel /
> command line needs to run on several different machines. The picked memory
> reservation just needs to be the same for a given machine, but may be


With KASLR is enabled, doesn't this approach break too often to be 
reliable enough for the data you want to extract?

Picking up the idea above, with a persistent variable we could even make 
KASLR avoid that reserved pstore region in its search for a viable KASLR 
offset.


Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Steven Rostedt June 13, 2024, 5:12 p.m. UTC | #2
On Thu, 13 Jun 2024 18:54:12 +0200
Alexander Graf <graf@amazon.com> wrote:

> 
> Do you have a "real" pstore on these systems that you could store 
> non-volatile variables in, such as persistent UEFI variables? If so, you 
> could create an actually persistent mapping for your trace pstore even 
> across kernel version updates as a general mechanism to create reserved 
> memblocks at fixed offsets.

After implementing all this, I don't think I can use pstore for my
purpose. pstore is a generic interface for persistent storage, and
requires an interface to access it. From what I understand, it's not
the place to just ask for an area of RAM.

For this, I have a single patch that allows the tracing instance to use
an area reserved by reserve_mem.

  reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace

I've already tested this on qemu and a couple of chromebooks. It works
well.

> 
> 
> > Requirement:
> >
> > Need a way to reserve memory that will be at a consistent location for
> > every boot, if the kernel and system are the same. Does not need to work
> > if rebooting to a different kernel, or if the system can change the
> > memory layout between boots.
> >
> > The reserved memory can not be an hard coded address, as the same kernel /
> > command line needs to run on several different machines. The picked memory
> > reservation just needs to be the same for a given machine, but may be  
> 
> 
> With KASLR is enabled, doesn't this approach break too often to be 
> reliable enough for the data you want to extract?
> 
> Picking up the idea above, with a persistent variable we could even make 
> KASLR avoid that reserved pstore region in its search for a viable KASLR 
> offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.

-- Steve
Alexander Graf June 17, 2024, 7:07 a.m. UTC | #3
Hey Steve,

On 13.06.24 19:12, Steven Rostedt wrote:
> On Thu, 13 Jun 2024 18:54:12 +0200
> Alexander Graf <graf@amazon.com> wrote:
>
>> Do you have a "real" pstore on these systems that you could store
>> non-volatile variables in, such as persistent UEFI variables? If so, you
>> could create an actually persistent mapping for your trace pstore even
>> across kernel version updates as a general mechanism to create reserved
>> memblocks at fixed offsets.
> After implementing all this, I don't think I can use pstore for my
> purpose. pstore is a generic interface for persistent storage, and
> requires an interface to access it. From what I understand, it's not
> the place to just ask for an area of RAM.
>
> For this, I have a single patch that allows the tracing instance to use
> an area reserved by reserve_mem.
>
>    reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace
>
> I've already tested this on qemu and a couple of chromebooks. It works
> well.


I believe we're talking about 2 different things :). Let me rephrase a 
bit and make a concrete example.

Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command 
line option. The kernel now comes up and allocates a random chunk of 
memory that - by (admittedly good) chance - may be at the same physical 
location as before. Let's assume it deemed 0x1000000 as a good offset.

Let's now assume you're running on a UEFI system. There, you always have 
non-volatile storage available to you even in the pre-boot phase. That 
means the kernel could create a UEFI variable that says "12M:4096:trace 
-> 0x1000000". The pre-boot phase takes all these UEFI variables and 
marks them as reserved. When you finally reach your command line parsing 
logic for reserve_mem=, you can flip all reservations that were not on 
the command line back to normal memory.

That way you have pretty much guaranteed persistent memory regions, even 
with KASLR changing your memory layout across boots.

The nice thing is that the above is an extension of what you've already 
built: Systems with UEFI simply get better guarantees that their regions 
persist.


>
>>
>>> Requirement:
>>>
>>> Need a way to reserve memory that will be at a consistent location for
>>> every boot, if the kernel and system are the same. Does not need to work
>>> if rebooting to a different kernel, or if the system can change the
>>> memory layout between boots.
>>>
>>> The reserved memory can not be an hard coded address, as the same kernel /
>>> command line needs to run on several different machines. The picked memory
>>> reservation just needs to be the same for a given machine, but may be
>>
>> With KASLR is enabled, doesn't this approach break too often to be
>> reliable enough for the data you want to extract?
>>
>> Picking up the idea above, with a persistent variable we could even make
>> KASLR avoid that reserved pstore region in its search for a viable KASLR
>> offset.
> I think I was hit by it once in all my testing. For our use case, the
> few times it fails to map is not going to affect what we need this for
> at all.


Once is pretty good. Do you know why? Also once out of how many runs? Is 
the randomness source not as random as it should be or are the number of 
bits for KASLR maybe so few on your target architecture that the odds of 
hitting anything become low? Do these same constraints hold true outside 
of your testing environment?


Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Steven Rostedt June 17, 2024, 8:40 p.m. UTC | #4
On Mon, 17 Jun 2024 09:07:29 +0200
Alexander Graf <graf@amazon.com> wrote:

> Hey Steve,
> 
> 
> I believe we're talking about 2 different things :). Let me rephrase a 
> bit and make a concrete example.
> 
> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command 
> line option. The kernel now comes up and allocates a random chunk of 
> memory that - by (admittedly good) chance - may be at the same physical 
> location as before. Let's assume it deemed 0x1000000 as a good offset.

Note, it's not random. It picks from the top of available memory every
time. But things can mess with it (see below).

> 
> Let's now assume you're running on a UEFI system. There, you always have 
> non-volatile storage available to you even in the pre-boot phase. That 
> means the kernel could create a UEFI variable that says "12M:4096:trace 
> -> 0x1000000". The pre-boot phase takes all these UEFI variables and   
> marks them as reserved. When you finally reach your command line parsing 
> logic for reserve_mem=, you can flip all reservations that were not on 
> the command line back to normal memory.
> 
> That way you have pretty much guaranteed persistent memory regions, even 
> with KASLR changing your memory layout across boots.
> 
> The nice thing is that the above is an extension of what you've already 
> built: Systems with UEFI simply get better guarantees that their regions 
> persist.

This could be an added feature, but it is very architecture specific,
and would likely need architecture specific updates.

> 
> 
> >  
> >>  
> >>> Requirement:
> >>>
> >>> Need a way to reserve memory that will be at a consistent location for
> >>> every boot, if the kernel and system are the same. Does not need to work
> >>> if rebooting to a different kernel, or if the system can change the
> >>> memory layout between boots.
> >>>
> >>> The reserved memory can not be an hard coded address, as the same kernel /
> >>> command line needs to run on several different machines. The picked memory
> >>> reservation just needs to be the same for a given machine, but may be  
> >>
> >> With KASLR is enabled, doesn't this approach break too often to be
> >> reliable enough for the data you want to extract?
> >>
> >> Picking up the idea above, with a persistent variable we could even make
> >> KASLR avoid that reserved pstore region in its search for a viable KASLR
> >> offset.  
> > I think I was hit by it once in all my testing. For our use case, the
> > few times it fails to map is not going to affect what we need this for
> > at all.  
> 
> 
> Once is pretty good. Do you know why? Also once out of how many runs? Is 
> the randomness source not as random as it should be or are the number of 
> bits for KASLR maybe so few on your target architecture that the odds of 
> hitting anything become low? Do these same constraints hold true outside 
> of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print
the location of "_text". The loop was this:

  for i in `seq 100`; do
	ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot'  >> text; grub-reboot '1>0'; sleep 0.5; reboot"
	sleep 25
  done

It searches dmesg for my added printk as well as the print of were the
ring buffer was loaded in physical memory.

It takes about 15 seconds to reboot, so I waited 25. The results are
attached. I found that it was consistent 76 times, which means 1 out of
4 it's not. Funny enough, it broke whenever it loaded the kernel below
0x100000000. And then it would be off by a little.

It was consistently at:

  0x27d000000

And when it failed, it was at 0x27ce00000.

Note, when I used the e820 tables to do this, I never saw a failure. My
assumption is that when it is below 0x100000000, something else gets
allocated causing this to get pushed down.

As this code relies on memblock_phys_alloc() being consistent, if
something gets allocated before it differently depending on where the
kernel is, it can also move the location. A plugin to UEFI would mean
that it would need to reserve the memory, and the code here will need
to know where it is. We could always make the function reserve_mem()
global and weak so that architectures can override it.

-- Steve
Alexander Graf June 17, 2024, 9:01 p.m. UTC | #5
[resend because Thunderbird decided to send the previous version as HTML :(]


On 17.06.24 22:40, Steven Rostedt wrote:
> On Mon, 17 Jun 2024 09:07:29 +0200
> Alexander Graf<graf@amazon.com>  wrote:
>
>> Hey Steve,
>>
>>
>> I believe we're talking about 2 different things :). Let me rephrase a
>> bit and make a concrete example.
>>
>> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
>> line option. The kernel now comes up and allocates a random chunk of
>> memory that - by (admittedly good) chance - may be at the same physical
>> location as before. Let's assume it deemed 0x1000000 as a good offset.
> Note, it's not random. It picks from the top of available memory every
> time. But things can mess with it (see below).
>
>> Let's now assume you're running on a UEFI system. There, you always have
>> non-volatile storage available to you even in the pre-boot phase. That
>> means the kernel could create a UEFI variable that says "12M:4096:trace
>> -> 0x1000000". The pre-boot phase takes all these UEFI variables and
>> marks them as reserved. When you finally reach your command line parsing
>> logic for reserve_mem=, you can flip all reservations that were not on
>> the command line back to normal memory.
>>
>> That way you have pretty much guaranteed persistent memory regions, even
>> with KASLR changing your memory layout across boots.
>>
>> The nice thing is that the above is an extension of what you've already
>> built: Systems with UEFI simply get better guarantees that their regions
>> persist.
> This could be an added feature, but it is very architecture specific,
> and would likely need architecture specific updates.


It definitely would be an added feature, yes. But one that allows you to 
ensure persistence a lot more safely :).


>>>>> Requirement:
>>>>>
>>>>> Need a way to reserve memory that will be at a consistent location for
>>>>> every boot, if the kernel and system are the same. Does not need to work
>>>>> if rebooting to a different kernel, or if the system can change the
>>>>> memory layout between boots.
>>>>>
>>>>> The reserved memory can not be an hard coded address, as the same kernel /
>>>>> command line needs to run on several different machines. The picked memory
>>>>> reservation just needs to be the same for a given machine, but may be
>>>> With KASLR is enabled, doesn't this approach break too often to be
>>>> reliable enough for the data you want to extract?
>>>>
>>>> Picking up the idea above, with a persistent variable we could even make
>>>> KASLR avoid that reserved pstore region in its search for a viable KASLR
>>>> offset.
>>> I think I was hit by it once in all my testing. For our use case, the
>>> few times it fails to map is not going to affect what we need this for
>>> at all.
>> Once is pretty good. Do you know why? Also once out of how many runs? Is
>> the randomness source not as random as it should be or are the number of
>> bits for KASLR maybe so few on your target architecture that the odds of
>> hitting anything become low? Do these same constraints hold true outside
>> of your testing environment?
> So I just ran it a hundred times in a loop. I added a patch to print
> the location of "_text". The loop was this:
>
>    for i in `seq 100`; do
>          ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot'  >> text; grub-reboot '1>0'; sleep 0.5; reboot"
>          sleep 25
>    done
>
> It searches dmesg for my added printk as well as the print of were the
> ring buffer was loaded in physical memory.
>
> It takes about 15 seconds to reboot, so I waited 25. The results are
> attached. I found that it was consistent 76 times, which means 1 out of
> 4 it's not. Funny enough, it broke whenever it loaded the kernel below
> 0x100000000. And then it would be off by a little.
>
> It was consistently at:
>
>    0x27d000000
>
> And when it failed, it was at 0x27ce00000.
>
> Note, when I used the e820 tables to do this, I never saw a failure. My
> assumption is that when it is below 0x100000000, something else gets
> allocated causing this to get pushed down.


Thinking about it again: What if you run the allocation super early (see 
arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to 
allocating only from top, you're effectively kernel version independent 
for your allocations because none of the kernel code ran yet and 
definitely KASLR independent because you're running deterministically 
before KASLR even gets allocated.

> As this code relies on memblock_phys_alloc() being consistent, if
> something gets allocated before it differently depending on where the
> kernel is, it can also move the location. A plugin to UEFI would mean
> that it would need to reserve the memory, and the code here will need
> to know where it is. We could always make the function reserve_mem()
> global and weak so that architectures can override it.


Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new 
type of memblock with the respective reservations and you later call 
memblock_find_in_range_node() instead of memblock_phys_alloc() to pass 
in flags that you want to allocate only from the new 
MEMBLOCK_RESERVE_MEM type. The same model would work for BIOS boots 
through the handle_mem_options() path above. In fact, if the BIOS way 
works fine, we don't even need UEFI variables: The same way allocations 
will be identical during BIOS execution, they should stay identical 
across UEFI launches.

As cherry on top, kexec also works seamlessly with the special memblock 
approach because kexec (at least on x86) hands memblocks as is to the 
next kernel. So the new kernel will also automatically use the same 
ranges for its allocations.


Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Steven Rostedt June 17, 2024, 9:18 p.m. UTC | #6
On Mon, 17 Jun 2024 23:01:12 +0200
Alexander Graf <graf@amazon.com> wrote:
> > This could be an added feature, but it is very architecture specific,
> > and would likely need architecture specific updates.  
> 
> 
> It definitely would be an added feature, yes. But one that allows you to 
> ensure persistence a lot more safely :).

Sure.

> 
> Thinking about it again: What if you run the allocation super early (see 
> arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to 
> allocating only from top, you're effectively kernel version independent 
> for your allocations because none of the kernel code ran yet and 
> definitely KASLR independent because you're running deterministically 
> before KASLR even gets allocated.
> 
> > As this code relies on memblock_phys_alloc() being consistent, if
> > something gets allocated before it differently depending on where the
> > kernel is, it can also move the location. A plugin to UEFI would mean
> > that it would need to reserve the memory, and the code here will need
> > to know where it is. We could always make the function reserve_mem()
> > global and weak so that architectures can override it.  
> 
> 
> Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new 
> type of memblock with the respective reservations and you later call 
> memblock_find_in_range_node() instead of memblock_phys_alloc() to pass 
> in flags that you want to allocate only from the new 
> MEMBLOCK_RESERVE_MEM type. The same model would work for BIOS boots 
> through the handle_mem_options() path above. In fact, if the BIOS way 
> works fine, we don't even need UEFI variables: The same way allocations 
> will be identical during BIOS execution, they should stay identical 
> across UEFI launches.
> 
> As cherry on top, kexec also works seamlessly with the special memblock 
> approach because kexec (at least on x86) hands memblocks as is to the 
> next kernel. So the new kernel will also automatically use the same 
> ranges for its allocations.

I'm all for expanding this. But I would just want to get this in for
now as is. It theoretically works on all architectures. If someone
wants to make in more robust and accurate on a specific architecture,
I'm all for it. Like I said, we could make the reserver_mem() function
global and weak, and then if an architecture has a better way to handle
this, it could use that.

Hmm, x86 could do this with the e820 code like I did in my first
versions. Like I said, it didn't fail at all with that.

And we can have an UEFI version as well.

-- Steve
Ard Biesheuvel June 18, 2024, 10:21 a.m. UTC | #7
On Mon, 17 Jun 2024 at 23:01, Alexander Graf <graf@amazon.com> wrote:
>
> On 17.06.24 22:40, Steven Rostedt wrote:
> > On Mon, 17 Jun 2024 09:07:29 +0200
> > Alexander Graf<graf@amazon.com>  wrote:
> >
> >> Hey Steve,
> >>
> >>
> >> I believe we're talking about 2 different things :). Let me rephrase a
> >> bit and make a concrete example.
> >>
> >> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
> >> line option. The kernel now comes up and allocates a random chunk of
> >> memory that - by (admittedly good) chance - may be at the same physical
> >> location as before. Let's assume it deemed 0x1000000 as a good offset.
> > Note, it's not random. It picks from the top of available memory every
> > time. But things can mess with it (see below).
> >
> >> Let's now assume you're running on a UEFI system. There, you always have
> >> non-volatile storage available to you even in the pre-boot phase. That
> >> means the kernel could create a UEFI variable that says "12M:4096:trace
> >> -> 0x1000000". The pre-boot phase takes all these UEFI variables and
> >> marks them as reserved. When you finally reach your command line parsing
> >> logic for reserve_mem=, you can flip all reservations that were not on
> >> the command line back to normal memory.
> >>
> >> That way you have pretty much guaranteed persistent memory regions, even
> >> with KASLR changing your memory layout across boots.
> >>
> >> The nice thing is that the above is an extension of what you've already
> >> built: Systems with UEFI simply get better guarantees that their regions
> >> persist.
> > This could be an added feature, but it is very architecture specific,
> > and would likely need architecture specific updates.
>
>
> It definitely would be an added feature, yes. But one that allows you to
> ensure persistence a lot more safely :).
>
>
> >>>>> Requirement:
> >>>>>
> >>>>> Need a way to reserve memory that will be at a consistent location for
> >>>>> every boot, if the kernel and system are the same. Does not need to work
> >>>>> if rebooting to a different kernel, or if the system can change the
> >>>>> memory layout between boots.
> >>>>>
> >>>>> The reserved memory can not be an hard coded address, as the same kernel /
> >>>>> command line needs to run on several different machines. The picked memory
> >>>>> reservation just needs to be the same for a given machine, but may be
> >>>> With KASLR is enabled, doesn't this approach break too often to be
> >>>> reliable enough for the data you want to extract?
> >>>>
> >>>> Picking up the idea above, with a persistent variable we could even make
> >>>> KASLR avoid that reserved pstore region in its search for a viable KASLR
> >>>> offset.
> >>> I think I was hit by it once in all my testing. For our use case, the
> >>> few times it fails to map is not going to affect what we need this for
> >>> at all.
> >> Once is pretty good. Do you know why? Also once out of how many runs? Is
> >> the randomness source not as random as it should be or are the number of
> >> bits for KASLR maybe so few on your target architecture that the odds of
> >> hitting anything become low? Do these same constraints hold true outside
> >> of your testing environment?
> > So I just ran it a hundred times in a loop. I added a patch to print
> > the location of "_text". The loop was this:
> >
> >    for i in `seq 100`; do
> >          ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot'  >> text; grub-reboot '1>0'; sleep 0.5; reboot"
> >          sleep 25
> >    done
> >
> > It searches dmesg for my added printk as well as the print of were the
> > ring buffer was loaded in physical memory.
> >
> > It takes about 15 seconds to reboot, so I waited 25. The results are
> > attached. I found that it was consistent 76 times, which means 1 out of
> > 4 it's not. Funny enough, it broke whenever it loaded the kernel below
> > 0x100000000. And then it would be off by a little.
> >
> > It was consistently at:
> >
> >    0x27d000000
> >
> > And when it failed, it was at 0x27ce00000.
> >
> > Note, when I used the e820 tables to do this, I never saw a failure. My
> > assumption is that when it is below 0x100000000, something else gets
> > allocated causing this to get pushed down.
>
>
> Thinking about it again: What if you run the allocation super early (see
> arch/x86/boot/compressed/kaslr.c:handle_mem_options())?

That code is not used by EFI boot anymore.

In general, I would recommend (and have recommended) against these
kinds of hacks in mainline, because -especially on x86- there is
always someone that turns up with some kind of convoluted use case
that gets broken if we try to change anything in the boot code.

I spent considerable time over the past year making the EFI/x86 boot
code compatible with the new MS imposed requirements on PC boot
firmware (related to secure boot and NX restrictions on memory
mappings). This involved some radical refactoring of the boot
sequence, including the KASLR logic. Adding fragile code there that
will result in regressions observable to end users when it gets broken
is really not what I'd like to see.

So I would personally prefer for this code not to go in at all. But if
it does go in (and Steven has already agreed to this), it needs a
giant disclaimer that it is best effort and may get broken
inadvertently by changes that may seem unrelated.

> If you stick to
> allocating only from top, you're effectively kernel version independent
> for your allocations because none of the kernel code ran yet and
> definitely KASLR independent because you're running deterministically
> before KASLR even gets allocated.
>

Allocating top down under EFI is almost guaranteed to result in
problems, because that is how the EFI page allocator works as well.
This means that allocations will move around depending on, e.g.,
whether some USB stick was inserted on the first boot and removed on
the second, or whether your external display was on or off.

> > As this code relies on memblock_phys_alloc() being consistent, if
> > something gets allocated before it differently depending on where the
> > kernel is, it can also move the location. A plugin to UEFI would mean
> > that it would need to reserve the memory, and the code here will need
> > to know where it is. We could always make the function reserve_mem()
> > global and weak so that architectures can override it.
>
>
> Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new
> type of memblock with the respective reservations and you later call
> memblock_find_in_range_node() instead of memblock_phys_alloc() to pass
> in flags that you want to allocate only from the new
> MEMBLOCK_RESERVE_MEM type.

This would require the introduction of a new kind of memory map for
the handover between EFI stub and core kernel. Currently, x86 relies
on E820 and other architectures pass the EFI memory map unmodified.
Alexander Graf June 18, 2024, 11:47 a.m. UTC | #8
On 18.06.24 12:21, Ard Biesheuvel wrote:
> On Mon, 17 Jun 2024 at 23:01, Alexander Graf <graf@amazon.com> wrote:
>> On 17.06.24 22:40, Steven Rostedt wrote:
>>> On Mon, 17 Jun 2024 09:07:29 +0200
>>> Alexander Graf<graf@amazon.com>  wrote:
>>>
>>>> Hey Steve,
>>>>
>>>>
>>>> I believe we're talking about 2 different things :). Let me rephrase a
>>>> bit and make a concrete example.
>>>>
>>>> Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
>>>> line option. The kernel now comes up and allocates a random chunk of
>>>> memory that - by (admittedly good) chance - may be at the same physical
>>>> location as before. Let's assume it deemed 0x1000000 as a good offset.
>>> Note, it's not random. It picks from the top of available memory every
>>> time. But things can mess with it (see below).
>>>
>>>> Let's now assume you're running on a UEFI system. There, you always have
>>>> non-volatile storage available to you even in the pre-boot phase. That
>>>> means the kernel could create a UEFI variable that says "12M:4096:trace
>>>> -> 0x1000000". The pre-boot phase takes all these UEFI variables and
>>>> marks them as reserved. When you finally reach your command line parsing
>>>> logic for reserve_mem=, you can flip all reservations that were not on
>>>> the command line back to normal memory.
>>>>
>>>> That way you have pretty much guaranteed persistent memory regions, even
>>>> with KASLR changing your memory layout across boots.
>>>>
>>>> The nice thing is that the above is an extension of what you've already
>>>> built: Systems with UEFI simply get better guarantees that their regions
>>>> persist.
>>> This could be an added feature, but it is very architecture specific,
>>> and would likely need architecture specific updates.
>>
>> It definitely would be an added feature, yes. But one that allows you to
>> ensure persistence a lot more safely :).
>>
>>
>>>>>>> Requirement:
>>>>>>>
>>>>>>> Need a way to reserve memory that will be at a consistent location for
>>>>>>> every boot, if the kernel and system are the same. Does not need to work
>>>>>>> if rebooting to a different kernel, or if the system can change the
>>>>>>> memory layout between boots.
>>>>>>>
>>>>>>> The reserved memory can not be an hard coded address, as the same kernel /
>>>>>>> command line needs to run on several different machines. The picked memory
>>>>>>> reservation just needs to be the same for a given machine, but may be
>>>>>> With KASLR is enabled, doesn't this approach break too often to be
>>>>>> reliable enough for the data you want to extract?
>>>>>>
>>>>>> Picking up the idea above, with a persistent variable we could even make
>>>>>> KASLR avoid that reserved pstore region in its search for a viable KASLR
>>>>>> offset.
>>>>> I think I was hit by it once in all my testing. For our use case, the
>>>>> few times it fails to map is not going to affect what we need this for
>>>>> at all.
>>>> Once is pretty good. Do you know why? Also once out of how many runs? Is
>>>> the randomness source not as random as it should be or are the number of
>>>> bits for KASLR maybe so few on your target architecture that the odds of
>>>> hitting anything become low? Do these same constraints hold true outside
>>>> of your testing environment?
>>> So I just ran it a hundred times in a loop. I added a patch to print
>>> the location of "_text". The loop was this:
>>>
>>>     for i in `seq 100`; do
>>>           ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot'  >> text; grub-reboot '1>0'; sleep 0.5; reboot"
>>>           sleep 25
>>>     done
>>>
>>> It searches dmesg for my added printk as well as the print of were the
>>> ring buffer was loaded in physical memory.
>>>
>>> It takes about 15 seconds to reboot, so I waited 25. The results are
>>> attached. I found that it was consistent 76 times, which means 1 out of
>>> 4 it's not. Funny enough, it broke whenever it loaded the kernel below
>>> 0x100000000. And then it would be off by a little.
>>>
>>> It was consistently at:
>>>
>>>     0x27d000000
>>>
>>> And when it failed, it was at 0x27ce00000.
>>>
>>> Note, when I used the e820 tables to do this, I never saw a failure. My
>>> assumption is that when it is below 0x100000000, something else gets
>>> allocated causing this to get pushed down.
>>
>> Thinking about it again: What if you run the allocation super early (see
>> arch/x86/boot/compressed/kaslr.c:handle_mem_options())?
> That code is not used by EFI boot anymore.
>
> In general, I would recommend (and have recommended) against these
> kinds of hacks in mainline, because -especially on x86- there is
> always someone that turns up with some kind of convoluted use case
> that gets broken if we try to change anything in the boot code.
>
> I spent considerable time over the past year making the EFI/x86 boot
> code compatible with the new MS imposed requirements on PC boot
> firmware (related to secure boot and NX restrictions on memory
> mappings). This involved some radical refactoring of the boot
> sequence, including the KASLR logic. Adding fragile code there that
> will result in regressions observable to end users when it gets broken
> is really not what I'd like to see.
>
> So I would personally prefer for this code not to go in at all. But if
> it does go in (and Steven has already agreed to this), it needs a
> giant disclaimer that it is best effort and may get broken
> inadvertently by changes that may seem unrelated.


Alright, happy to rest my case about making it more reliable for now 
then :).

IMHO the big fat disclaimer should be in the argument name. 
"reserve_mem" to me sounds like it actually guarantees a reservation - 
which it doesn't. Can we name it more along the lines of "debug" (to 
indicate it's not for production data) or "phoenix" (usually gets reborn 
out of ashes, but you can never know for sure): "debug_mem", / 
"phoenix_mem"?


>
>> If you stick to
>> allocating only from top, you're effectively kernel version independent
>> for your allocations because none of the kernel code ran yet and
>> definitely KASLR independent because you're running deterministically
>> before KASLR even gets allocated.
>>
> Allocating top down under EFI is almost guaranteed to result in
> problems, because that is how the EFI page allocator works as well.
> This means that allocations will move around depending on, e.g.,
> whether some USB stick was inserted on the first boot and removed on
> the second, or whether your external display was on or off.


I believe most UEFI implementations only allocate top down in the lower 
32bits. But yes, it's fragile, I hear you. Let's embrace the flaky 
nature of the beast then :).


Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597
Steven Rostedt June 18, 2024, 3:43 p.m. UTC | #9
On Tue, 18 Jun 2024 13:47:49 +0200
Alexander Graf <graf@amazon.com> wrote:

> IMHO the big fat disclaimer should be in the argument name. 
> "reserve_mem" to me sounds like it actually guarantees a reservation - 
> which it doesn't. Can we name it more along the lines of "debug" (to 
> indicate it's not for production data) or "phoenix" (usually gets reborn 
> out of ashes, but you can never know for sure): "debug_mem", / 
> "phoenix_mem"?

I don't see any reason it will not reserve memory. That doesn't seem to
be the issue.  What is not guaranteed is that it will be in the same
location as last time. So the name is correct. It's not
"reserve_consistent_memory" ;-)

-- Steve
Mike Rapoport June 19, 2024, 8:02 a.m. UTC | #10
On Tue, Jun 18, 2024 at 12:21:19PM +0200, Ard Biesheuvel wrote:
> On Mon, 17 Jun 2024 at 23:01, Alexander Graf <graf@amazon.com> wrote:
>
> So I would personally prefer for this code not to go in at all. But if
> it does go in (and Steven has already agreed to this), it needs a
> giant disclaimer that it is best effort and may get broken
> inadvertently by changes that may seem unrelated.

I think that reserve_mem= is way less intrusive than memmap= we anyway
have.
It will reserve memory and the documentation adequately warns that the
location of that memory might be at different physical addresses.
Mike Rapoport June 19, 2024, 9:53 p.m. UTC | #11
From: Mike Rapoport (IBM) <rppt@kernel.org>

On Thu, 13 Jun 2024 11:55:06 -0400, Steven Rostedt wrote:
> Reserve unspecified location of physical memory from kernel command line
> 
> Background:
> 
> In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
> dmesg output and some other information when a crash happens in the field.
> (This is only done when the user selects "Allow Google to collect data for
>  improving the system"). But there are cases when there's a bug that
> requires more data to be retrieved to figure out what is happening. We would
> like to increase the pstore size, either temporarily, or maybe even
> permanently. The pstore on these devices are at a fixed location in RAM (as
> the RAM is not cleared on soft reboots nor crashes). The location is chosen
> by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
> There's a driver that queries for this to initialize the pstore for
> ChromeOS:
> 
> [...]

Applied to for-next branch of memblock.git tree, thanks!

[0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
      commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332
[1/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up
      commit: 1e4c64b71c9bf230b25fde12cbcceacfdc8b3332
[2/2] pstore/ramoops: Add ramoops.mem_name= command line option
      commit: d9d814eebb1ae9742e7fd7f39730653b16326bd4

tree: https://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
branch: for-next

--
Sincerely yours,
Mike.
diff mbox

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ce7de8136f2f..56e18b1a520d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5717,9 +5717,11 @@ 
 			used for systems that do not wipe the RAM, and this command
 			line will try to reserve the same physical memory on
 			soft reboots. Note, it is not guaranteed to be the same
-			location. For example, if KASLR places the kernel at the
-			location of where the RAM reservation was from a previous
-			boot, the new reservation will be at a different location.
+			location. For example, if anything about the system changes
+			or if booting a different kernel. It can also fail if KASLR
+			places the kernel at the location of where the RAM reservation
+			was from a previous boot, the new reservation will be at a
+			different location.
 			Any subsystem using this feature must add a way to verify
 			that the contents of the physical memory is from a previous
 			boot, as there may be cases where the memory will not be
diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst
index e9f85142182d..6f534a707b2a 100644
--- a/Documentation/admin-guide/ramoops.rst
+++ b/Documentation/admin-guide/ramoops.rst
@@ -23,6 +23,8 @@  and type of the memory area are set using three variables:
   * ``mem_size`` for the size. The memory size will be rounded down to a
     power of two.
   * ``mem_type`` to specify if the memory type (default is pgprot_writecombine).
+  * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command
+    line parameter.
 
 Typically the default value of ``mem_type=0`` should be used as that sets the pstore
 mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use
@@ -118,6 +120,17 @@  Setting the ramoops parameters can be done in several different manners:
 	return ret;
   }
 
+ D. Using a region of memory reserved via ``reserve_mem`` command line
+    parameter. The address and size will be defined by the ``reserve_mem``
+    parameter. Note, that ``reserve_mem`` may not always allocate memory
+    in the same location, and cannot be relied upon. Testing will need
+    to be done, and it may not work on every machine, nor every kernel.
+    Consider this a "best effort" approach. The ``reserve_mem`` option
+    takes a size, alignment and name as arguments. The name is used
+    to map the memory to a label that can be retrieved by ramoops.
+
+	reserver_mem=2M:4096:oops  ramoops.mem_name=oops
+
 You can specify either RAM memory or peripheral devices' memory. However, when
 specifying RAM, be sure to reserve the memory by issuing memblock_reserve()
 very early in the architecture code, e.g.::
diff --git a/mm/memblock.c b/mm/memblock.c
index 739d106a9165..b7b0e8c3868d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -2301,7 +2301,7 @@  EXPORT_SYMBOL_GPL(reserve_mem_find_by_name);
  */
 static int __init reserve_mem(char *p)
 {
-	phys_addr_t start, size, align;
+	phys_addr_t start, size, align, tmp;
 	char *name;
 	char *oldp;
 	int len;
@@ -2347,8 +2347,8 @@  static int __init reserve_mem(char *p)
 	if (!*p)
 		return -EINVAL;
 
-	/* Make sure the name is not already used (size is only updated if found) */
-	if (reserve_mem_find_by_name(name, &start, &size))
+	/* Make sure the name is not already used */
+	if (reserve_mem_find_by_name(name, &start, &tmp))
 		return -EBUSY;
 
 	start = memblock_phys_alloc(size, align);