[00/15] kexec: Allow preservation of ftrace buffers

Message ID 20231213000452.88295-1-graf@amazon.com (mailing list archive)

Message

Alexander Graf Dec. 13, 2023, 12:04 a.m. UTC
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See James' and my Linux Plumbers
Conference 2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this
patch set implements basic infrastructure to allow hand over of kernel
state across kexec (Kexec HandOver, aka KHO). As an example target, we use
ftrace:
With this patch set applied, you can read ftrace records from the
pre-kexec environment in your post-kexec one. This creates a very powerful
debugging and performance analysis tool for kexec. It's also slightly
easier to reason about than full blown VFIO state preservation.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
                pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
                  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
                specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as provide easy versioning for data.

== Documentation ==

If people are happy with the approach in this patch set, I will write up
conclusive documentation including schemas for the metadata as part of its
next iteration. For now, here's a rudimentary overview:

We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks to every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep a "scratch region" available
for kexec: A physically contiguous memory region that is guaranteed to
not have any memory that KHO would preserve.  The new kernel bootstraps
itself using the scratch region and sets all handed over memory as in use.
When drivers that support KHO initialize, they introspect the fdt and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.
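
Because the handover metadata is a standard flattened device tree, generic
FDT tooling should be able to read it. The following is a minimal sketch,
assuming you have a copy of the blob in a regular file. Where (or whether)
the kernel exposes the blob is not stated in this overview, so the path is
left to the caller. The function names are made up for illustration. The
check relies only on the FDT magic number 0xd00dfeed (stored big-endian)
and on the dtc decompiler, if it is installed.

```sh
#!/bin/sh
# Sketch only: inspect a file that is ASSUMED to hold a KHO metadata blob.
# The blob location is not defined by this cover letter; pass your own path.

# Return 0 if the file starts with the FDT magic 0xd00dfeed (big-endian).
is_fdt() {
    magic=$(od -An -tx1 -N4 "$1" 2>/dev/null | tr -d ' \n')
    [ "$magic" = "d00dfeed" ]
}

# Decompile the blob to device tree source on stdout if it looks like an FDT.
dump_fdt() {
    if ! is_fdt "$1"; then
        echo "not a flattened device tree: $1" >&2
        return 1
    fi
    dtc -I dtb -O dts "$1"
}
```

Usage would be along the lines of "dump_fdt /path/to/blob", with the path
being whatever copy of the metadata you have at hand.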

== Limitations ==

I currently only implemented file based kexec. The kernel interfaces
in the patch set are already in place to support user space kexec as well,
but I have not implemented it yet.

== How to Use ==

To use the code, boot the kernel with the "kho_scratch=" command line
parameter set, for example "kho_scratch=512M". KHO requires a scratch
region.
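
To double check what the running kernel was booted with, the parameter can
be pulled out of the command line and converted to bytes. This is a sketch
under assumptions: it supposes the value uses the K/M/G binary suffixes
that kernel size parameters conventionally accept (only "512M" is shown
above), and both function names are invented for this example.

```sh
#!/bin/sh
# Sketch only: convert a size such as "512M" into bytes, ASSUMING the
# K/M/G binary suffixes conventionally used by kernel size parameters.
size_to_bytes() {
    num=${1%[KkMmGg]}          # strip one trailing suffix letter, if any
    suffix=${1#"$num"}         # whatever was stripped ("" if none)
    case "$num" in
        ''|*[!0-9]*) echo "invalid size: $1" >&2; return 1 ;;
    esac
    case "$suffix" in
        '')   echo "$num" ;;
        K|k)  echo $((num * 1024)) ;;
        M|m)  echo $((num * 1024 * 1024)) ;;
        G|g)  echo $((num * 1024 * 1024 * 1024)) ;;
    esac
}

# Print the kho_scratch= value found in a kernel command line string.
# Returns non-zero if the parameter is absent.
kho_scratch_arg() {
    for word in $1; do
        case "$word" in
            kho_scratch=*) echo "${word#kho_scratch=}"; return 0 ;;
        esac
    done
    return 1
}
```

For example, size_to_bytes "$(kho_scratch_arg "$(cat /proc/cmdline)")"
prints the scratch size in bytes when the parameter is present.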

Make sure to fill ftrace with contents that you want to observe after
kexec.  Then, before you invoke file based "kexec -l", activate KHO:

  # echo 1 > /sys/kernel/kho/active
  # kexec -l Image --initrd=initrd -s
  # kexec -e

The new kernel will boot up and contain the previous kernel's trace
buffers in /sys/kernel/debug/tracing/trace.
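
The three manual steps can also be wrapped with a few sanity checks. This
is only a sketch: the KHO_ACTIVE variable and the kho_kexec function name
are made up here, and the only interfaces it relies on are the sysfs file
and the kexec invocations shown above.

```sh
#!/bin/sh
# Sketch only: the manual steps above as one function with basic checks.
# KHO_ACTIVE defaults to the sysfs path named in the cover letter; the
# variable itself and the function name are hypothetical.
KHO_ACTIVE=${KHO_ACTIVE:-/sys/kernel/kho/active}

kho_kexec() {
    image=${1:-}
    initrd=${2:-}
    if [ -z "$image" ] || [ -z "$initrd" ]; then
        echo "usage: kho_kexec <kernel-image> <initrd>" >&2
        return 2
    fi
    if [ ! -w "$KHO_ACTIVE" ]; then
        echo "KHO not available: cannot write $KHO_ACTIVE" >&2
        return 1
    fi
    # Activate KHO before loading the image, as described above.
    echo 1 > "$KHO_ACTIVE" || return 1
    # -s selects the file based kexec load, the only mode implemented so far.
    kexec -l "$image" --initrd="$initrd" -s || return 1
    kexec -e
}
```

Run as root, "kho_kexec Image initrd" stops at the first failing step
instead of jumping into a kernel that has nothing to recover.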



Alex

[1] https://lore.kernel.org/all/169645773092.11424.7258549771090599226.stgit@skinsburskii./
[2] https://lore.kernel.org/all/20231016233215.13090-1-madvenka@linux.microsoft.com/
[3] https://lpc.events/event/17/contributions/1485/attachments/1296/2650/jgowans-preserving-across-kexec.pdf
[4] https://lore.kernel.org/kexec/1682554137-13938-1-git-send-email-anthony.yznaga@oracle.com/


Alexander Graf (15):
  mm,memblock: Add support for scratch memory
  memblock: Declare scratch memory as CMA
  kexec: Add Kexec HandOver (KHO) generation helpers
  kexec: Add KHO parsing support
  kexec: Add KHO support to kexec file loads
  arm64: Add KHO support
  x86: Add KHO support
  tracing: Introduce names for ring buffers
  tracing: Introduce names for events
  tracing: Introduce kho serialization
  tracing: Add kho serialization of trace buffers
  tracing: Recover trace buffers from kexec handover
  tracing: Add kho serialization of trace events
  tracing: Recover trace events from kexec handover
  tracing: Add config option for kexec handover

 Documentation/ABI/testing/sysfs-firmware-kho  |   9 +
 Documentation/ABI/testing/sysfs-kernel-kho    |  53 ++
 .../admin-guide/kernel-parameters.txt         |  10 +
 MAINTAINERS                                   |   2 +
 arch/arm64/Kconfig                            |  12 +
 arch/arm64/kernel/setup.c                     |   2 +
 arch/arm64/mm/init.c                          |   8 +
 arch/x86/Kconfig                              |  12 +
 arch/x86/boot/compressed/kaslr.c              |  55 ++
 arch/x86/include/uapi/asm/bootparam.h         |  15 +-
 arch/x86/kernel/e820.c                        |   9 +
 arch/x86/kernel/kexec-bzimage64.c             |  39 ++
 arch/x86/kernel/setup.c                       |  46 ++
 arch/x86/mm/init_32.c                         |   7 +
 arch/x86/mm/init_64.c                         |   7 +
 drivers/of/fdt.c                              |  41 ++
 drivers/of/kexec.c                            |  36 ++
 include/linux/kexec.h                         |  56 ++
 include/linux/memblock.h                      |  19 +
 include/linux/ring_buffer.h                   |   9 +-
 include/linux/trace_events.h                  |   1 +
 include/trace/trace_events.h                  |   2 +
 include/uapi/linux/kexec.h                    |   6 +
 kernel/Makefile                               |   2 +
 kernel/kexec_file.c                           |  41 ++
 kernel/kexec_kho_in.c                         | 298 ++++++++++
 kernel/kexec_kho_out.c                        | 526 ++++++++++++++++++
 kernel/trace/Kconfig                          |  13 +
 kernel/trace/blktrace.c                       |   1 +
 kernel/trace/ring_buffer.c                    | 267 ++++++++-
 kernel/trace/trace.c                          |  76 ++-
 kernel/trace/trace_branch.c                   |   1 +
 kernel/trace/trace_events.c                   |   3 +
 kernel/trace/trace_functions_graph.c          |   4 +-
 kernel/trace/trace_output.c                   | 106 +++-
 kernel/trace/trace_output.h                   |   1 +
 kernel/trace/trace_probe.c                    |   3 +
 kernel/trace/trace_syscalls.c                 |  29 +
 mm/Kconfig                                    |   4 +
 mm/memblock.c                                 |  83 ++-
 40 files changed, 1901 insertions(+), 13 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
 create mode 100644 kernel/kexec_kho_in.c
 create mode 100644 kernel/kexec_kho_out.c

Comments

Eric W. Biederman Dec. 14, 2023, 2:58 p.m. UTC | #1
What you are describing is in many ways the same problem as
kexec-on-panic.  The goal of leaving devices running absolutely requires
carving out memory for the new kernel to live in while it is coming up
so that DMA from a device that was not shut down does not stomp the
kernel coming up.

If I understand the virtualization case some of those virtual machines
are going to have virtual NICs that are going to want to DMA memory to
the host system.  Which if I understand things correctly means that
among the devices you explicitly want to keep running there is not
a way to avoid the chance of DMA coming in while the kernel is being
changed.

There is also a huge maintenance challenge associated with all of this.

If you go with something that is essentially kexec-on-panic and then
add a little bit to help find things in the memory of the previous
kernel while the new kernel is coming up I can see it as a possibility.

As an example I think preserving ftrace data of kexec seems bizarre.
I don't see how that is an interesting use case at all.  Not in
the situation of preserving virtual machines, and not in the situation
of kexec on panic.

If you are doing an orderly shutdown and kernel switch you should be
able to manually change the memory.  If you are not doing an orderly
shutdown then I really don't get it.

I don't hate the capability you are trying to build.

I have not read or looked at most of this so I am probably
missing subtle details.

As you are currently describing things I have the sense you have
completely misframed the problem and are trying to solve the wrong parts
of the problem.

Eric
Alexander Graf Dec. 14, 2023, 4:02 p.m. UTC | #2
Hey Eric,

On 14.12.23 15:58, Eric W. Biederman wrote:
> What you are describing is in many ways the same problem as
> kexec-on-panic.  The goal of leaving devices running absolutely requires
> carving out memory for the new kernel to live in while it is coming up
> so that DMA from a device that was not shut down does not stomp the
> kernel coming up.


Yes, part of the problem is similar: We need a safe space to boot from 
that doesn't overwrite existing data. What happens after is different: 
With panics, you're trying to rescue previous state for post-mortem 
analysis. You may even have intrinsic knowledge of the environment you 
came from, so you can optimize that rescuing. Nobody wants to continue 
running the system as if nothing happened after a panic.

With KHO, the kernels establish an ABI between each other to communicate 
any state that needs to get preserved and the rest gets reinitialized. 
After KHO, the new kernel continues executing workloads that were 
running before.

The ABI is important because the next environment may not have a chance 
to know about the previous environment's setup. Think for example of 
roll-out and roll-back scenarios: If I roll back into my previous 
environment because I determined something didn't work as expected after 
update, I'm moving the system into an environment that was built when 
the kexec source environment didn't even exist yet.


> If I understand the virtualization case some of those virtual machines
> are going to have virtual NICs that are going to want to DMA memory to
> the host system.  Which if I understand things correctly means that


No, to the *guest* system. This is about device assignment: The guest is 
in full control of the NICs that do DMA, so we have no chance to quiesce 
them.


> among the devices you explicitly want to keep running there is not
> a way to avoid the chance of DMA coming in while the kernel is being
> changed.


Correct, because the host doesn't own the driver :).


> There is also a huge maintenance challenge associated with all of this.
>
> If you go with something that is essentially kexec-on-panic and then
> add a little bit to help find things in the memory of the previous
> kernel while the new kernel is coming up I can see it as a possibility.


That's roughly what the patch set is doing, yes. It avoids a static 
allocation ahead of time for next-kernel memory, because I only know the 
size of all components when we're actually doing the kexec. But the 
principle is similar.

The bit where the new kernel finds bits in the old memory is the KHO DT: 
A flattened device tree structure the old kernel passes to the new 
kernel. That contains all memory locations as well as additional 
metadata to "help find things" in a way that doesn't immediately break 
on every kernel change.


> As an example I think preserving ftrace data of kexec seems bizarre.
> I don't see how that is an interesting use case at all.  Not in
> the situation of preserving virtual machines, and not in the situation
> of kexec on panic.


It's super useful as a self-debugging aid: I already used it to profile 
the kexec path to find a few performance issues :). It's also really 
helpful - even without device assignment support yet - when you use it 
in combination with KVM trace points: You have a VM running backed by a 
DAX pmem device, then serialize its virtual device state, kexec, restore 
from the virtual device state, then the VM misbehaves.

With ftrace handover in place, you get a full trace of the flow which 
simplifies debugging of issues that happen during/because of the 
serialization/deserialization flow of KVM state.
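
A debugging run along these lines could look roughly as follows. This is a
sketch under assumptions: it uses the standard tracefs control files
(tracing_on, trace_marker, events/kvm/enable, trace), the TRACEFS variable
and the function names are invented for the example, and whether
trace_marker entries specifically survive the handover depends on the
patch set's event handling and has not been verified here.

```sh
#!/bin/sh
# Sketch only: leave a recognizable record in the trace buffer before kexec
# and look for it afterwards. TRACEFS defaults to the path used in the cover
# letter; the variable and function names are hypothetical.
TRACEFS=${TRACEFS:-/sys/kernel/debug/tracing}

# Before kexec: enable tracing, turn on KVM events if present, drop a marker.
trace_prepare() {
    echo 1 > "$TRACEFS/tracing_on" || return 1
    if [ -e "$TRACEFS/events/kvm/enable" ]; then
        echo 1 > "$TRACEFS/events/kvm/enable" || return 1
    fi
    echo "$1" > "$TRACEFS/trace_marker"
}

# After kexec: succeed if the given text shows up in the recovered trace.
trace_has_marker() {
    grep -qF -- "$1" "$TRACEFS/trace"
}
```

Call trace_prepare "pre-kexec" before activating KHO, then
trace_has_marker "pre-kexec" in the new kernel to see whether the record
came across.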

But the main reason I chose ftrace to start with is that all other use 
cases require another concept: fd preservation. All the typical 
"objects" you want to preserve across kexec are anonymous file 
descriptors. So we need to also build a way in Linux that allows user 
space to request the kernel to preserve an fd using the kexec handover 
framework in this patch set. But that is another big discussion I wanted 
to keep separate: Ftrace is from kernel, to kernel and hence "easy".


> If you are doing an orderly shutdown and kernel switch you should be
> able to manually change the memory.  If you are not doing an orderly
> shutdown then I really don't get it.


I don't follow the paragraph above?


> I don't hate the capability you are trying to build.
>
> I have not read or looked at most of this so I am probably
> missing subtle details.
>
> As you are currently describing things I have the sense you have
> completely misframed the problem and are trying to solve the wrong parts
> of the problem.


Very well possible :). I hope the above clarifies it a bit. If not, 
please let me know where exactly it's unclear so I can elaborate.

If you have a few minutes, it would also be great if you could have a 
look at our slides [1] or even video [2] from LPC 2023 which go into 
detail of the end problem. Beware that I'm consciously *not* trying to 
solve the end problem yet: I want to take baby steps towards it. Nobody 
wants to review an 80-patch series where everything depends on 
everything else.


Alex


[1] https://lpc.events/event/17/contributions/1485/attachments/1296/2650/jgowans-preserving-across-kexec.pdf
[2] https://www.youtube.com/watch?v=cYrlV4bK1Y4



