[v3,00/17] kexec: Allow preservation of ftrace buffers

Message ID 20240117144704.602-1-graf@amazon.com (mailing list archive)

Alexander Graf Jan. 17, 2024, 2:46 p.m. UTC
Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See James' and my Linux Plumbers
Conference 2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this
patch implements basic infrastructure to allow hand over of kernel state
across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
With this patch set applied, you can read ftrace records from the
pre-kexec environment in your post-kexec one. This creates a very powerful
debugging and performance analysis tool for kexec. It's also slightly
easier to reason about than full blown VFIO state preservation.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
                pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
                  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
                specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt) which has a generator and parser already included in
Linux. When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks to every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep a "scratch region" available
for kexec: A physically contiguous memory region that is guaranteed to
not have any memory that KHO would preserve.  The new kernel bootstraps
itself using the scratch region and sets all handed over memory as in use.
When drivers initialize that support KHO, they introspect the fdt and
recover their state from it. This includes memory reservations, where the
driver can either discard or claim reservations.

== Limitations ==

I currently only implemented file based kexec. The kernel interfaces
in the patch set are already in place to support user space kexec as well,
but I have not implemented it yet inside kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho_scratch=" command
line parameter set, for example "kho_scratch=512M". KHO requires a scratch
region.
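
For readers unfamiliar with the size notation: the sketch below shows how a
value such as "512M" translates to bytes, assuming the parameter accepts the
usual K/M/G suffixes. demo_parse_size() is a hypothetical user space helper
written for illustration; it is not the parser used by the patch set.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert a size string with an optional K/M/G
 * suffix into bytes. Each suffix multiplies by 1024 one more time.
 */
uint64_t demo_parse_size(const char *s)
{
	char *end;
	uint64_t v = strtoull(s, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		break;
	default:
		break;
	}
	return v;
}
```

With this sketch, demo_parse_size("512M") evaluates to 536870912 bytes.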

Make sure to fill ftrace with contents that you want to observe after
kexec.  Then, before you invoke file based "kexec -l", activate KHO:

  # echo 1 > /sys/kernel/kho/active
  # kexec -l Image --initrd=initrd -s
  # kexec -e

The new kernel will boot up and contain the previous kernel's trace
buffers in /sys/kernel/debug/tracing/trace.

== Changelog ==

v1 -> v2:
  - Removed: tracing: Introduce names for ring buffers
  - Removed: tracing: Introduce names for events
  - New: kexec: Add config option for KHO
  - New: kexec: Add documentation for KHO
  - New: tracing: Initialize fields before registering
  - New: devicetree: Add bindings for ftrace KHO
  - test bot warning fixes
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Remove / reduce ifdefs
  - Select crc32
  - Leave anything that requires a name in trace.c to keep buffers
    unnamed entities
  - Put events as array into a property, use fingerprint instead of
    names to identify them
  - Reduce footprint without CONFIG_FTRACE_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - make kho_get_fdt() const
  - Add stubs for return_mem and claim_mem
  - make kho_get_fdt() const
  - Get events as array from a property, use fingerprint instead of
    names to identify events
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Leave the node generation code that needs to know the name in
    trace.c so that ring buffers can stay anonymous
  - s/kho_reserve/kho_reserve_scratch/g
  - Move kho enums out of ifdef
  - Move from names to fdt offsets. That way, trace.c can find the trace
    array offset and then the ring buffer code only needs to read out
    its per-CPU data. That way it can stay oblivious to its name.
  - Make kho_get_fdt() const

v2 -> v3:

  - Fix make dt_binding_check
  - Add descriptions for each object
  - s/trace_flags/trace-flags/
  - s/global_trace/global-trace/
  - Make all additionalProperties false
  - Change subject to reflect subsystem (dt-bindings)
  - Fix indentation
  - Remove superfluous examples
  - Convert to 64bit syntax
  - Move to kho directory
  - s/"global_trace"/"global-trace"/
  - s/"global_trace"/"global-trace"/
  - s/"trace_flags"/"trace-flags"/
  - Fix wording
  - Add Documentation to MAINTAINERS file
  - Remove kho reference on read error
  - Move handover_dt unmap up
  - s/reserve_scratch_mem/mark_phys_as_cma/
  - Remove ifdeffery
  - Remove superfluous comment

Alexander Graf (17):
  mm,memblock: Add support for scratch memory
  memblock: Declare scratch memory as CMA
  kexec: Add Kexec HandOver (KHO) generation helpers
  kexec: Add KHO parsing support
  kexec: Add KHO support to kexec file loads
  kexec: Add config option for KHO
  kexec: Add documentation for KHO
  arm64: Add KHO support
  x86: Add KHO support
  tracing: Initialize fields before registering
  tracing: Introduce kho serialization
  tracing: Add kho serialization of trace buffers
  tracing: Recover trace buffers from kexec handover
  tracing: Add kho serialization of trace events
  tracing: Recover trace events from kexec handover
  tracing: Add config option for kexec handover
  Documentation: KHO: Add ftrace bindings

 Documentation/ABI/testing/sysfs-firmware-kho  |   9 +
 Documentation/ABI/testing/sysfs-kernel-kho    |  53 ++
 .../admin-guide/kernel-parameters.txt         |  10 +
 .../kho/bindings/ftrace/ftrace-array.yaml     |  38 ++
 .../kho/bindings/ftrace/ftrace-cpu.yaml       |  43 ++
 Documentation/kho/bindings/ftrace/ftrace.yaml |  62 +++
 Documentation/kho/concepts.rst                |  88 +++
 Documentation/kho/index.rst                   |  19 +
 Documentation/kho/usage.rst                   |  57 ++
 Documentation/subsystem-apis.rst              |   1 +
 MAINTAINERS                                   |   3 +
 arch/arm64/Kconfig                            |   3 +
 arch/arm64/kernel/setup.c                     |   2 +
 arch/arm64/mm/init.c                          |   8 +
 arch/x86/Kconfig                              |   3 +
 arch/x86/boot/compressed/kaslr.c              |  55 ++
 arch/x86/include/uapi/asm/bootparam.h         |  15 +-
 arch/x86/kernel/e820.c                        |   9 +
 arch/x86/kernel/kexec-bzimage64.c             |  39 ++
 arch/x86/kernel/setup.c                       |  46 ++
 arch/x86/mm/init_32.c                         |   7 +
 arch/x86/mm/init_64.c                         |   7 +
 drivers/of/fdt.c                              |  39 ++
 drivers/of/kexec.c                            |  54 ++
 include/linux/kexec.h                         |  58 ++
 include/linux/memblock.h                      |  19 +
 include/linux/ring_buffer.h                   |  17 +-
 include/linux/trace_events.h                  |   1 +
 include/uapi/linux/kexec.h                    |   6 +
 kernel/Kconfig.kexec                          |  13 +
 kernel/Makefile                               |   2 +
 kernel/kexec_file.c                           |  41 ++
 kernel/kexec_kho_in.c                         | 298 ++++++++++
 kernel/kexec_kho_out.c                        | 526 ++++++++++++++++++
 kernel/trace/Kconfig                          |  14 +
 kernel/trace/ring_buffer.c                    | 243 +++++++-
 kernel/trace/trace.c                          |  96 +++-
 kernel/trace/trace_events.c                   |  14 +-
 kernel/trace/trace_events_synth.c             |  14 +-
 kernel/trace/trace_events_user.c              |   4 +
 kernel/trace/trace_output.c                   | 247 +++++++-
 kernel/trace/trace_output.h                   |   5 +
 kernel/trace/trace_probe.c                    |   4 +
 mm/Kconfig                                    |   4 +
 mm/memblock.c                                 |  79 ++-
 45 files changed, 2351 insertions(+), 24 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
 create mode 100644 Documentation/kho/bindings/ftrace/ftrace-array.yaml
 create mode 100644 Documentation/kho/bindings/ftrace/ftrace-cpu.yaml
 create mode 100644 Documentation/kho/bindings/ftrace/ftrace.yaml
 create mode 100644 Documentation/kho/concepts.rst
 create mode 100644 Documentation/kho/index.rst
 create mode 100644 Documentation/kho/usage.rst
 create mode 100644 kernel/kexec_kho_in.c
 create mode 100644 kernel/kexec_kho_out.c

Philipp Rudo Jan. 29, 2024, 4:34 p.m. UTC | #1
Hi Alex,

adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
get KHO to work).

First of all I believe that having a generic framework to pass information from
one kernel to the other across kexec would be a good thing. But I'm afraid that
you are ignoring some fundamental problems which make it extremely hard, if
not impossible, to reliably transfer the kernel's state from one kernel to the
other.

One thing I don't understand is how reusing the scratch area is working. Sure
you pass its location via the dt/boot_params but I don't see any code that
makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
runs inside that area and I don't believe the CMA design ever considered that
the kernel image could be included in a CMA area.

Staying at reusing the scratch area. One thing that is broken for sure is that
you reuse the scratch area without ever checking the kho_scratch parameter of
the 2nd kernel's command line. Remember, with kexec you are dealing with two
different kernels with two different command lines. Meaning you can only reuse
the scratch area if the requested size in the 2nd kernel is identical to the
one of the 1st kernel. In all other cases you need to adjust the scratch area's
size or reserve a new one.

This directly leads to the next problem. In kho_reserve_previous_mem you are
reusing the different memory regions wherever the 1st kernel allocated them.
But that also means you are handing over the 1st kernel's memory
fragmentation to the 2nd kernel and you do that extremely early during boot.
Which means that users who need to allocate large contiguous physical memory,
like the scratch area or the crashkernel memory, will have an increasing chance
of not finding a suitable area. Which IMHO is unacceptable.

Finally, and that's the big elephant in the room, is your lax handling of the
unstable kernel internal ABI. Remember, you are dealing with two different
kernels, that also means two different source levels and two different configs.
So just because both the 1st and 2nd kernel have e.g. a struct buffer_page
doesn't mean that they have the same struct buffer_page. But that's what your
code implicitly assumes. For KHO ever to make it upstream you need to make sure
that both kernels are "speaking the same language".

Personally I see two possible solutions:

1) You introduce a stable intermediate format for every subsystem similar to
what IMA_KEXEC does. This should work for simple types like struct buffer_page
but for complex ones like struct vfio_device that's basically impossible.

2) You also hand over the ABI version for every given type (basically just a
hash over all fields including all the dependencies). So the 2nd kernel can
verify that the data handed over is in a format it can handle and if not bail
out with a descriptive error message rather than reading garbage. Plus side is
that once such a system is in place you can reuse it to automatically resolve
all dependencies so you no longer need to manually store the buffer_page and
its buffer_data_page separately.
Down side is that traversing the debuginfo (including the ones from modules) is
not a simple task and I expect that such a system will be way more complex than
the rest of KHO. In addition there are some cases that the versioning won't be
able to capture. For example if a type contains a "void *"-field. Then although
the definition of the type is identical in both kernels the field can be cast
to different types when used. Another problem will be function pointers which
you first need to resolve in the 1st kernel and then map to the identical
function in the 2nd kernel. This will become particularly "fun" when the
function is part of a module that isn't loaded at the time when you try to
recreate the kernel's state.
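
To make the "hash over all fields" idea from 2) a bit more concrete, here is
a rough user space sketch. It is not taken from any existing kernel facility:
struct demo_field, demo_layout_hash() and the choice of FNV-1a are all
assumptions made for illustration, and it ignores nested types, "void *"
fields and function pointers entirely.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one struct member. */
struct demo_field {
	const char *name;
	uint32_t offset;
	uint32_t size;
};

/* 64-bit FNV-1a over a byte range, continuing from hash value h. */
static uint64_t demo_fnv1a(uint64_t h, const void *data, size_t len)
{
	const unsigned char *p = data;

	while (len--) {
		h ^= *p++;
		h *= UINT64_C(0x100000001b3);
	}
	return h;
}

/* Hash a 32-bit value in little-endian byte order on any host. */
static uint64_t demo_fnv1a_u32(uint64_t h, uint32_t v)
{
	unsigned char b[4] = {
		(unsigned char)(v & 0xff),
		(unsigned char)((v >> 8) & 0xff),
		(unsigned char)((v >> 16) & 0xff),
		(unsigned char)((v >> 24) & 0xff),
	};

	return demo_fnv1a(h, b, sizeof(b));
}

/* Fingerprint of a type: name, offset and size of every field. */
uint64_t demo_layout_hash(const struct demo_field *f, size_t n)
{
	uint64_t h = UINT64_C(0xcbf29ce484222325);
	size_t i;

	for (i = 0; i < n; i++) {
		h = demo_fnv1a(h, f[i].name, strlen(f[i].name) + 1);
		h = demo_fnv1a_u32(h, f[i].offset);
		h = demo_fnv1a_u32(h, f[i].size);
	}
	return h;
}
```

The 1st kernel would hand over the fingerprint next to the data, and the 2nd
kernel would compare it against its own value before touching the payload.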

So to summarize, while it would be nice to have a generic framework like KHO to
pass data from one kernel to the other via kexec there are good reasons why it
doesn't exist, yet.

Thanks
Philipp


Alexander Graf Feb. 2, 2024, 12:58 p.m. UTC | #2
Hi Philipp,

On 29.01.24 17:34, Philipp Rudo wrote:
> Hi Alex,
>
> adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
> get KHO to work).
>
> First of all I believe that having a generic framework to pass information from
> one kernel to the other across kexec would be a good thing. But I'm afraid that


Thanks, I'm happy to hear that you agree with the basic motivation :). 
There are fundamentally 2 problems with passing data:

   * Passing structured data in a cross-architecture way
   * Passing memory

KHO tackles both. It proposes a common FDT based format that allows us 
to pass per-subsystem properties. That way, a subsystem does not need to 
know whether it's running on ARM, x86, RISC-V or s390x. It just gains 
awareness for KHO and can pass data.

On top of that, it proposes a standardized "mem" property (and some 
magic around that) which allows subsystems to pass memory.


> you are ignoring some fundamental problems which make it extremely hard, if
> not impossible, to reliably transfer the kernel's state from one kernel to the
> other.
>
> One thing I don't understand is how reusing the scratch area is working. Sure
> you pass its location via the dt/boot_params but I don't see any code that
> makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
> kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
> runs inside that area and I don't believe the CMA design ever considered that
> the kernel image could be included in a CMA area.


That one took me a long time to figure out sensibly (with recursion all the 
way down) while building KHO :). I hope I detailed it sensibly in the 
documentation - please let me know how to improve it in case it's 
unclear: https://lore.kernel.org/lkml/20240117144704.602-8-graf@amazon.com/

Let me also explain inline, in different words, what happens:

The first (and only the first) kernel that boots allocates a CMA region 
as "scratch region". It loads the new kernel into that region. It passes 
that region as "scratch region" to the next kernel. The next kernel now 
takes it and marks every page block that the scratch region spans as CMA:

https://lore.kernel.org/lkml/20240117144704.602-3-graf@amazon.com/

The CMA hint doesn't mean we create an actual CMA region. It mostly 
means that the kernel won't use this memory for any kernel allocations. 
Kernel allocations up to this point are allocations we don't need to 
pass on with KHO again. Kernel allocations past that point may be 
allocations that we want to pass, so we just never place them into the 
"scratch region" again.

And because we now already have a scratch region from the previous 
kernel, we keep reusing that forever with any new KHO kexec.
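
To illustrate the "every page block that the scratch region spans" part,
here is a small sketch of the rounding involved. It only shows the address
arithmetic, not the actual marking. It assumes 4 KiB pages and 2 MiB page
blocks, and all demo_* names are made up for this example.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT		12	/* assumption: 4 KiB pages */
#define DEMO_PAGEBLOCK_ORDER	9	/* assumption: 2 MiB page blocks */
#define DEMO_PAGEBLOCK_BYTES \
	(UINT64_C(1) << (DEMO_PAGE_SHIFT + DEMO_PAGEBLOCK_ORDER))

/* Half-open physical byte range [start, end). */
struct demo_range {
	uint64_t start;
	uint64_t end;
};

/* Widen [start, start + size) outwards to page block boundaries. */
struct demo_range demo_scratch_pageblocks(uint64_t start, uint64_t size)
{
	const uint64_t mask = ~(DEMO_PAGEBLOCK_BYTES - 1);
	struct demo_range r;

	r.start = start & mask;
	r.end = (start + size + DEMO_PAGEBLOCK_BYTES - 1) & mask;
	return r;
}

/* Number of page blocks that would receive the CMA hint. */
uint64_t demo_pageblock_count(struct demo_range r)
{
	return (r.end - r.start) / DEMO_PAGEBLOCK_BYTES;
}
```

A scratch region that is not page block aligned therefore covers one block
more than its size alone suggests: 512 MiB starting 1 MiB into a block spans
257 blocks rather than 256.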


> Staying at reusing the scratch area. One thing that is broken for sure is that
> you reuse the scratch area without ever checking the kho_scratch parameter of
> the 2nd kernel's command line. Remember, with kexec you are dealing with two
> different kernels with two different command lines. Meaning you can only reuse
> the scratch area if the requested size in the 2nd kernel is identical to the
> one of the 1st kernel. In all other cases you need to adjust the scratch area's
> size or reserve a new one.


Hm. So you're saying a user may want to change the size of the scratch 
area with a KHO kexec. That's insanely risky because you (as rightfully 
pointed out below) may have significant fragmentation at that point. And 
we will only know when we're in the new kernel so it's too late to 
abort. IMHO it's better to just declare the scratch region as immutable 
during KHO to avoid that pitfall.


> This directly leads to the next problem. In kho_reserve_previous_mem you are
> reusing the different memory regions wherever the 1st kernel allocated them.
> But that also means you are handing over the 1st kernel's memory
> fragmentation to the 2nd kernel and you do that extremely early during boot.
> Which means that users who need to allocate large contiguous physical memory,
> like the scratch area or the crashkernel memory, will have an increasing chance
> of not finding a suitable area. Which IMHO is unacceptable.


Correct :). It basically means you want to pass large allocations from 
the 1st kernel that you want to preserve on to the next. So if the 1st 
kernel allocated a large crash area, it's safest to pass that allocation 
using KHO to ensure the next kernel also has the region fully reserved. 
Otherwise the next kernel may accidentally place data into the 
previously reserved crash region (which would be contiguously free at 
early init of the 2nd kernel) and fragment it again.


> Finally, and that's the big elephant in the room, is your lax handling of the
> unstable kernel internal ABI. Remember, you are dealing with two different
> kernels, that also means two different source levels and two different configs.
> So just because both the 1st and 2nd kernel have e.g. a struct buffer_page
> doesn't mean that they have the same struct buffer_page. But that's what your
> code implicitly assumes. For KHO ever to make it upstream you need to make sure
> that both kernels are "speaking the same language".


Wow, I hope it didn't come across as that! The whole point of using FDT 
and compatible strings in KHO is to solve exactly that problem. Any time 
a passed over data structure changes incompatibly, you would need to 
modify the compatible string of the subsystem that owns the now 
incompatible data.

So in the example of struct buffer_page, it means that if anyone changes 
the few bits we care about in struct buffer_page, we need to ensure that 
the new kernel emits "ftrace,cpu-v2" compatible strings. We can at that 
point choose whether we want to implement compat handling for 
"ftrace,cpu-v1" style struct buffer_pages or only support same version 
ingestion.
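
As a rough sketch of what that choice looks like on the ingesting side: the
new kernel maps the compatible string it finds to a format version it knows,
and ignores anything else. The enum and demo_match_compat() below are
hypothetical and only illustrate the idea; they are not code from the patch
set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical format versions this kernel knows how to ingest. */
enum demo_compat {
	DEMO_COMPAT_NONE = 0,
	DEMO_COMPAT_V1,
	DEMO_COMPAT_V2,
};

/* Map a handed-over compatible string to a known format version. */
enum demo_compat demo_match_compat(const char *compat)
{
	static const struct {
		const char *name;
		enum demo_compat ver;
	} known[] = {
		{ "ftrace,cpu-v2", DEMO_COMPAT_V2 },
		{ "ftrace,cpu-v1", DEMO_COMPAT_V1 },	/* optional compat path */
	};
	size_t i;

	if (!compat)
		return DEMO_COMPAT_NONE;

	for (i = 0; i < sizeof(known) / sizeof(known[0]); i++)
		if (strcmp(compat, known[i].name) == 0)
			return known[i].ver;

	/* Unknown format: do not interpret the handed-over data. */
	return DEMO_COMPAT_NONE;
}
```

Dropping the "ftrace,cpu-v1" entry from the table is what "only support same
version ingestion" would mean in this sketch.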

The one thing that we could improve on here today IMHO is to have 
compile time errors if any part of struct buffer_page changes 
semantically: So we'd create a few defines for the bits we want in 
"ftrace,cpu-v1" as well as size of struct buffer_page and then compare 
them to what the struct offsets are at compile time to ensure they stay 
identical.
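
A minimal sketch of such a check, to show the shape of it. struct
demo_buffer_page is a simplified stand-in and not the real struct
buffer_page, and the FTRACE_CPU_V1_* defines are hypothetical names for the
frozen layout.

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>	/* offsetof */
#include <stdint.h>

/* Simplified stand-in for the fields the handover format relies on. */
struct demo_buffer_page {
	uint64_t write;		/* next write index */
	uint64_t read;		/* next read index */
	uint64_t entries;	/* entries on this page */
	uint64_t page_phys;	/* physical address of the data page */
};

/* Layout frozen for the "ftrace,cpu-v1" format (hypothetical values). */
#define FTRACE_CPU_V1_OFF_WRITE		0
#define FTRACE_CPU_V1_OFF_READ		8
#define FTRACE_CPU_V1_OFF_ENTRIES	16
#define FTRACE_CPU_V1_OFF_PAGE_PHYS	24
#define FTRACE_CPU_V1_SIZE		32

static_assert(offsetof(struct demo_buffer_page, write) ==
	      FTRACE_CPU_V1_OFF_WRITE, "write moved: bump compatible");
static_assert(offsetof(struct demo_buffer_page, read) ==
	      FTRACE_CPU_V1_OFF_READ, "read moved: bump compatible");
static_assert(offsetof(struct demo_buffer_page, entries) ==
	      FTRACE_CPU_V1_OFF_ENTRIES, "entries moved: bump compatible");
static_assert(offsetof(struct demo_buffer_page, page_phys) ==
	      FTRACE_CPU_V1_OFF_PAGE_PHYS, "page_phys moved: bump compatible");
static_assert(sizeof(struct demo_buffer_page) == FTRACE_CPU_V1_SIZE,
	      "size changed: bump compatible");
```

If someone reorders or resizes one of these fields, the build fails until
either the layout is restored or a new compatible string is introduced.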

Please let me know how I can clarify that more in the documentation. It 
really is the absolute core of KHO.


> Personally I see two possible solutions:
>
> 1) You introduce a stable intermediate format for every subsystem similar to
> what IMA_KEXEC does. This should work for simple types like struct buffer_page
> but for complex ones like struct vfio_device that's basically impossible.


I don't see why. The only reason KHO passes struct buffer_page as memory 
is because we want to be able to produce traces even after KHO 
serialization is done. For vfio_device, I think it's perfectly 
reasonable to serialize any data we need to preserve directly into FDT 
properties.


> 2) You also hand over the ABI version for every given type (basically just a
> hash over all fields including all the dependencies). So the 2nd kernel can
> verify that the data handed over is in a format it can handle and if not bail
> out with a descriptive error message rather than reading garbage. Plus side is
> that once such a system is in place you can reuse it to automatically resolve
> all dependencies so you no longer need to manually store the buffer_page and
> its buffer_data_page separately.
> Down side is that traversing the debuginfo (including the ones from modules) is
> not a simple task and I expect that such a system will be way more complex than
> the rest of KHO. In addition there are some cases that the versioning won't be
> able to capture. For example if a type contains a "void *"-field. Then although
> the definition of the type is identical in both kernels the field can be cast
> to different types when used. An other problem will be function pointers which
> you first need to resolve in the 1st kernel and then map to the identical
> function in the 2nd kernel. This will become particularly "fun" when the
> function is part of a module that isn't loaded at the time when you try to
> recreate the kernel's state.


The whole point of KHO is to leave it to the subsystem which path it 
wants to take. The subsystem can pass binary data and validate it via 
FDT properties (like compatible strings). That data can be identical to 
today's in-kernel data structures (usually a bad idea) or can be a new 
intermediate data format. But the subsystem can also choose to fully 
serialize into FDT properties and not pass any memory at all for state 
that would be in structs. Or something in between.


> So to summarize, while it would be nice to have a generic framework like KHO to
> pass data from one kernel to the other via kexec there are good reasons why it
> doesn't exist, yet.


I hope my explanations above clarify things a bit. Let me know if you're 
at FOSDEM, happy to talk about the internals there as well :)


Alex





Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Oleksij Rempel Feb. 6, 2024, 8:17 a.m. UTC | #3
Hi Alexander,

Nice work!

On Wed, Jan 17, 2024 at 02:46:47PM +0000, Alexander Graf wrote:
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
> 
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:
> 
>   https://lpc.events/event/17/contributions/1485/
> 
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
> 
> == Alternatives ==
> 
> There are alternative approaches to (parts of) the problems above:
> 
>   * Memory Pools [1] - preallocated persistent memory region + allocator
>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
>                 pointer on the kernel command line + allocator
>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>                   address location on the kernel command line
>   * PKRAM [4] - handover of user space pages using a fixed metadata page
>                 specified via command line
> 
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
> 
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
> 
> == Overview ==
> 
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch region" available
> for kexec: A physically contiguous memory region that is guaranteed to
> not have any memory that KHO would preserve.  The new kernel bootstraps
> itself using the scratch region and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
> 
> == Limitations ==
> 
> I currently only implemented file based kexec. The kernel interfaces
> in the patch set are already in place to support user space kexec as well,
> but I have not implemented it yet inside kexec tools.
> 
> == How to Use ==
> 
> To use the code, please boot the kernel with the "kho_scratch=" command
> line parameter set: "kho_scratch=512M". KHO requires a scratch region.
> 
> Make sure to fill ftrace with contents that you want to observe after
> kexec.  Then, before you invoke file based "kexec -l", activate KHO:
> 
>   # echo 1 > /sys/kernel/kho/active
>   # kexec -l Image --initrd=initrd -s
>   # kexec -e
> 
> The new kernel will boot up and contain the previous kernel's trace
> buffers in /sys/kernel/debug/tracing/trace.

Assuming:
- we want to start tracing as early as possible, before rootfs
  or initrd would be able to configure it.
- traces are stored on a different device, not RAM. For example NVMEM.
- Location of NVMEM is different for different board types, but
  bootloader is able to give the right configuration to the kernel.

What would be the best, acceptable for mainline, way to provide this
kind of configuration? At least part of this information does not
describe devices or device states, so it would not fit into the devicetree
universe. The amount of possible information would not fit into bootconfig
either.

Another, more or less overlapping, use case I have in mind is a netbootable
embedded system with a requirement to boot as fast as possible. Since the
bootloader has already established a link and got all needed IP
configuration, it would be able to hand over Ethernet controller and IP
configuration states. Would KHO be the way to go for this use case?

Regards,
Oleksij
Alexander Graf Feb. 6, 2024, 1:43 p.m. UTC | #4
Hey Oleksij!

On 06.02.24 09:17, Oleksij Rempel wrote:
> Hi Alexander,
>
> Nice work!
>
> On Wed, Jan 17, 2024 at 02:46:47PM +0000, Alexander Graf wrote:
>> Kexec today considers itself purely a boot loader: When we enter the new
>> kernel, any state the previous kernel left behind is irrelevant and the
>> new kernel reinitializes the system.
>>
>> However, there are use cases where this mode of operation is not what we
>> actually want. In virtualization hosts for example, we want to use kexec
>> to update the host kernel while virtual machine memory stays untouched.
>> When we add device assignment to the mix, we also need to ensure that
>> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
>> need to do the same for the PCI subsystem. If we want to kexec while an
>> SEV-SNP enabled virtual machine is running, we need to preserve the VM
>> context pages and physical memory. See James' and my Linux Plumbers
>> Conference 2023 presentation for details:
>>
>>    https://lpc.events/event/17/contributions/1485/
>>
>> To start us on the journey to support all the use cases above, this
>> patch implements basic infrastructure to allow hand over of kernel state
>> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
>> With this patch set applied, you can read ftrace records from the
>> pre-kexec environment in your post-kexec one. This creates a very powerful
>> debugging and performance analysis tool for kexec. It's also slightly
>> easier to reason about than full blown VFIO state preservation.
>>
>> == Alternatives ==
>>
>> There are alternative approaches to (parts of) the problems above:
>>
>>    * Memory Pools [1] - preallocated persistent memory region + allocator
>>    * PRMEM [2] - resizable persistent memory regions with fixed metadata
>>                  pointer on the kernel command line + allocator
>>    * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>>                    address location on the kernel command line
>>    * PKRAM [4] - handover of user space pages using a fixed metadata page
>>                  specified via command line
>>
>> All of the approaches above fundamentally have the same problem: They
>> require the administrator to explicitly carve out a physical memory
>> location because they have no mechanism outside of the kernel command
>> line to pass data (including memory reservations) between kexec'ing
>> kernels.
>>
>> KHO provides that base foundation. We will determine later whether we
>> still need any of the approaches above for fast bulk memory handover of for
>> example IOMMU page tables. But IMHO they would all be users of KHO, with
>> KHO providing the foundational primitive to pass metadata and bulk memory
>> reservations as well as provide easy versioning for data.
>>
>> == Overview ==
>>
>> We introduce a metadata file that the kernels pass between each other. How
>> they pass it is architecture specific. The file's format is a Flattened
>> Device Tree (fdt) which has a generator and parser already included in
>> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
>> kernel invokes callbacks to every driver that supports KHO to serialize
>> its state. When the actual kexec happens, the fdt is part of the image
>> set that we boot into. In addition, we keep a "scratch region" available
>> for kexec: A physically contiguous memory region that is guaranteed to
>> not have any memory that KHO would preserve.  The new kernel bootstraps
>> itself using the scratch region and sets all handed over memory as in use.
>> When drivers initialize that support KHO, they introspect the fdt and
>> recover their state from it. This includes memory reservations, where the
>> driver can either discard or claim reservations.
>>
>> == Limitations ==
>>
>> I currently only implemented file based kexec. The kernel interfaces
>> in the patch set are already in place to support user space kexec as well,
>> but I have not implemented it yet inside kexec tools.
>>
>> == How to Use ==
>>
>> To use the code, please boot the kernel with the "kho_scratch=" command
>> line parameter set: "kho_scratch=512M". KHO requires a scratch region.
>>
>> Make sure to fill ftrace with contents that you want to observe after
>> kexec.  Then, before you invoke file based "kexec -l", activate KHO:
>>
>>    # echo 1 > /sys/kernel/kho/active
>>    # kexec -l Image --initrd=initrd -s
>>    # kexec -e
>>
>> The new kernel will boot up and contain the previous kernel's trace
>> buffers in /sys/kernel/debug/tracing/trace.
> Assuming:
> - we want to start tracing as early as possible, before rootfs
>    or initrd would be able to configure it.
> - traces are stored on a different device, not RAM. For example NVMEM.
> - Location of NVMEM is different for different board types, but
>    bootloader is able to give the right configuration to the kernel.


Let me try to really understand what you're tracing here. Are we talking 
about exposing boot loader traces into Linux [1]? In that case, I think 
a mechanism like [2] is what you're looking for.

Or do you want to transfer genuine Linux ftrace traces? In that case, 
why would you want to store them outside of RAM?


>
> What would be the best, acceptable for mainline, way to provide this
> kind of configuration? At least part of this information does not
> describe devices or device states, so it would not fit into the devicetree
> universe. The amount of possible information would not fit into bootconfig
> either.


We have precedent for configuration in device tree: You can use device 
tree to describe partitions on a NAND device, you can use it to specify 
MAC address overrides of devices attached to USB, etc etc. At the end of 
the day when people say they don't want configuration in device tree, 
what they mean is that device tree should be a hand over data structure 
from firmware to kernel, not from OS integrator to kernel :). If your 
firmware is the place that knows about offsets and you need to pass 
those offsets, IMHO DT is a good fit.


> Another, more or less overlapping, use case I have in mind is a netbootable
> embedded system with a requirement to boot as fast as possible. Since the
> bootloader has already established a link and got all needed IP
> configuration, it would be able to hand over Ethernet controller and IP
> configuration states. Would KHO be the way to go for this use case?


That's an interesting one too. I would lean towards "try with normal 
device tree first" here as well. It's again a very clear case of 
"firmware wants to tell OS about things it knows, but the OS doesn't 
know" to me. That means device tree should be fine to describe it.


Alex

[1] https://www.youtube.com/watch?v=RaFm5FfzFaM / 
https://edk2.groups.io/g/devel/topic/91368904
[2] https://github.com/agraf/linux/commit/b1fe0c296ec923e9b1f544862b0eb9365a8da7cb

>
> Regards,
> Oleksij
> --
> Pengutronix e.K.                           |                             |
> Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
> 31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
> Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |



Oleksij Rempel Feb. 6, 2024, 2:40 p.m. UTC | #5
On Tue, Feb 06, 2024 at 02:43:15PM +0100, Alexander Graf wrote:
> Hey Oleksij!
> 
> On 06.02.24 09:17, Oleksij Rempel wrote:
> > Hi Alexander,
> > 
> > Nice work!
> > 
> > On Wed, Jan 17, 2024 at 02:46:47PM +0000, Alexander Graf wrote:
> > > Make sure to fill ftrace with contents that you want to observe after
> > > kexec.  Then, before you invoke file based "kexec -l", activate KHO:
> > > 
> > >    # echo 1 > /sys/kernel/kho/active
> > >    # kexec -l Image --initrd=initrd -s
> > >    # kexec -e
> > > 
> > > The new kernel will boot up and contain the previous kernel's trace
> > > buffers in /sys/kernel/debug/tracing/trace.
> > Assuming:
> > - we want to start tracing as early as possible, before rootfs
> >    or initrd would be able to configure it.
> > - traces are stored on a different device, not RAM. For example NVMEM.
> > - Location of NVMEM is different for different board types, but
> >    bootloader is able to give the right configuration to the kernel.
> 
> 
> Let me try to really understand what you're tracing here. Are we talking
> about exposing boot loader traces into Linux [1]? In that case, I think a
> mechanism like [2] is what you're looking for.
> 
> Or do you want to transfer genuine Linux ftrace traces? In that case, why
> would you want to store them outside of RAM?

The high-level objective of what I need is to find out how embedded systems
in the field break. Since these devices should be always on, there are
different situations where the system may reboot. For example, voltage
related issues, temperature, scheduled system updates, HW or SW errors.

To get a better understanding of what is going on, information should be
collected. But there are some limitations:
- voltage drops can be recorded only with prepared HW:
  https://www.spinics.net/lists/devicetree/msg644030.html

- In case of voltage drops RAM or block devices can't be used. Instead,
  some variant of NVMEM should be used. In my case, NVMEM has 8 bits of
  storage :) So, only one entry of the "trace" is compressed to this storage.
  https://lore.kernel.org/all/20240124122204.730370-1-o.rempel@pengutronix.de
  The reset reason information is provided by the kernel and used by
  firmware and kernel on the next reboot.

The implementation is not a big deal. The problematic part is how the
system should get information about the existence of the recorder and
where the recorder should store things, for example an NVMEM cell.

In my initial implementation I used devicetree to configure the software
based recorder and linked it with an NVMEM cell. But this is against the
DT purpose of describing only HW, and it makes this recorder unusable for
non-DT based systems.

Krzysztof is suggesting to configure it from initrd. This has its own
limitations as well:
 - the recorder can't be used before initrd.
 - we have multiple configuration points for board specific information -
   firmware (bootloader) and initrd.
 - initrd takes up space and costs boot time on devices which did not
   need it before.

Other variants like kernel command-line and/or module parameters seem
to be not acceptable, depending on the maintainer. So, I'm still seeking
a proper, acceptable, portable way to hand over non-HW-specific
information to the kernel.

> > What would be the best, acceptable for mainline, way to provide this
> > kind of configuration? At least part of this information does not
> > describe devices or device states, so it would not fit into the devicetree
> > universe. The amount of possible information would not fit into bootconfig
> > either.
> 
> 
> We have precedent for configuration in device tree: You can use device tree
> to describe partitions on a NAND device, you can use it to specify MAC
> address overrides of devices attached to USB, etc etc. At the end of the day
> when people say they don't want configuration in device tree, what they mean
> is that device tree should be a hand over data structure from firmware to
> kernel, not from OS integrator to kernel :). If your firmware is the place
> that knows about offsets and you need to pass those offsets, IMHO DT is a
> good fit.

Yes, the layout of the NVMEM can be described in the DT. But how can I tell
the system that this NVMEM cell should be used by some recorder or
tracer, and do so before sysfs is available? @Krzysztof ?

> > Another, more or less overlapping, use case I have in mind is a netbootable
> > embedded system with a requirement to boot as fast as possible. Since the
> > bootloader has already established a link and got all needed IP
> > configuration, it would be able to hand over Ethernet controller and IP
> > configuration states. Would KHO be the way to go for this use case?
> 
> 
> That's an interesting one too. I would lean towards "try with normal device
> tree first" here as well. It's again a very clear case of "firmware wants to
> tell OS about things it knows, but the OS doesn't know" to me. That means
> device tree should be fine to describe it.

I can imagine a description of PHY and MAC state. But the IP configuration
state of the firmware seems to be out of DT scope?

Regards,
Oleksij
Philipp Rudo Feb. 9, 2024, 4:59 p.m. UTC | #6
Hi Alex,

On Fri, 2 Feb 2024 13:58:52 +0100
Alexander Graf <graf@amazon.com> wrote:

> Hi Philipp,
> 
> On 29.01.24 17:34, Philipp Rudo wrote:
> > Hi Alex,
> >
> > adding linux-integrity as there are some synergies with IMA_KEXEC (in case we
> > get KHO to work).
> >
> > Fist of all I believe that having a generic framework to pass information from
> > one kernel to the other across kexec would be a good thing. But I'm afraid that  
> 
> 
> Thanks, I'm happy to hear that you agree with the basic motivation :). 
> There are fundamentally 2 problems with passing data:
> 
>    * Passing structured data in a cross-architecture way
>    * Passing memory
> 
> KHO tackles both. It proposes a common FDT based format that allows us 
> to pass per-subsystem properties. That way, a subsystem does not need to 
> know whether it's running on ARM, x86, RISC-V or s390x. It just gains 
> awareness for KHO and can pass data.
> 
> On top of that, it proposes a standardized "mem" property (and some 
> magic around that) which allows subsystems to pass memory.
> 
> 
> > you are ignoring some fundamental problems which makes it extremely hard, if
> > not impossible, to reliably transfer the kernel's state from one kernel to the
> > other.
> >
> > One thing I don't understand is how reusing the scratch area is working. Sure
> > you pass it's location via the dt/boot_params but I don't see any code that
> > makes it a CMA region. So IIUC the scratch area won't be available for the 2nd
> > kernel. Which is probably for the better as IIUC the 2nd kernel gets loaded and
> > runs inside that area and I don't believe the CMA design ever considered that
> > the kernel image could be included in a CMA area.  
> 
> 
> That one took me a lot to figure out sensibly (with recursion all the 
> way down) while building KHO :). I hope I detailed it sensibly in the 
> documentation - please let me know how to improve it in case it's 
> unclear: https://lore.kernel.org/lkml/20240117144704.602-8-graf@amazon.com/
> 
> Let me explain inline using different words as well what happens:
> 
> The first (and only the first) kernel that boots allocates a CMA region 
> as "scratch region". It loads the new kernel into that region. It passes 
> that region as "scratch region" to the next kernel. The next kernel now 
> takes it and marks every page block that the scratch region spans as CMA:
> 
> https://lore.kernel.org/lkml/20240117144704.602-3-graf@amazon.com/
> 
> The CMA hint doesn't mean we create an actual CMA region. It mostly 
> means that the kernel won't use this memory for any kernel allocations. 
> Kernel allocations up to this point are allocations we don't need to 
> pass on with KHO again. Kernel allocations past that point may be 
> allocations that we want to pass, so we just never place them into the 
> "scratch region" again.
> 
> And because we now already have a scratch region from the previous 
> kernel, we keep reusing that forever with any new KHO kexec.

Thanks for the explanation. I had missed the memblock_mark_scratch in
kho_populate. The code makes much more sense now :-)

That said, for complex series like this one I like to do the
review on a branch in my local git, as that helps avoid problems like this
(or at least makes them less likely). But your patches didn't apply. Can
you tell me what your base is or make your git branch available? That
would be very helpful to me. Thanks!

> > Staying at reusing the scratch area. One thing that is broken for sure is that
> > you reuse the scratch area without ever checking the kho_scratch parameter of
> > the 2nd kernel's command line. Remember, with kexec you are dealing with two
> > different kernels with two different command lines. Meaning you can only reuse
> > the scratch area if the requested size in the 2nd kernel is identical to the
> > one of the 1st kernel. In all other cases you need to adjust the scratch area's
> > size or reserve a new one.  
> 
> 
> Hm. So you're saying a user may want to change the size of the scratch 
> area with a KHO kexec. That's insanely risky because you (as rightfully 
> pointed out below) may have significant fragmentation at that point. And 
> we will only know when we're in the new kernel so it's too late to 
> abort. IMHO it's better to just declare the scratch region as immutable 
> during KHO to avoid that pitfall.

Yes, a user can set any command line with kexec. My expectation as a
user is that the kernel respects whatever I set on the command line and
doesn't think it knows better and simply ignore what I tell it to do.
So even when you make the scratch area immutable during boot you have to
make sure that in the end the kernel respects what the user has set on the
2nd kernel's command line.
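
A minimal sketch of the check argued for here. The names are hypothetical
(scratch_can_be_reused() is not in the patch set) and parse_size() is a
simplified userspace stand-in for the kernel's memparse():

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse sizes such as "512M" or "1G" (K, M and G suffixes only). */
static uint64_t parse_size(const char *str)
{
	char *end;
	uint64_t value = strtoull(str, &end, 0);

	switch (*end) {
	case 'G': case 'g':
		value <<= 30;
		break;
	case 'M': case 'm':
		value <<= 20;
		break;
	case 'K': case 'k':
		value <<= 10;
		break;
	default:
		break;
	}
	return value;
}

/*
 * The inherited scratch region can only be reused as-is if the size
 * requested on the 2nd kernel's command line matches it exactly.
 */
static int scratch_can_be_reused(uint64_t inherited_bytes,
				 const char *requested)
{
	assert(requested != NULL);
	return parse_size(requested) == inherited_bytes;
}
```

What should happen when the sizes differ (resize, reserve a new region,
or refuse the kexec) is the open question here.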

> > This directly leads to the next problem. In kho_reserve_previous_mem you are
> > reusing the different memory regions wherever the 1st kernel allocated them.
> > But that also means you are handing over the 1st kernel's memory
> > fragmentation to the 2nd kernel and you do that extremely early during boot.
> > Which means that users who need to allocate large continuous physical memory,
> > like the scratch area or the crashkernel memory, will have increasing chance to
> > not find a suitable area. Which IMHO is unacceptable.  
> 
> 
> Correct :). It basically means you want to pass large allocations from 
> the 1st kernel that you want to preserve on to the next. So if the 1st 
> kernel allocated a large crash area, it's safest to pass that allocation 
> using KHO to ensure the next kernel also has the region fully reserved. 
> Otherwise the next kernel may accidentally place data into the 
> previously reserved crash region (which would be contiguously free at 
> early init of the 2nd kernel) and fragment it again.

I don't think that this is an option. For one, your suggestion means that
every "large allocation" (whatever that means) needs to be tracked
manually for it to work together with KHO. In addition there is still
the problem that the 2nd kernel may need a larger allocation than the
1st one. Be it because of a command line parameter, e.g. kho_scratch
or crashkernel, or just because the 2nd kernel has a new feature that
requires additional memory. IMO it's inevitable that KHO finds a way to
remove/reduce memory fragmentation.

> > Finally, and that's the big elephant in the room, is your lax handling of the
> > unstable kernel internal ABI. Remember, you are dealing with two different
> > kernels, that also means two different source levels and two different configs.
> > So only because both the 1st and 2nd kernel have a e.g. struct buffer_page
> > doesn't means that they have the same struct buffer_page. But that's what your
> > code implicitly assumes. For KHO ever to make it upstream you need to make sure
> > that both kernels are "speaking the same language".  
> 
> 
> Wow, I hope it didn't come across as that! The whole point of using FDT 
> and compatible strings in KHO is to solve exactly that problem. Any time 
> a passed over data structure changes incompatibly, you would need to 
> modify the compatible string of the subsystem that owns the now 
> incompatible data.
> 
> So in the example of struct buffer_page, it means that if anyone changes 
> the few bits we care about in struct buffer_page, we need to ensure that 
> the new kernel emits "ftrace,cpu-v2" compatible strings. We can at that 
> point choose whether we want to implement compat handling for 
> "ftrace,cpu-v1" style struct buffer_pages or only support same version 
> ingestion.

Well, it came across like that because so far there was absolutely no
explanation of when those versions need to be bumped.

> The one thing that we could improve on here today IMHO is to have 
> compile time errors if any part of struct buffer_page changes 
> semantically: So we'd create a few defines for the bits we want in 
> "ftrace,cpu-v1" as well as size of struct buffer_page and then compare 
> them to what the struct offsets are at compile time to ensure they stay 
> identical.

What do you imagine those macros would look like? How do they work with
structs that change their layout depending on the config?

Personally, I highly doubt that any system that manages these different
versions manually will work reliably. It might be possible for
something as simple as struct buffer_page but once it gets more
complicated, e.g. by depending on the kernel config or simply having
more dependencies to common data structures, it will be a constant
source of pain.
Just assume, although extremely unlikely, that struct list_head is
changed. Most likely the person who makes the change won't be from the
ftrace team and thus won't know that they need to bump the
version. Even the compile time errors will only help if
CONFIG_FTRACE_KHO is enabled, which most likely won't be the case.
Ultimately this means that KHO will break silently until someone tries
to kexec in the new kernel with KHO enabled. But even then there will
only be a cryptic error message (if any) as you have basically
introduced a memory corruption to the 2nd kernel. The more complex the
structs become and the deeper the dependency list goes the more likely
it becomes that such a breaking change is made.

The way I see it there is no way around generating the version based on
the actual memory layout for this particular build.
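
To illustrate the idea, a heavily simplified sketch with made-up names:
struct demo_buffer_page is a stand-in for the real struct, the fields are
listed by hand rather than derived from debuginfo, and FNV-1a is just an
arbitrary choice of hash. It folds the size of the struct plus the offset
and size of every field into one ID for this particular build:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the real struct buffer_page; the fields are made up. */
struct demo_buffer_page {
	unsigned long write;
	unsigned long read;
	void *page;
};

/* Fold one 64-bit value into an FNV-1a hash, byte by byte. */
static uint64_t fnv1a_u64(uint64_t hash, uint64_t value)
{
	for (int i = 0; i < 8; i++) {
		hash ^= (value >> (8 * i)) & 0xff;
		hash *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
	}
	return hash;
}

/* Mix the offset and the size of one member into the hash. */
#define LAYOUT_FIELD(hash, type, member)			\
	fnv1a_u64(fnv1a_u64((hash), offsetof(type, member)),	\
		  sizeof(((type *)0)->member))

/* Layout ID of struct demo_buffer_page for this build. */
static uint64_t demo_buffer_page_layout_id(void)
{
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV-1a offset basis */

	h = fnv1a_u64(h, sizeof(struct demo_buffer_page));
	h = LAYOUT_FIELD(h, struct demo_buffer_page, write);
	h = LAYOUT_FIELD(h, struct demo_buffer_page, read);
	h = LAYOUT_FIELD(h, struct demo_buffer_page, page);
	return h;
}

/* The 2nd kernel only ingests data whose layout ID matches its own. */
static int demo_layout_matches(uint64_t handed_over_id)
{
	return handed_over_id == demo_buffer_page_layout_id();
}
```

This only captures offsets and sizes. It says nothing about semantics, so
cases like a "void *" field that is cast to different types are not
covered.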

> Please let me know how I can clarify that more in the documentation. It 
> really is the absolute core of KHO.
> 
> 
> > Personally I see two possible solutions:
> >
> > 1) You introduce a stable intermediate format for every subsystem similar to
> > what IMA_KEXEC does. This should work for simple types like struct buffer_page
> > but for complex ones like struct vfio_device that's basically impossible.  
> 
> 
> I don't see why. The only reason KHO passes struct buffer_page as memory 
> is because we want to be able to produce traces even after KHO 
> serialization is done. For vfio_device, I think it's perfectly 
> reasonable to serialize any data we need to preserve directly into FDT 
> properties.
> 
> 
> 
> > 2) You also hand over the ABI version for every given type (basically just a
> > hash over all fields including all the dependencies). So the 2nd kernel can
> > verify that the data handed over is in a format it can handle and if not bail
> > out with a descriptive error message rather than reading garbage. Plus side is
> > that once such a system is in place you can reuse it to automatically resolve
> > all dependencies so you no longer need to manually store the buffer_page and
> > its buffer_data_page separately.
> > Down side is that traversing the debuginfo (including the ones from modules) is
> > not a simple task and I expect that such a system will be way more complex than
> > the rest of KHO. In addition there are some cases that the versioning won't be
> > able to capture. For example if a type contains a "void *"-field. Then although
> > the definition of the type is identical in both kernels the field can be cast
> > to different types when used. An other problem will be function pointers which
> > you first need to resolve in the 1st kernel and then map to the identical
> > function in the 2nd kernel. This will become particularly "fun" when the
> > function is part of a module that isn't loaded at the time when you try to
> > recreate the kernel's state.  
> 
> 
> The whole point of KHO is to leave it to the subsystem which path they 
> want to take. The subsystem can either pass binary data and validate as 
> part of FDT properties (like compatible strings). That data can be 
> identical to today's in-kernel data structures (usually a bad idea) or 
> can be a new intermediate data format. But the subsystem can also choose 
> to fully serialize into FDT properties and not pass any memory at all 
> for state that would be in structs. Or something in between.

That's totally fine. My point is that there are simply too many ways to
fuck it up and break the 2nd kernel. That's why I don't believe that we
can rely on the subsystems to "do it right" and "remember to bump the
version". In other words, KHO needs to provide a reliable, automatic
mechanism with which the 2nd kernel can decide if it can handle the
passed data or not.
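
To make the idea of a layout-derived version concrete, here is a minimal
userspace sketch. It is not part of the KHO patches: struct demo_page,
layout_hash() and layout_check() are hypothetical names, and the field list
is written by hand rather than generated from debuginfo. It hashes the total
size plus every field's offset and size, so any change to the memory layout
yields a different value that the 2nd kernel could compare before touching
the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a handed-over struct; not the kernel's buffer_page. */
struct demo_page {
	uint64_t time_stamp;
	uint32_t write;
	uint32_t entries;
	void *data;
};

/* One field's contribution to the layout: where it sits and how big it is. */
struct field_desc {
	size_t offset;
	size_t size;
};

#define FIELD(type, member) \
	{ offsetof(type, member), sizeof(((type *)0)->member) }

static const struct field_desc demo_page_fields[] = {
	FIELD(struct demo_page, time_stamp),
	FIELD(struct demo_page, write),
	FIELD(struct demo_page, entries),
	FIELD(struct demo_page, data),
};

/* Fold one 64-bit value into a running FNV-1a hash, byte by byte. */
static uint64_t fnv1a_u64(uint64_t hash, uint64_t value)
{
	for (int i = 0; i < 8; i++) {
		hash ^= (value >> (8 * i)) & 0xff;
		hash *= 0x100000001b3ULL;	/* 64-bit FNV prime */
	}
	return hash;
}

/* Hash the struct size and every (offset, size) pair of its fields. */
static uint64_t layout_hash(const struct field_desc *fields, size_t n,
			    size_t total_size)
{
	uint64_t hash = 0xcbf29ce484222325ULL;	/* 64-bit FNV offset basis */

	assert(fields != NULL || n == 0);
	hash = fnv1a_u64(hash, total_size);
	for (size_t i = 0; i < n; i++) {
		hash = fnv1a_u64(hash, fields[i].offset);
		hash = fnv1a_u64(hash, fields[i].size);
	}
	return hash;
}

/* Returns 0 if the handed-over layout matches the local one, -1 otherwise. */
static int layout_check(uint64_t handed_over, uint64_t local)
{
	return handed_over == local ? 0 : -1;
}
```

The 1st kernel would store the result of layout_hash() next to the data and
the 2nd kernel would recompute it for its own build and bail out with a
descriptive error on mismatch. As noted above, a hash like this still cannot
see how a "void *" field is cast or which function a function pointer refers
to, and it does not follow dependencies into nested types.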

> > So to summarize, while it would be nice to have a generic framework like KHO to
> > pass data from one kernel to the other via kexec there are good reasons why it
> > doesn't exist, yet.  
> 
> 
> I hope my explanations above clarify things a bit. Let me know if you're 
> at FOSDEM, happy to talk about the internals there as well :)

Sorry, I couldn't make it to FOSDEM but I plan to be at LPC later this
year. In fact I had your talk on my list last year. Unfortunately it ran
in parallel with the kernel debugger MC...

Thanks
Philipp

> Alex
Pratyush Yadav Feb. 16, 2024, 3:29 p.m. UTC | #7
Hi Alex,

On Wed, Jan 17 2024, Alexander Graf wrote:

> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See James' and my Linux Plumbers
> Conference 2023 presentation for details:

I am working on handing userspace pages across kexec. This can be useful
for applications with large in-memory state that can be time consuming
to rebuild. If they can hand their state over across kexec, it allows for
kernel upgrades with lower downtime. As a part of this problem, I have
been looking at plugging all of this into CRIU [0] so I don't have to
modify the applications to use this feature. I can just use CRIU to do
the checkpoint and restore quickly over kexec.

I hacked together some patches for this (which are not yet polished
enough to publish) and ended up implementing something like KHO in a
much more crude way. I have since refactored my patches to use KHO and I
find it quite useful. So thanks for working on this :-)

It was easy enough to get KHO working with my patches though I had to
look into your ftrace patches to get the whole picture. The
documentation can be improved to show how it can be used from the
driver/subsystem perspective. For example, I had to read your ftrace
patches to figure out I should use kho_get_fdt(), or that I should
register a notifier via kho_register_notifier(). I would be happy to
contribute some documentation improvements.

Have you done any analysis on the performance or memory overhead? If
yes, it would be nice to look at some data. I have some concerns with
performance and memory overhead, especially for more fragmented memory
but I don't yet have numbers to present you.

[0] https://github.com/checkpoint-restore/criu

>
>   https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this
> patch implements basic infrastructure to allow hand over of kernel state
> across kexec (Kexec HandOver, aka KHO). As example target, we use ftrace:
> With this patch set applied, you can read ftrace records from the
> pre-kexec environment in your post-kexec one. This creates a very powerful
> debugging and performance analysis tool for kexec. It's also slightly
> easier to reason about than full blown VFIO state preservation.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
>   * Memory Pools [1] - preallocated persistent memory region + allocator
>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
>                 pointer on the kernel command line + allocator
>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>                   address location on the kernel command line
>   * PKRAM [4] - handover of user space pages using a fixed metadata page
>                 specified via command line

FYI, you forgot to paste the links in v3 but I can find them from v2.

Of all these options, PKRAM seems somewhat useful for my use case but
with CRIU it would need to copy all the application pages to the PKRAM
FS and would need at least as much free memory as application memory.

Instead, I have built a simple system that gives an API to userspace to
hand over its pages and to request them back. It then keeps track of the
PID and PA -> VA mappings (essentially a page table). This lets me keep
the pages in-place and avoid needing lots of free memory or expensive
copying. KHO plays a crucial role there in handing those pages and page
tables across to the next kernel.

The FDT format works fairly well for my use case. Since page tables are
a stable data structure, I don't need to worry about their format
changing between kernel versions and can directly pass them through.
This might not be true for many other data structures so subsystems
using those either need to serialize them to FDT or invent their own
serialization formats.

I also wonder how the "mem" array will work for more fragmented
allocations. It might grow very large with lots of scattered elements. I
wonder how both KHO's parsing and memblock will behave in this case. I
have not yet tried stressing it so I can't say for myself.
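
To illustrate the concern, here is a small userspace sketch of how the
number of entries depends on fragmentation. It assumes each "mem" entry is
a physical start address plus a length in bytes; struct mem_range and
coalesce_ranges() are hypothetical names, and I am not claiming KHO merges
ranges this way. The sketch sorts the ranges and merges those that touch or
overlap, which is the best case for the array size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed shape of one "mem" entry: physical start address and length. */
struct mem_range {
	uint64_t addr;
	uint64_t len;
};

/* qsort() comparator: order ranges by ascending start address. */
static int cmp_range(const void *a, const void *b)
{
	const struct mem_range *x = a, *y = b;

	if (x->addr < y->addr)
		return -1;
	if (x->addr > y->addr)
		return 1;
	return 0;
}

/*
 * Sort the array in place and merge ranges that touch or overlap.
 * Returns the number of entries left at the front of the array.
 */
static size_t coalesce_ranges(struct mem_range *r, size_t n)
{
	size_t out = 0;

	assert(r != NULL || n == 0);
	if (n == 0)
		return 0;

	qsort(r, n, sizeof(*r), cmp_range);
	for (size_t i = 1; i < n; i++) {
		uint64_t end = r[out].addr + r[out].len;

		if (r[i].addr <= end) {
			/* Adjacent or overlapping: extend the current entry. */
			uint64_t new_end = r[i].addr + r[i].len;

			if (new_end > end)
				r[out].len = new_end - r[out].addr;
		} else {
			/* Gap: start a new entry. */
			r[++out] = r[i];
		}
	}
	return out + 1;
}
```

Three physically adjacent pages collapse into a single entry, while three
pages separated by holes stay as three entries. With scattered single-page
allocations the array therefore grows linearly with the number of pages,
which is the case worth stress testing.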

>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch region" available
> for kexec: A physically contiguous memory region that is guaranteed to
> not have any memory that KHO would preserve.  The new kernel bootstraps
> itself using the scratch region and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
>
> == Limitations ==
>
> I currently only implemented file based kexec. The kernel interfaces
> in the patch set are already in place to support user space kexec as well,
> but I have not implemented it yet inside kexec tools.
>
> == How to Use ==
>
> To use the code, please boot the kernel with the "kho_scratch=" command
> line parameter set: "kho_scratch=512M". KHO requires a scratch region.
>
> Make sure to fill ftrace with contents that you want to observe after
> kexec.  Then, before you invoke file based "kexec -l", activate KHO:
>
>   # echo 1 > /sys/kernel/kho/active
>   # kexec -l Image --initrd=initrd -s
>   # kexec -e
>
> The new kernel will boot up and contain the previous kernel's trace
> buffers in /sys/kernel/debug/tracing/trace.
>
[...]

Overall, I think KHO is quite useful and I would be happy to see it
evolve and eventually make it into the kernel. It would certainly make
my life a lot easier.

Since I have used it in my patches, I have done some basic testing for
it. Nothing fancy, just handed a few pages across. It works as
advertised. For that,

Tested-by: Pratyush Yadav <ptyadav@amazon.de>