diff mbox series

[v4,09/14] kexec: Add documentation for KHO

Message ID 20250206132754.2596694-10-rppt@kernel.org (mailing list archive)
State New
Headers show
Series kexec: introduce Kexec HandOver (KHO) | expand

Commit Message

Mike Rapoport Feb. 6, 2025, 1:27 p.m. UTC
From: Alexander Graf <graf@amazon.com>

With KHO in place, let's add documentation that describes what it is and
how to use it.

Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 Documentation/kho/concepts.rst   | 80 ++++++++++++++++++++++++++++++++
 Documentation/kho/index.rst      | 19 ++++++++
 Documentation/kho/usage.rst      | 60 ++++++++++++++++++++++++
 Documentation/subsystem-apis.rst |  1 +
 MAINTAINERS                      |  1 +
 5 files changed, 161 insertions(+)
 create mode 100644 Documentation/kho/concepts.rst
 create mode 100644 Documentation/kho/index.rst
 create mode 100644 Documentation/kho/usage.rst

Comments

Jason Gunthorpe Feb. 10, 2025, 7:26 p.m. UTC | #1
On Thu, Feb 06, 2025 at 03:27:49PM +0200, Mike Rapoport wrote:

> +KHO introduces a new concept to its device tree: ``mem`` properties. A
> +``mem`` property can be inside any subnode in the device tree. 

I do not think this is a good idea.

It should be core infrastructure, totally unrelated to any per-device
fdt nodes, to carry the memory map.

IOW a full DT that looks something more like:

/dts-v1/;

/ {
  compatible = "linux-kho,v1";
  allocated-memory {
        <>
  };

  ftracebuffer {
  	       compatible = "linux-kho,ftracem,v1";
	       ftrace-buffer-phys = <..>;
	       ftrace-buffer-len = <..>;
	       ..etc..
  };
};

Where allocated_memory will remove all memory from the buddy allocator
very early on in an efficient way. that process should not be walking
the fdt to find mem nodes.

> +After boot, drivers can call the kho subsystem to transfer ownership of memory
> +that was reserved via a ``mem`` property to themselves to continue using memory
> +from the previous execution.

And this transfer should be done by phys that the node itself
describes.

Ie if ftrace has a single high order folio to store it's ftrace buffer
then I would expect code like:

allocate ftrace:
  buffer = folio_alloc(..);

activate callback:
   kho_preserve_folio(buffer)
   fdt...("ftrace-buffer-phys", virt_to_phys(buffer))

restore callback:
   buffer_phys = fdt..("ftrace-buffer-phys")
   buffer = kho_restore_folio(buffer_phys)
   [..]

destroy ftrace:
   folio_put(buffer);

And kho will take care to restore the struct folio, put back the
order, etc, etc.

Similar for slab.

I think this sort of memory-based operation should be the very basic
core building primitive here.

So the allocated-memory node should preserve information about KHO'd
folios, their order and so on.

It doesn't matter what part of the FDT owns those folios, all the core
kernel should do is keep track of them and at some point check that
all preserved folios have been claimed.

> +We guarantee that we always have such regions through the scratch regions: On
> +first boot KHO allocates several physically contiguous memory regions. Since
> +after kexec these regions will be used by early memory allocations, there is a
> +scratch region per NUMA node plus a scratch region to satisfy allocations
> +requests that do not require particilar NUMA node assignment.

This plan sounds great, way better than the pmem approaches/etc.

> +To enable user space based kexec file loader, the kernel needs to be able to
> +provide the device tree that describes the previous kernel's state before
> +performing the actual kexec. The process of generating that device tree is
> +called serialization. When the device tree is generated, some properties
> +of the system may become immutable because they are already written down
> +in the device tree. That state is called the KHO active phase.

This should have a whole state diagram as we've talked a few
times. There is alot more to worry about here than just 'activate'.

Jason
diff mbox series

Patch

diff --git a/Documentation/kho/concepts.rst b/Documentation/kho/concepts.rst
new file mode 100644
index 000000000000..232bddacc0ef
--- /dev/null
+++ b/Documentation/kho/concepts.rst
@@ -0,0 +1,80 @@ 
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+=======================
+Kexec Handover Concepts
+=======================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+It introduces multiple concepts:
+
+KHO Device Tree
+---------------
+
+Every KHO kexec carries a KHO specific flattened device tree blob that
+describes the state of the system. Device drivers can register to KHO to
+serialize their state before kexec. After KHO, device drivers can read
+the device tree and extract previous state.
+
+KHO only uses the fdt container format and libfdt library, but does not
+adhere to the same property semantics that normal device trees do: Properties
+are passed in native endianness and standardized properties like ``regs`` and
+``ranges`` do not exist, hence there are no ``#...-cells`` properties.
+
+KHO introduces a new concept to its device tree: ``mem`` properties. A
+``mem`` property can be inside any subnode in the device tree. When present,
+it contains an array of physical memory ranges that the new kernel must mark
+as reserved on boot. It is recommended, but not required, to make these ranges
+as physically contiguous as possible to reduce the number of array elements ::
+
+    struct kho_mem {
+            __u64 addr;
+            __u64 len;
+    };
+
+After boot, drivers can call the kho subsystem to transfer ownership of memory
+that was reserved via a ``mem`` property to themselves to continue using memory
+from the previous execution.
+
+The KHO device tree follows the in-Linux schema requirements. Any element in
+the device tree is documented via device tree schema yamls that explain what
+data gets transferred.
+
+Scratch Regions
+---------------
+
+To boot into kexec, we need to have a physically contiguous memory range that
+contains no handed over memory. Kexec then places the target kernel and initrd
+into that region. The new kernel exclusively uses this region for memory
+allocations before during boot up to the initialization of the page allocator.
+
+We guarantee that we always have such regions through the scratch regions: On
+first boot KHO allocates several physically contiguous memory regions. Since
+after kexec these regions will be used by early memory allocations, there is a
+scratch region per NUMA node plus a scratch region to satisfy allocations
+requests that do not require particilar NUMA node assignment.
+By default, size of the scratch region is calculated based on amount of memory
+allocated during boot. The ``kho_scratch`` kernel command line option may be used to explicitly define size of the scratch regions.
+The scratch regions are declared as CMA when page allocator is initialized so
+that their memory can be used during system lifetime. CMA gives us the
+guarantee that no handover pages land in that region, because handover pages
+must be at a static physical memory location and CMA enforces that only
+movable pages can be located inside.
+
+After KHO kexec, we ignore the ``kho_scratch`` kernel command line option and
+instead reuse the exact same region that was originally allocated. This allows
+us to recursively execute any amount of KHO kexecs. Because we used this region
+for boot memory allocations and as target memory for kexec blobs, some parts
+of that memory region may be reserved. These reservations are irrenevant for
+the next KHO, because kexec can overwrite even the original kernel.
+
+KHO active phase
+----------------
+
+To enable user space based kexec file loader, the kernel needs to be able to
+provide the device tree that describes the previous kernel's state before
+performing the actual kexec. The process of generating that device tree is
+called serialization. When the device tree is generated, some properties
+of the system may become immutable because they are already written down
+in the device tree. That state is called the KHO active phase.
diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
new file mode 100644
index 000000000000..5e7eeeca8520
--- /dev/null
+++ b/Documentation/kho/index.rst
@@ -0,0 +1,19 @@ 
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+========================
+Kexec Handover Subsystem
+========================
+
+.. toctree::
+   :maxdepth: 1
+
+   concepts
+   usage
+
+.. only::  subproject and html
+
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/kho/usage.rst b/Documentation/kho/usage.rst
new file mode 100644
index 000000000000..e7300fbb309c
--- /dev/null
+++ b/Documentation/kho/usage.rst
@@ -0,0 +1,60 @@ 
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+====================
+Kexec Handover Usage
+====================
+
+Kexec HandOver (KHO) is a mechanism that allows Linux to preserve state -
+arbitrary properties as well as memory locations - across kexec.
+
+This document expects that you are familiar with the base KHO
+:ref:`Documentation/kho/concepts.rst <concepts>`. If you have not read
+them yet, please do so now.
+
+Prerequisites
+-------------
+
+KHO is available when the ``CONFIG_KEXEC_HANDOVER`` config option is set to y
+at compile time. Every KHO producer may have its own config option that you
+need to enable if you would like to preserve their respective state across
+kexec.
+
+To use KHO, please boot the kernel with the ``kho=on`` command line
+parameter. You may use ``kho_scratch`` parameter to define size of the
+scratch regions. For example ``kho_scratch=512M,512M`` will reserve a 512
+MiB for a global scratch region and 512 MiB per NUMA node scratch regions
+on boot.
+
+Perform a KHO kexec
+-------------------
+
+Before you can perform a KHO kexec, you need to move the system into the
+:ref:`Documentation/kho/concepts.rst <KHO active phase>` ::
+
+  $ echo 1 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is available in ``/sys/kernel/kho/dt``.
+
+Next, load the target payload and kexec into it. It is important that you
+use the ``-s`` parameter to use the in-kernel kexec file loader, as user
+space kexec tooling currently has no support for KHO with the user space
+based file loader ::
+
+  # kexec -l Image --initrd=initrd -s
+  # kexec -e
+
+The new kernel will boot up and contain some of the previous kernel's state.
+
+For example, if you used ``reserve_mem`` command line parameter to create
+an early memory reservation, the new kernel will have that memory at the
+same physical address as the old kernel.
+
+Abort a KHO exec
+----------------
+
+You can move the system out of KHO active phase again by calling ::
+
+  $ echo 1 > /sys/kernel/kho/active
+
+After this command, the KHO device tree is no longer available in
+``/sys/kernel/kho/dt``.
diff --git a/Documentation/subsystem-apis.rst b/Documentation/subsystem-apis.rst
index b52ad5b969d4..5fc69d6ff9f0 100644
--- a/Documentation/subsystem-apis.rst
+++ b/Documentation/subsystem-apis.rst
@@ -90,3 +90,4 @@  Other subsystems
    peci/index
    wmi/index
    tee/index
+   kho/index
diff --git a/MAINTAINERS b/MAINTAINERS
index e1e01b2a3727..82c2ef421c00 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12828,6 +12828,7 @@  S:	Maintained
 W:	http://kernel.org/pub/linux/utils/kernel/kexec/
 F:	Documentation/ABI/testing/sysfs-firmware-kho
 F:	Documentation/ABI/testing/sysfs-kernel-kho
+F:	Documentation/kho/
 F:	include/linux/kexec.h
 F:	include/uapi/linux/kexec.h
 F:	kernel/kexec*