
[v4,00/14] kexec: introduce Kexec HandOver (KHO)

Message ID 20250206132754.2596694-1-rppt@kernel.org (mailing list archive)


Mike Rapoport Feb. 6, 2025, 1:27 p.m. UTC
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>

Hi,

This is the next version of Alex's "kexec: Allow preservation of ftrace
buffers" series
(https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com).
To keep things simple, we decided to preserve "reserve_mem" regions
instead of ftrace buffers.

The patches are also available in git:
https://git.kernel.org/rppt/h/kho/v4


Kexec today considers itself purely a boot loader: When we enter the new
kernel, any state the previous kernel left behind is irrelevant and the
new kernel reinitializes the system.

However, there are use cases where this mode of operation is not what we
actually want. In virtualization hosts for example, we want to use kexec
to update the host kernel while virtual machine memory stays untouched.
When we add device assignment to the mix, we also need to ensure that
IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
need to do the same for the PCI subsystem. If we want to kexec while an
SEV-SNP enabled virtual machine is running, we need to preserve the VM
context pages and physical memory. See "pkernfs: Persisting guest memory
and kernel/device state safely across kexec" Linux Plumbers
Conference 2023 presentation for details:

  https://lpc.events/event/17/contributions/1485/

To start us on the journey to support all the use cases above, this patch
set implements basic infrastructure to allow handover of kernel state
across kexec (Kexec HandOver, aka KHO). As a really simple example target,
we use memblock's reserve_mem.
With this patch set applied, memory that was reserved using "reserve_mem"
command line options remains intact after kexec and it is guaranteed to
reside at the same physical address.

== Alternatives ==

There are alternative approaches to (parts of) the problems above:

  * Memory Pools [1] - preallocated persistent memory region + allocator
  * PRMEM [2] - resizable persistent memory regions with fixed metadata
                pointer on the kernel command line + allocator
  * Pkernfs [3] - preallocated file system for in-kernel data with fixed
                  address location on the kernel command line
  * PKRAM [4] - handover of user space pages using a fixed metadata page
                specified via command line

All of the approaches above fundamentally have the same problem: They
require the administrator to explicitly carve out a physical memory
location because they have no mechanism outside of the kernel command
line to pass data (including memory reservations) between kexec'ing
kernels.

KHO provides that base foundation. We will determine later whether we
still need any of the approaches above for fast bulk memory handover of,
for example, IOMMU page tables. But IMHO they would all be users of KHO,
with KHO providing the foundational primitive to pass metadata and bulk
memory reservations as well as easy versioning for data.

== Overview ==

We introduce a metadata file that the kernels pass between each other. How
they pass it is architecture specific. The file's format is a Flattened
Device Tree (fdt), which already has a generator and parser in Linux.
When the root user enables KHO through /sys/kernel/kho/active, the
kernel invokes callbacks in every driver that supports KHO to serialize
its state. When the actual kexec happens, the fdt is part of the image
set that we boot into. In addition, we keep "scratch regions" available
for kexec: physically contiguous memory regions that are guaranteed not
to contain any memory that KHO would preserve. The new kernel bootstraps
itself using the scratch regions and marks all handed over memory as in
use. When drivers that support KHO initialize, they introspect the fdt and
recover their state from it. This includes memory reservations, which the
driver can either discard or claim.
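
The scratch guarantee above boils down to a simple invariant: no scratch
region may overlap a range that KHO preserves. Here is a minimal user space
sketch of that check. It is not code from the series; struct phys_range and
scratch_is_clean() are hypothetical names made up for illustration, and the
real implementation works on memblock regions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open physical address range [start, end). Hypothetical type. */
struct phys_range {
	uint64_t start;
	uint64_t end;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
static bool ranges_overlap(const struct phys_range *a,
			   const struct phys_range *b)
{
	return a->start < b->end && b->start < a->end;
}

/* Return true if no scratch region contains any preserved memory. */
static bool scratch_is_clean(const struct phys_range *scratch,
			     size_t nr_scratch,
			     const struct phys_range *preserved,
			     size_t nr_preserved)
{
	for (size_t i = 0; i < nr_scratch; i++) {
		/* An empty or inverted range would be a caller bug. */
		assert(scratch[i].start < scratch[i].end);

		for (size_t j = 0; j < nr_preserved; j++) {
			if (ranges_overlap(&scratch[i], &preserved[j]))
				return false;
		}
	}
	return true;
}
```

As long as this invariant holds, the new kernel can allocate freely from
scratch during early boot without having parsed the list of preserved
ranges yet.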

== Limitations ==

Currently KHO is only implemented for file based kexec. The kernel
interfaces in the patch set are already in place to support user space
kexec as well, but this is not yet implemented in kexec tools.

== How to Use ==

To use the code, please boot the kernel with the "kho=on" command line
parameter.
KHO will automatically create scratch regions. If you want to set the
scratch size explicitly you can use the "kho_scratch=" command line
parameter. For instance, "kho_scratch=512M,256M" will create a global
scratch area of 512 MiB and per-node scratch areas of 256 MiB.
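
To make the "global,per-node" form concrete, here is a user space sketch
that parses a value such as "512M,256M" into two byte counts. It is only
loosely modelled on the kernel's memparse() and is not the parser from the
series. parse_size() and parse_kho_scratch() are hypothetical helpers, the
sketch assumes exactly two comma-separated sizes as in the example above,
and it does not check for overflow.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a size with an optional binary suffix (K, M or G).
 * On success, store the byte count in *out, advance *s past the
 * parsed text and return 0. Return -EINVAL if no number was found.
 */
static int parse_size(const char **s, uint64_t *out)
{
	char *end;
	unsigned long long val;

	errno = 0;
	val = strtoull(*s, &end, 0);
	if (end == *s || errno)
		return -EINVAL;

	switch (*end) {
	case 'G': case 'g':
		val <<= 10;
		/* fall through */
	case 'M': case 'm':
		val <<= 10;
		/* fall through */
	case 'K': case 'k':
		val <<= 10;
		end++;
		break;
	default:
		break;
	}

	*out = val;
	*s = end;
	return 0;
}

/* Parse "<global>,<per-node>", e.g. "512M,256M". */
static int parse_kho_scratch(const char *arg, uint64_t *global,
			     uint64_t *pernode)
{
	assert(arg && global && pernode);

	if (parse_size(&arg, global) || *arg != ',')
		return -EINVAL;
	arg++;
	if (parse_size(&arg, pernode) || *arg != '\0')
		return -EINVAL;
	return 0;
}
```

With this sketch, "512M,256M" yields 536870912 and 268435456 bytes, which
matches the 512 MiB global and 256 MiB per-node areas described above.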

Make sure to have a reserved memory range requested with the reserve_mem
command line option. Then, before you invoke file based "kexec -l",
activate KHO:

  # echo 1 > /sys/kernel/kho/active
  # kexec -l Image --initrd=initrd -s
  # kexec -e

The new kernel will boot up and contain the previous kernel's reserve_mem
contents at the same physical address as in the first kernel.
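
If the activation step is driven from a program rather than a shell, the
"echo 1" above is just a one-byte write to the sysfs file. A minimal
sketch, assuming the /sys/kernel/kho/active path from this cover letter;
kho_activate() is a hypothetical helper, not part of the series or of
kexec tools:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Control file named in the cover letter; only present with KHO enabled. */
#define KHO_ACTIVE_PATH "/sys/kernel/kho/active"

/* Write "1" to the given control file. Return 0 or a negative errno. */
static int kho_activate(const char *path)
{
	ssize_t n;
	int fd, ret = 0;

	assert(path);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	n = write(fd, "1", 1);
	if (n < 0)
		ret = -errno;
	else if (n != 1)
		ret = -EIO;

	if (close(fd) < 0 && !ret)
		ret = -errno;
	return ret;
}
```

A caller would run kho_activate(KHO_ACTIVE_PATH) as root before loading the
image with "kexec -l". On a kernel without KHO the file does not exist and
the helper returns -ENOENT.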

== Changelog ==

v3 -> v4:
  - Major rework of scratch management. Rather than forcing scratch memory
    allocations only very early in boot, we now rely on scratch for all
    memblock allocations.
  - Use simple example usecase (reserve_mem instead of ftrace)
  - merge all KHO functionality into a single kernel/kexec_handover.c file
  - rename CONFIG_KEXEC_KHO to CONFIG_KEXEC_HANDOVER

v1 -> v2:
  - Removed: tracing: Introduce names for ring buffers
  - Removed: tracing: Introduce names for events
  - New: kexec: Add config option for KHO
  - New: kexec: Add documentation for KHO
  - New: tracing: Initialize fields before registering
  - New: devicetree: Add bindings for ftrace KHO
  - test bot warning fixes
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Remove / reduce ifdefs
  - Select crc32
  - Leave anything that requires a name in trace.c to keep buffers
    unnamed entities
  - Put events as array into a property, use fingerprint instead of
    names to identify them
  - Reduce footprint without CONFIG_FTRACE_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - make kho_get_fdt() const
  - Add stubs for return_mem and claim_mem
  - make kho_get_fdt() const
  - Get events as array from a property, use fingerprint instead of
    names to identify events
  - Change kconfig option to ARCH_SUPPORTS_KEXEC_KHO
  - s/kho_reserve_mem/kho_reserve_previous_mem/g
  - s/kho_reserve/kho_reserve_scratch/g
  - Leave the node generation code that needs to know the name in
    trace.c so that ring buffers can stay anonymous
  - s/kho_reserve/kho_reserve_scratch/g
  - Move kho enums out of ifdef
  - Move from names to fdt offsets. That way, trace.c can find the trace
    array offset and then the ring buffer code only needs to read out
    its per-CPU data. That way it can stay oblivious to its name.
  - Make kho_get_fdt() const

v2 -> v3:

  - Fix make dt_binding_check
  - Add descriptions for each object
  - s/trace_flags/trace-flags/
  - s/global_trace/global-trace/
  - Make all additionalProperties false
  - Change subject to reflect subsystem (dt-bindings)
  - Fix indentation
  - Remove superfluous examples
  - Convert to 64bit syntax
  - Move to kho directory
  - s/"global_trace"/"global-trace"/
  - s/"global_trace"/"global-trace"/
  - s/"trace_flags"/"trace-flags"/
  - Fix wording
  - Add Documentation to MAINTAINERS file
  - Remove kho reference on read error
  - Move handover_dt unmap up
  - s/reserve_scratch_mem/mark_phys_as_cma/
  - Remove ifdeffery
  - Remove superfluous comment

Alexander Graf (9):
  memblock: Add support for scratch memory
  kexec: Add Kexec HandOver (KHO) generation helpers
  kexec: Add KHO parsing support
  kexec: Add KHO support to kexec file loads
  kexec: Add config option for KHO
  kexec: Add documentation for KHO
  arm64: Add KHO support
  x86: Add KHO support
  memblock: Add KHO support for reserve_mem

Mike Rapoport (Microsoft) (5):
  mm/mm_init: rename init_reserved_page to init_deferred_page
  memblock: add MEMBLOCK_RSRV_KERN flag
  memblock: introduce memmap_init_kho_scratch()
  x86/setup: use memblock_reserve_kern for memory used by kernel
  Documentation: KHO: Add memblock bindings

 Documentation/ABI/testing/sysfs-firmware-kho  |   9 +
 Documentation/ABI/testing/sysfs-kernel-kho    |  53 ++
 .../admin-guide/kernel-parameters.txt         |  24 +
 .../kho/bindings/memblock/reserve_mem.yaml    |  41 +
 .../bindings/memblock/reserve_mem_map.yaml    |  42 +
 Documentation/kho/concepts.rst                |  80 ++
 Documentation/kho/index.rst                   |  19 +
 Documentation/kho/usage.rst                   |  60 ++
 Documentation/subsystem-apis.rst              |   1 +
 MAINTAINERS                                   |   3 +
 arch/arm64/Kconfig                            |   3 +
 arch/x86/Kconfig                              |   3 +
 arch/x86/boot/compressed/kaslr.c              |  52 +-
 arch/x86/include/asm/setup.h                  |   4 +
 arch/x86/include/uapi/asm/setup_data.h        |  13 +-
 arch/x86/kernel/e820.c                        |  18 +
 arch/x86/kernel/kexec-bzimage64.c             |  36 +
 arch/x86/kernel/setup.c                       |  39 +-
 arch/x86/realmode/init.c                      |   2 +
 drivers/of/fdt.c                              |  36 +
 drivers/of/kexec.c                            |  42 +
 include/linux/cma.h                           |   2 +
 include/linux/kexec.h                         |  37 +
 include/linux/kexec_handover.h                |  10 +
 include/linux/memblock.h                      |  38 +-
 kernel/Kconfig.kexec                          |  13 +
 kernel/Makefile                               |   1 +
 kernel/kexec_file.c                           |  19 +
 kernel/kexec_handover.c                       | 808 ++++++++++++++++++
 kernel/kexec_internal.h                       |  16 +
 mm/Kconfig                                    |   4 +
 mm/internal.h                                 |   5 +-
 mm/memblock.c                                 | 247 +++++-
 mm/mm_init.c                                  |  19 +-
 34 files changed, 1775 insertions(+), 24 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-firmware-kho
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
 create mode 100644 Documentation/kho/bindings/memblock/reserve_mem.yaml
 create mode 100644 Documentation/kho/bindings/memblock/reserve_mem_map.yaml
 create mode 100644 Documentation/kho/concepts.rst
 create mode 100644 Documentation/kho/index.rst
 create mode 100644 Documentation/kho/usage.rst
 create mode 100644 include/linux/kexec_handover.h
 create mode 100644 kernel/kexec_handover.c


base-commit: 2014c95afecee3e76ca4a56956a936e23283f05b

Comments

Andrew Morton Feb. 7, 2025, 12:29 a.m. UTC | #1
On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:

> This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> just to make things simpler instead of ftrace we decided to preserve
> "reserve_mem" regions.
> 
> The patches are also available in git:
> https://git.kernel.org/rppt/h/kho/v4
> 
> 
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.

I tossed this into mm.git for some testing and exposure.

What merge path are you anticipating?

Review activity seems pretty thin thus far?
Pasha Tatashin Feb. 7, 2025, 1:28 a.m. UTC | #2
On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
>
> > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > just to make things simpler instead of ftrace we decided to preserve
> > "reserve_mem" regions.
> >
> > The patches are also available in git:
> > https://git.kernel.org/rppt/h/kho/v4
> >
> >
> > Kexec today considers itself purely a boot loader: When we enter the new
> > kernel, any state the previous kernel left behind is irrelevant and the
> > new kernel reinitializes the system.
>
> I tossed this into mm.git for some testing and exposure.
>
> What merge path are you anticipating?
>
> Review activity seems pretty thin thus far?

KHO is going to be discussed at the upcoming LSF/MM. We are also
planning to send v5 of this patch series (discussed with Mike
Rapoport) in a couple of weeks. It will include enhancements needed
for the hypervisor live update scenario:

1. Allow nodes to be added to the KHO tree at any time
2. Remove "activate" (I will also send a live update framework that
provides the activate functionality).
3. Allow serialization during shutdown.
4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
not be used during live update blackout time.
5. Enable multithreaded serialization by using a hash table as an
intermediate step before conversion to FDT.

Pasha
Andrew Morton Feb. 7, 2025, 4:50 a.m. UTC | #3
My x86_64 allmodconfig sayeth:

WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xca (section: .text) -> memblock_alloc_try_nid (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xf5 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x100 (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x11d (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x129 (section: .text) -> scratch_size_pernode (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x14e (section: .text) -> memblock_phys_alloc_range (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x261 (section: .text) -> scratch_size_pernode (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x26d (section: .text) -> scratch_size_pernode (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x29b (section: .text) -> memblock_alloc_range_nid (section: .init.text)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x334 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x33f (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x363 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x371 (section: .text) -> scratch_size_global (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x3a1 (section: .text) -> scratch_scale (section: .init.data)
WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0x3af (section: .text) -> scratch_size_global (section: .init.data)
Mike Rapoport Feb. 7, 2025, 8:01 a.m. UTC | #4
On Thu, Feb 06, 2025 at 08:50:30PM -0800, Andrew Morton wrote:
> My x86_64 allmodconfig sayeth:
> 
> WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xca (section: .text) -> memblock_alloc_try_nid (section: .init.text)
> WARNING: modpost: vmlinux: section mismatch in reference: kho_reserve_scratch+0xf5 (section: .text) -> scratch_scale (section: .init.data)

This should fix it:

From 176767698d4ac5b7cddffe16677b60cb18dce786 Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Fri, 7 Feb 2025 09:57:09 +0200
Subject: [PATCH] kho: make kho_reserve_scratch and kho_init_reserved_pages
 __init

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 kernel/kexec_handover.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c21ea2a09d47..e0b92011afe2 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -620,7 +620,7 @@ static phys_addr_t __init scratch_size(int nid)
  * active. This CMA region will only be used for movable pages which are not a
  * problem for us during KHO because we can just move them somewhere else.
  */
-static void kho_reserve_scratch(void)
+static void __init kho_reserve_scratch(void)
 {
 	phys_addr_t addr, size;
 	int nid, i = 1;
@@ -672,7 +672,7 @@ static void kho_reserve_scratch(void)
  * Scan the DT for any memory ranges and make sure they are reserved in
  * memblock, otherwise they will end up in a weird state on free lists.
  */
-static void kho_init_reserved_pages(void)
+static void __init kho_init_reserved_pages(void)
 {
 	const void *fdt = kho_get_fdt();
 	int offset = 0, depth = 0, initial_depth = 0, len;
Mike Rapoport Feb. 7, 2025, 8:06 a.m. UTC | #5
On Thu, Feb 06, 2025 at 04:29:39PM -0800, Andrew Morton wrote:
> On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> 
> > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > just to make things simpler instead of ftrace we decided to preserve
> > "reserve_mem" regions.
> > 
> > The patches are also available in git:
> > https://git.kernel.org/rppt/h/kho/v4
> > 
> > 
> > Kexec today considers itself purely a boot loader: When we enter the new
> > kernel, any state the previous kernel left behind is irrelevant and the
> > new kernel reinitializes the system.
> 
> I tossed this into mm.git for some testing and exposure.
> 
> What merge path are you anticipating?

I think your tree is the most appropriate, but let's wait for Acks from x86
and arm64 people ;-)

> Review activity seems pretty thin thus far?

Yeah :(
Maybe with Pasha's version on top of that we'll have more people reviewing.

And here is another fixup for a sparse error kbuild reported:


From e1e34b96b96b89a01ee31be223c8dfc2ce1c4cbe Mon Sep 17 00:00:00 2001
From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
Date: Fri, 7 Feb 2025 09:55:03 +0200
Subject: [PATCH] kho: make bin_attr_dt_kern static

Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 kernel/kexec_handover.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
index c26753d613cb..c21ea2a09d47 100644
--- a/kernel/kexec_handover.c
+++ b/kernel/kexec_handover.c
@@ -258,7 +258,7 @@ static ssize_t dt_read(struct file *file, struct kobject *kobj,
 	return count;
 }
 
-struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);
+static struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);
 
 static int kho_expose_dt(void *fdt)
 {
Baoquan He Feb. 8, 2025, 1:38 a.m. UTC | #6
On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> >
> > > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > > just to make things simpler instead of ftrace we decided to preserve
> > > "reserve_mem" regions.
> > >
> > > The patches are also available in git:
> > > https://git.kernel.org/rppt/h/kho/v4
> > >
> > >
> > > Kexec today considers itself purely a boot loader: When we enter the new
> > > kernel, any state the previous kernel left behind is irrelevant and the
> > > new kernel reinitializes the system.
> >
> > I tossed this into mm.git for some testing and exposure.
> >
> > What merge path are you anticipating?
> >
> > Review activity seems pretty thin thus far?
> 
> KHO is going to be discussed at the upcoming lsfmm, we are also
> planning to send v5 of this patch series (discussed with Mike
> Rapoport) in a couple of weeks. It will include enhancements needed
> for the hypervisor live update scenario:

So is this v4 still an RFC if v5 is already planned? Should we hold off
reviewing until v5? Or is this series the infrastructure, with v5
adding more details as you listed below? I am a little confused.

> 
> 1. Allow nodes to be added to the KHO tree at any time
> 2. Remove "activate" (I will also send a live update framework that
> provides the activate functionality).
> 3. Allow serialization during shutdown.
> 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> not be used during live update blackout time.
> 5. Enable multithreaded serialization by using hash-table as an
> intermediate step before conversion to FDT.
Mike Rapoport Feb. 8, 2025, 8:41 a.m. UTC | #7
Hi Baoquan,

On Sat, Feb 08, 2025 at 09:38:27AM +0800, Baoquan He wrote:
> On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > > > just to make things simpler instead of ftrace we decided to preserve
> > > > "reserve_mem" regions.
> > > >
> > > > The patches are also available in git:
> > > > https://git.kernel.org/rppt/h/kho/v4
> > > >
> > > >
> > > > Kexec today considers itself purely a boot loader: When we enter the new
> > > > kernel, any state the previous kernel left behind is irrelevant and the
> > > > new kernel reinitializes the system.
> > >
> > > I tossed this into mm.git for some testing and exposure.
> > >
> > > What merge path are you anticipating?
> > >
> > > Review activity seems pretty thin thus far?
> > 
> > KHO is going to be discussed at the upcoming lsfmm, we are also
> > planning to send v5 of this patch series (discussed with Mike
> > Rapoport) in a couple of weeks. It will include enhancements needed
> > for the hypervisor live update scenario:
> 
> So is this V4 still a RFC if v5 will be sent by plan? Should we hold the
> reviewing until v5? Or this series is a infrustructure building, v5 will
> add more details as you listed as below. I am a little confused.

v4 adds the very basic support for kexec handover in the simplest form we
could think of. There were discussions at the Linux MM Alignment and
Hypervisor live update meetings, and people there agreed on an MVP for
KHO that v4 essentially implements.

v5 will add more details on top of v4 and I'm not sure there's a consensus
about some of them among the people involved in KHO.
 
> > 1. Allow nodes to be added to the KHO tree at any time
> > 2. Remove "activate" (I will also send a live update framework that
> > provides the activate functionality).
> > 3. Allow serialization during shutdown.
> > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > not be used during live update blackout time.
> > 5. Enable multithreaded serialization by using hash-table as an
> > intermediate step before conversion to FDT.
>
Baoquan He Feb. 8, 2025, 11:13 a.m. UTC | #8
On 02/08/25 at 10:41am, Mike Rapoport wrote:
> Hi Baoquan,
> 
> On Sat, Feb 08, 2025 at 09:38:27AM +0800, Baoquan He wrote:
> > On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > > > > just to make things simpler instead of ftrace we decided to preserve
> > > > > "reserve_mem" regions.
> > > > >
> > > > > The patches are also available in git:
> > > > > https://git.kernel.org/rppt/h/kho/v4
> > > > >
> > > > >
> > > > > Kexec today considers itself purely a boot loader: When we enter the new
> > > > > kernel, any state the previous kernel left behind is irrelevant and the
> > > > > new kernel reinitializes the system.
> > > >
> > > > I tossed this into mm.git for some testing and exposure.
> > > >
> > > > What merge path are you anticipating?
> > > >
> > > > Review activity seems pretty thin thus far?
> > > 
> > > KHO is going to be discussed at the upcoming lsfmm, we are also
> > > planning to send v5 of this patch series (discussed with Mike
> > > Rapoport) in a couple of weeks. It will include enhancements needed
> > > for the hypervisor live update scenario:
> > 
> > So is this V4 still a RFC if v5 will be sent by plan? Should we hold the
> > reviewing until v5? Or this series is a infrustructure building, v5 will
> > add more details as you listed as below. I am a little confused.
> 
> v4 adds the very basic support for kexec handover in the simplest form we
> could think of. There were discussions on Linux MM Alignment and Hypervisor
> live update meetings and there people agreed about MVP for KHO that v4
> essentially implements.
> 
> v5 will add more details on top of v4 and I'm not sure there's a consensus
> about some of them among the people involved in KHO.

Thanks for the information.

Then I will apply v4 and learn the infrastructure and mechanism first.

What sounds more meaningful to me is that v4 gets reviewed, then updated
and merged. Then another patchset can be posted to add details, if you have
reached consensus on the infrastructure part. With that, posting and
reviewing will be much easier. Unless you guys are still discussing the
infrastructure part.

>  
> > > 1. Allow nodes to be added to the KHO tree at any time
> > > 2. Remove "activate" (I will also send a live update framework that
> > > provides the activate functionality).
> > > 3. Allow serialization during shutdown.
> > > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > > not be used during live update blackout time.
> > > 5. Enable multithreaded serialization by using hash-table as an
> > > intermediate step before conversion to FDT.
> > 
> 
> -- 
> Sincerely yours,
> Mike.
>
Cong Wang Feb. 8, 2025, 11:39 p.m. UTC | #9
Hi Mike,

On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Hi,
>
> This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> just to make things simpler instead of ftrace we decided to preserve
> "reserve_mem" regions.
>
> The patches are also available in git:
> https://git.kernel.org/rppt/h/kho/v4
>
>
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See "pkernfs: Persisting guest memory
> and kernel/device state safely across kexec" Linux Plumbers
> Conference 2023 presentation for details:
>
>   https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this patch
> implements basic infrastructure to allow hand over of kernel state across
> kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> memblock's reserve_mem.
> With this patch set applied, memory that was reserved using "reserve_mem"
> command line options remains intact after kexec and it is guaranteed to
> reside at the same physical address.

Nice work!

One concern is that using memblock to reserve memory, as crashkernel=
does, is not flexible. I worked on kdump years ago and one of its biggest
pains is deciding how much memory should be reserved with crashkernel=.
It is still a pain today.

If we reserve more, that means more waste for the 1st kernel. If we
reserve less, that means more OOMs in the 2nd kernel.

I'd suggest considering CMA, where the "reserved" memory can still be
reused for other purposes, as long as pages can be migrated out of this
reserved region on demand, that is, when loading a kexec kernel. Of course,
we need to make sure they are not reused by what you want to preserve here,
e.g., IOMMU. So you might need additional work to make it work, but I
still believe this is the right direction.

Just my two cents.

Thanks!
Pasha Tatashin Feb. 9, 2025, 12:13 a.m. UTC | #10
On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi Mike,
>
> On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> >
> > Hi,
> >
> > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > just to make things simpler instead of ftrace we decided to preserve
> > "reserve_mem" regions.
> >
> > The patches are also available in git:
> > https://git.kernel.org/rppt/h/kho/v4
> >
> >
> > Kexec today considers itself purely a boot loader: When we enter the new
> > kernel, any state the previous kernel left behind is irrelevant and the
> > new kernel reinitializes the system.
> >
> > However, there are use cases where this mode of operation is not what we
> > actually want. In virtualization hosts for example, we want to use kexec
> > to update the host kernel while virtual machine memory stays untouched.
> > When we add device assignment to the mix, we also need to ensure that
> > IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> > need to do the same for the PCI subsystem. If we want to kexec while an
> > SEV-SNP enabled virtual machine is running, we need to preserve the VM
> > context pages and physical memory. See "pkernfs: Persisting guest memory
> > and kernel/device state safely across kexec" Linux Plumbers
> > Conference 2023 presentation for details:
> >
> >   https://lpc.events/event/17/contributions/1485/
> >
> > To start us on the journey to support all the use cases above, this patch
> > implements basic infrastructure to allow hand over of kernel state across
> > kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> > memblock's reserve_mem.
> > With this patch set applied, memory that was reserved using "reserve_mem"
> > command line options remains intact after kexec and it is guaranteed to
> > reside at the same physical address.
>
> Nice work!
>
> One concern there is that using memblock to reserve memory as crashkernel=
> is not flexible. I worked on kdump years ago and one of the biggest pains
> of kdump is how much memory should be reserved with crashkernel=. And
> it is still a pain today.
>
> If we reserve more, that would mean more waste for the 1st kernel. If we
> reserve less, that would induce more OOM for the 2nd kernel.
>
> I'd suggest considering using CMA, where the "reserved" memory can be
> still reusable for other purposes, just that pages can be migrated out of this
> reserved region on demand, that is, when loading a kexec kernel. Of course,
> we need to make sure they are not reused by what you want to preserve here,
> e.g., IOMMU. So you might need additional work to make it work, but still I
> believe this is the right direction.

This is exactly what scratch memory is used for. Unlike crashkernel=,
the entire scratch area is available to user applications as CMA, as
we know that no kernel-reserved memory will come from that area. This
doesn't work for crashkernel=, because in some cases, the user pages
might also need to be preserved in the crash dump. However, if user
pages are going to be discarded from the crash dump (as is done 99% of
the time), then it is better to make crashkernel= use CMA or
ZONE_MOVABLE as well, so that only the memory occupied by the crash
kernel itself is set aside and nothing else is wasted. We have an
internal patch at Google that does this, and I think it would be a good
improvement for the upstream kernel to carry as well.

Pasha

>
> Just my two cents.
>
> Thanks!
Pasha Tatashin Feb. 9, 2025, 12:23 a.m. UTC | #11
On Fri, Feb 7, 2025 at 8:38 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > > > just to make things simpler instead of ftrace we decided to preserve
> > > > "reserve_mem" regions.
> > > >
> > > > The patches are also available in git:
> > > > https://git.kernel.org/rppt/h/kho/v4
> > > >
> > > >
> > > > Kexec today considers itself purely a boot loader: When we enter the new
> > > > kernel, any state the previous kernel left behind is irrelevant and the
> > > > new kernel reinitializes the system.
> > >
> > > I tossed this into mm.git for some testing and exposure.
> > >
> > > What merge path are you anticipating?
> > >
> > > Review activity seems pretty thin thus far?
> >
> > KHO is going to be discussed at the upcoming lsfmm, we are also
> > planning to send v5 of this patch series (discussed with Mike
> > Rapoport) in a couple of weeks. It will include enhancements needed
> > for the hypervisor live update scenario:
>
> So is this V4 still a RFC if v5 will be sent by plan? Should we hold the
> reviewing until v5? Or this series is a infrustructure building, v5 will
> add more details as you listed as below. I am a little confused.

We will modify the existing patches and send them as v5, because some
interfaces are going to be changed*.

Otherwise, v5 will make KHO a lot more flexible, as it will allow
using the tree at any time while the system is running instead of only
once during the activation phase.

* Changing the interfaces is optional; whether to change them will be
discussed at the Hypervisor Live Update meeting on Feb 10th:
https://lore.kernel.org/all/26a4b7ca-93a6-30e2-923b-f551ced03d62@google.com/

>
> >
> > 1. Allow nodes to be added to the KHO tree at any time
> > 2. Remove "activate" (I will also send a live update framework that
> > provides the activate functionality).
> > 3. Allow serialization during shutdown.
> > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > not be used during live update blackout time.
> > 5. Enable multithreaded serialization by using hash-table as an
> > intermediate step before conversion to FDT.
>
Cong Wang Feb. 9, 2025, 12:51 a.m. UTC | #12
Hi Mike,

On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch regions" available
> for kexec: A physically contiguous memory regions that is guaranteed to
> not have any memory that KHO would preserve.  The new kernel bootstraps
> itself using the scratch regions and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.

I have gone through your entire patchset. If you could provide an example
of a specific driver that supports KHO, it would help people a lot to
understand it and, more importantly, help driver developers adopt it.
Even a simulated driver, e.g. netdevsim, would be greatly helpful.

Thanks.
Cong Wang Feb. 9, 2025, 1 a.m. UTC | #13
On Sat, Feb 8, 2025 at 4:14 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote:
>
> On Sat, Feb 8, 2025 at 6:39 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Hi Mike,
> >
> > On Thu, Feb 6, 2025 at 5:28 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
> > >
> > > Hi,
> > >
> > > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > > just to make things simpler instead of ftrace we decided to preserve
> > > "reserve_mem" regions.
> > >
> > > The patches are also available in git:
> > > https://git.kernel.org/rppt/h/kho/v4
> > >
> > >
> > > Kexec today considers itself purely a boot loader: When we enter the new
> > > kernel, any state the previous kernel left behind is irrelevant and the
> > > new kernel reinitializes the system.
> > >
> > > However, there are use cases where this mode of operation is not what we
> > > actually want. In virtualization hosts for example, we want to use kexec
> > > to update the host kernel while virtual machine memory stays untouched.
> > > When we add device assignment to the mix, we also need to ensure that
> > > IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> > > need to do the same for the PCI subsystem. If we want to kexec while an
> > > SEV-SNP enabled virtual machine is running, we need to preserve the VM
> > > context pages and physical memory. See "pkernfs: Persisting guest memory
> > > and kernel/device state safely across kexec" Linux Plumbers
> > > Conference 2023 presentation for details:
> > >
> > >   https://lpc.events/event/17/contributions/1485/
> > >
> > > To start us on the journey to support all the use cases above, this patch
> > > implements basic infrastructure to allow hand over of kernel state across
> > > kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> > > memblock's reserve_mem.
> > > With this patch set applied, memory that was reserved using "reserve_mem"
> > > command line options remains intact after kexec and it is guaranteed to
> > > reside at the same physical address.
> >
> > Nice work!
> >
> > One concern there is that using memblock to reserve memory as crashkernel=
> > is not flexible. I worked on kdump years ago and one of the biggest pains
> > of kdump is how much memory should be reserved with crashkernel=. And
> > it is still a pain today.
> >
> > If we reserve more, that would mean more waste for the 1st kernel. If we
> > reserve less, that would induce more OOM for the 2nd kernel.
> >
> > I'd suggest considering using CMA, where the "reserved" memory can be
> > still reusable for other purposes, just that pages can be migrated out of this
> > reserved region on demand, that is, when loading a kexec kernel. Of course,
> > we need to make sure they are not reused by what you want to preserve here,
> > e.g., IOMMU. So you might need additional work to make it work, but still I
> > believe this is the right direction.
>
> This is exactly what scratch memory is used for. Unlike crashkernel=,
> the entire scratch area is available to user applications as CMA, as
> we know that no kernel-reserved memory will come from that area. This
> doesn't work for crashkernel=, because in some cases, the user pages
> might also need to be preserved in the crash dump. However, if user
> pages are going to be discarded from the crash dump (as is done 99% of
> the time), then it is better to also make it use CMA or ZONE_MOVABLE
> and use only the memory occupied by the crash kernel and do not waste
> any memory at all. We have an internal patch at Google that does this,
> and I think it would be a good improvement for the upstream kernel to
> carry as well.

Good to know CMA is already used; I could not tell from the cover letter.

User-space pages need to be preserved in scenarios like RDMA, which
pins user-space pages for DMA transfers. Since the goal here is also to
preserve hardware state such as RDMA's, I guess the same concern
remains.

Thanks!
Baoquan He Feb. 9, 2025, 3:07 a.m. UTC | #14
On 02/08/25 at 07:23pm, Pasha Tatashin wrote:
> On Fri, Feb 7, 2025 at 8:38 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 02/06/25 at 08:28pm, Pasha Tatashin wrote:
> > > On Thu, Feb 6, 2025 at 7:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > > This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> > > > > series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> > > > > just to make things simpler instead of ftrace we decided to preserve
> > > > > "reserve_mem" regions.
> > > > >
> > > > > The patches are also available in git:
> > > > > https://git.kernel.org/rppt/h/kho/v4
> > > > >
> > > > >
> > > > > Kexec today considers itself purely a boot loader: When we enter the new
> > > > > kernel, any state the previous kernel left behind is irrelevant and the
> > > > > new kernel reinitializes the system.
> > > >
> > > > I tossed this into mm.git for some testing and exposure.
> > > >
> > > > What merge path are you anticipating?
> > > >
> > > > Review activity seems pretty thin thus far?
> > >
> > > KHO is going to be discussed at the upcoming lsfmm, we are also
> > > planning to send v5 of this patch series (discussed with Mike
> > > Rapoport) in a couple of weeks. It will include enhancements needed
> > > for the hypervisor live update scenario:
> >
> > So is this V4 still a RFC if v5 will be sent by plan? Should we hold the
> > reviewing until v5? Or this series is a infrustructure building, v5 will
> > add more details as you listed as below. I am a little confused.
> 
> We will modify the existing patches and send as v5 because some
> interfaces are going to be changed*.
> 
> Otherwise, v5 will make KHO a lot more flexible as it will allow to
> use the tree all the time while the system is running instead of only
> once during the activation phase.
> 
> * Changing interfaces  is optional, but decision whether to change
> will be discussed at Hypervisor Live Update on Feb 10th:
> https://lore.kernel.org/all/26a4b7ca-93a6-30e2-923b-f551ced03d62@google.com/


Ah, this is what I wanted to know about the difference between v4
and v5. Thanks for the information; I am looking forward to seeing the
v5 update.

> 
> >
> > >
> > > 1. Allow nodes to be added to the KHO tree at any time
> > > 2. Remove "activate" (I will also send a live update framework that
> > > provides the activate functionality).
> > > 3. Allow serialization during shutdown.
> > > 4. Decouple KHO from kexec_file_load(), as kexec_file_load() should
> > > not be used during live update blackout time.
> > > 5. Enable multithreaded serialization by using hash-table as an
> > > intermediate step before conversion to FDT.
> >
>
Krzysztof Kozlowski Feb. 9, 2025, 10:33 a.m. UTC | #15
On 07/02/2025 01:29, Andrew Morton wrote:
> On Thu,  6 Feb 2025 15:27:40 +0200 Mike Rapoport <rppt@kernel.org> wrote:
> 
>> This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
>> series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
>> just to make things simpler instead of ftrace we decided to preserve
>> "reserve_mem" regions.
>>
>> The patches are also available in git:
>> https://git.kernel.org/rppt/h/kho/v4
>>
>>
>> Kexec today considers itself purely a boot loader: When we enter the new
>> kernel, any state the previous kernel left behind is irrelevant and the
>> new kernel reinitializes the system.
> 
> I tossed this into mm.git for some testing and exposure.
> 
> What merge path are you anticipating?
> 
> Review activity seems pretty thin thus far?


At least for the DT ABI, that is because:
1. For some reason this escaped Patchwork. Maybe it was blocked by spam
filters, maybe the Cc list is too big. No clue.
2. At the same time, the fallback to Patchwork was avoided by
Cc-ing the wrong address and not using the expected (see git log)
subject prefixes.

Best regards,
Krzysztof
RuiRui Yang Feb. 17, 2025, 3:19 a.m. UTC | #16
On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
>
> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>
> Hi,
>
> This a next version of Alex's "kexec: Allow preservation of ftrace buffers"
> series (https://lore.kernel.org/all/20240117144704.602-1-graf@amazon.com),
> just to make things simpler instead of ftrace we decided to preserve
> "reserve_mem" regions.
>
> The patches are also available in git:
> https://git.kernel.org/rppt/h/kho/v4
>
>
> Kexec today considers itself purely a boot loader: When we enter the new
> kernel, any state the previous kernel left behind is irrelevant and the
> new kernel reinitializes the system.
>
> However, there are use cases where this mode of operation is not what we
> actually want. In virtualization hosts for example, we want to use kexec
> to update the host kernel while virtual machine memory stays untouched.
> When we add device assignment to the mix, we also need to ensure that
> IOMMU and VFIO states are untouched. If we add PCIe peer to peer DMA, we
> need to do the same for the PCI subsystem. If we want to kexec while an
> SEV-SNP enabled virtual machine is running, we need to preserve the VM
> context pages and physical memory. See "pkernfs: Persisting guest memory
> and kernel/device state safely across kexec" Linux Plumbers
> Conference 2023 presentation for details:
>
>   https://lpc.events/event/17/contributions/1485/
>
> To start us on the journey to support all the use cases above, this patch
> implements basic infrastructure to allow hand over of kernel state across
> kexec (Kexec HandOver, aka KHO). As a really simple example target, we use
> memblock's reserve_mem.
> With this patch set applied, memory that was reserved using "reserve_mem"
> command line options remains intact after kexec and it is guaranteed to
> reside at the same physical address.
>
> == Alternatives ==
>
> There are alternative approaches to (parts of) the problems above:
>
>   * Memory Pools [1] - preallocated persistent memory region + allocator
>   * PRMEM [2] - resizable persistent memory regions with fixed metadata
>                 pointer on the kernel command line + allocator
>   * Pkernfs [3] - preallocated file system for in-kernel data with fixed
>                   address location on the kernel command line
>   * PKRAM [4] - handover of user space pages using a fixed metadata page
>                 specified via command line
>
> All of the approaches above fundamentally have the same problem: They
> require the administrator to explicitly carve out a physical memory
> location because they have no mechanism outside of the kernel command
> line to pass data (including memory reservations) between kexec'ing
> kernels.
>
> KHO provides that base foundation. We will determine later whether we
> still need any of the approaches above for fast bulk memory handover of for
> example IOMMU page tables. But IMHO they would all be users of KHO, with
> KHO providing the foundational primitive to pass metadata and bulk memory
> reservations as well as provide easy versioning for data.
>
> == Overview ==
>
> We introduce a metadata file that the kernels pass between each other. How
> they pass it is architecture specific. The file's format is a Flattened
> Device Tree (fdt) which has a generator and parser already included in
> Linux. When the root user enables KHO through /sys/kernel/kho/active, the
> kernel invokes callbacks to every driver that supports KHO to serialize
> its state. When the actual kexec happens, the fdt is part of the image
> set that we boot into. In addition, we keep a "scratch regions" available
> for kexec: A physically contiguous memory regions that is guaranteed to
> not have any memory that KHO would preserve.  The new kernel bootstraps
> itself using the scratch regions and sets all handed over memory as in use.
> When drivers initialize that support KHO, they introspect the fdt and
> recover their state from it. This includes memory reservations, where the
> driver can either discard or claim reservations.
>
> == Limitations ==
>
> Currently KHO is only implemented for file based kexec. The kernel
> interfaces in the patch set are already in place to support user space
> kexec as well, but it is still not implemented it yet inside kexec tools.
>

On which architectures exactly does KHO work? Device Tree
should be OK on arm*, x86 and power*, but how about s390?

Thanks
Dae
Mike Rapoport Feb. 19, 2025, 7:32 a.m. UTC | #17
On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
> On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
> > == Limitations ==
> >
> > Currently KHO is only implemented for file based kexec. The kernel
> > interfaces in the patch set are already in place to support user space
> > kexec as well, but it is still not implemented it yet inside kexec tools.
> >
> 
> What architecture exactly does this KHO work fine?   Device Tree
> should be ok on arm*, x86 and power*, but how about s390?

KHO does not use device tree as the boot protocol; it uses FDT as a data
structure and adds architecture specific bits to the boot structures to
point to that data, very similar to how IMA_KEXEC works.

Currently KHO is implemented on arm64 and x86, but there is no fundamental
reason why it wouldn't work on any architecture that supports kexec.
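
To illustrate the "architecture specific bits" on x86: boot data is
chained through a setup_data list, and the new kernel looks for the entry
that points at the FDT. The following is only a rough user-space sketch
under stated assumptions. The struct setup_data layout mirrors the x86
boot protocol, but SETUP_KHO_EXAMPLE, struct kho_example_data and
find_kho() are hypothetical names for illustration, not the identifiers
used in the series, and addresses are dereferenced directly instead of
being mapped first as early boot code would have to do.

```c
#include <stddef.h>
#include <stdint.h>

/* Same field layout as x86 struct setup_data: next, type, len, data[]. */
struct setup_data {
	uint64_t next;	/* address of the next node, 0 terminates the list */
	uint32_t type;	/* what kind of payload follows */
	uint32_t len;	/* payload length in bytes */
	uint8_t  data[];
};

/* Hypothetical type value; the real constant is defined by the series. */
#define SETUP_KHO_EXAMPLE 0x1000u

/* Hypothetical payload: tells the new kernel where the KHO FDT lives. */
struct kho_example_data {
	uint64_t fdt_addr;
	uint64_t fdt_size;
};

/*
 * Walk the list starting at 'head' and return the KHO payload, or NULL
 * if the previous kernel did not hand anything over. This sketch assumes
 * the addresses are directly dereferenceable.
 */
static const struct kho_example_data *find_kho(uint64_t head)
{
	while (head) {
		const struct setup_data *sd =
			(const struct setup_data *)(uintptr_t)head;

		if (sd->type == SETUP_KHO_EXAMPLE &&
		    sd->len >= sizeof(struct kho_example_data))
			return (const struct kho_example_data *)sd->data;

		head = sd->next;
	}
	return NULL;
}
```

The FDT itself never becomes the boot protocol here; only a small pointer
record travels through the existing per-architecture channel.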
 
> Thanks
> Dae
>
Dave Young Feb. 19, 2025, 12:49 p.m. UTC | #18
On Wed, 19 Feb 2025 at 15:32, Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
> > On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
> > > == Limitations ==
> > >
> > > Currently KHO is only implemented for file based kexec. The kernel
> > > interfaces in the patch set are already in place to support user space
> > > kexec as well, but it is still not implemented it yet inside kexec tools.
> > >
> >
> > What architecture exactly does this KHO work fine?   Device Tree
> > should be ok on arm*, x86 and power*, but how about s390?
>
> KHO does not use device tree as the boot protocol, it uses FDT as a data
> structure and adds architecture specific bits to the boot structures to
> point to that data, very similar to how IMA_KEXEC works.
>
> Currently KHO is implemented on arm64 and x86, but there is no fundamental
> reason why it wouldn't work on any architecture that supports kexec.

Well, the problem is whether there is a way to add the dtb in the early
boot path. For x86 it is added via setup_data; if there is no such
way, I'm not sure it is doable, especially for passing info needed
during early boot. Then KHO would only serve limited use cases.

>
> > Thanks
> > Dae
> >
>
> --
> Sincerely yours,
> Mike.
>
Alexander Graf Feb. 19, 2025, 1:54 p.m. UTC | #19
On 19.02.25 13:49, Dave Young wrote:
> On Wed, 19 Feb 2025 at 15:32, Mike Rapoport <rppt@kernel.org> wrote:
>> On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
>>> On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
>>>> == Limitations ==
>>>>
>>>> Currently KHO is only implemented for file based kexec. The kernel
>>>> interfaces in the patch set are already in place to support user space
>>>> kexec as well, but it is still not implemented it yet inside kexec tools.
>>>>
>>> What architecture exactly does this KHO work fine?   Device Tree
>>> should be ok on arm*, x86 and power*, but how about s390?
>> KHO does not use device tree as the boot protocol, it uses FDT as a data
>> structure and adds architecture specific bits to the boot structures to
>> point to that data, very similar to how IMA_KEXEC works.
>>
>> Currently KHO is implemented on arm64 and x86, but there is no fundamental
>> reason why it wouldn't work on any architecture that supports kexec.
> Well,  the problem is whether there is a way to  add dtb in the early
> boot path,  for X86 it is added via setup_data,  if there is no such
> way I'm not sure if it is doable especially for passing some info for
> early boot use.  Then the KHO will be only for limited use cases.


Every architecture has a platform specific way of passing data into the 
kernel so it can find its command line and initrd. S390x for example has 
struct parmarea. To enable s390x, you would remove some of its padding 
and replace it with a KHO base addr + size, so that the new kernel can 
find the KHO state tree.
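
As a rough sketch of that idea (not the real s390 layout: the struct
names, field names and the 64-byte size below are all made up for
illustration), carving two fields out of reserved padding keeps every
existing field at its old offset and the overall size unchanged:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_AREA_SIZE 64	/* hypothetical fixed size of the area */

/* Hypothetical "before" layout: known fields plus reserved padding. */
struct example_parmarea_old {
	uint64_t initrd_start;
	uint64_t initrd_size;
	uint8_t  pad[EXAMPLE_AREA_SIZE - 2 * sizeof(uint64_t)];
};

/* Hypothetical "after" layout: KHO base and size take over some padding. */
struct example_parmarea_new {
	uint64_t initrd_start;
	uint64_t initrd_size;
	uint64_t kho_base;	/* address of the KHO state tree */
	uint64_t kho_size;	/* size of the KHO state tree in bytes */
	uint8_t  pad[EXAMPLE_AREA_SIZE - 4 * sizeof(uint64_t)];
};

/* The area must not grow, and the new fields must start where pad did. */
static_assert(sizeof(struct example_parmarea_old) == EXAMPLE_AREA_SIZE,
	      "old layout has the fixed size");
static_assert(sizeof(struct example_parmarea_new) == EXAMPLE_AREA_SIZE,
	      "new layout keeps the fixed size");
static_assert(offsetof(struct example_parmarea_new, kho_base) ==
	      offsetof(struct example_parmarea_old, pad),
	      "KHO fields reuse former padding");

/* Zeroed padding from an older kernel reads as "nothing handed over". */
static int example_has_kho(const struct example_parmarea_new *p)
{
	return p->kho_base != 0 && p->kho_size != 0;
}
```

Because the padding was previously zero, a kernel that knows about the
new fields can treat a zero base or size as "no KHO data present".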


Alex
Dave Young Feb. 20, 2025, 1:49 a.m. UTC | #20
On Wed, 19 Feb 2025 at 21:55, Alexander Graf <graf@amazon.com> wrote:
>
>
> On 19.02.25 13:49, Dave Young wrote:
> > On Wed, 19 Feb 2025 at 15:32, Mike Rapoport <rppt@kernel.org> wrote:
> >> On Mon, Feb 17, 2025 at 11:19:45AM +0800, RuiRui Yang wrote:
> >>> On Thu, 6 Feb 2025 at 21:34, Mike Rapoport <rppt@kernel.org> wrote:
> >>>> == Limitations ==
> >>>>
> >>>> Currently KHO is only implemented for file based kexec. The kernel
> >>>> interfaces in the patch set are already in place to support user space
> >>>> kexec as well, but it is still not implemented it yet inside kexec tools.
> >>>>
> >>> What architecture exactly does this KHO work fine?   Device Tree
> >>> should be ok on arm*, x86 and power*, but how about s390?
> >> KHO does not use device tree as the boot protocol, it uses FDT as a data
> >> structure and adds architecture specific bits to the boot structures to
> >> point to that data, very similar to how IMA_KEXEC works.
> >>
> >> Currently KHO is implemented on arm64 and x86, but there is no fundamental
> >> reason why it wouldn't work on any architecture that supports kexec.
> > Well,  the problem is whether there is a way to  add dtb in the early
> > boot path,  for X86 it is added via setup_data,  if there is no such
> > way I'm not sure if it is doable especially for passing some info for
> > early boot use.  Then the KHO will be only for limited use cases.
>
>
> Every architecture has a platform specific way of passing data into the
> kernel so it can find its command line and initrd. S390x for example has
> struct parmarea. To enable s390x, you would remove some of its padding
> and replace it with a KHO base addr + size, so that the new kernel can
> find the KHO state tree.

OK, thanks for the info. I cc'ed the s390 people; maybe they can provide
input.

Other than the arch concern, I'm not so excited about KHO, because
kexec reboot has a fundamental problem that keeps us (the Red Hat
kexec/kdump team) from fully supporting it in the RHEL distribution:
stability. Drivers often do not implement the device shutdown method,
or it is not well tested. From time to time we see weird bugs, which
could be malfunctioning devices or memory corruption caused by ongoing
DMA, etc. Also, for the time being there is no way to make some
graphics/DRM drivers work after a kexec reboot; it might happen to work
by luck, but it is not stable.

So I personally think that addressing the above concern is more
important than introducing more features that rely on kexec reboot.

>
>
> Alex
>
>
Alexander Gordeev Feb. 20, 2025, 4:43 p.m. UTC | #21
On Thu, Feb 20, 2025 at 09:49:52AM +0800, Dave Young wrote:
> On Wed, 19 Feb 2025 at 21:55, Alexander Graf <graf@amazon.com> wrote:
> > >>> What architecture exactly does this KHO work fine?   Device Tree
> > >>> should be ok on arm*, x86 and power*, but how about s390?
> > >> KHO does not use device tree as the boot protocol, it uses FDT as a data
> > >> structure and adds architecture specific bits to the boot structures to
> > >> point to that data, very similar to how IMA_KEXEC works.
> > >>
> > >> Currently KHO is implemented on arm64 and x86, but there is no fundamental
> > >> reason why it wouldn't work on any architecture that supports kexec.
> > > Well,  the problem is whether there is a way to  add dtb in the early
> > > boot path,  for X86 it is added via setup_data,  if there is no such
> > > way I'm not sure if it is doable especially for passing some info for
> > > early boot use.  Then the KHO will be only for limited use cases.
> >
> >
> > Every architecture has a platform specific way of passing data into the
> > kernel so it can find its command line and initrd. S390x for example has
> > struct parmarea. To enable s390x, you would remove some of its padding
> > and replace it with a KHO base addr + size, so that the new kernel can
> > find the KHO state tree.
> 
> Ok, thanks for the info,  I cced s390 people maybe they can provide inputs.

If I understand correctly, the parmarea would be used for passing the
FDT address - which appears to be fine. However, s390 does not implement
early_memremap()/early_memunmap(), which KHO needs.

Thanks, Dave!