[v4,05/14] kexec: Add Kexec HandOver (KHO) generation helpers

Message ID 20250206132754.2596694-6-rppt@kernel.org (mailing list archive)
State New
Series kexec: introduce Kexec HandOver (KHO)

Commit Message

Mike Rapoport Feb. 6, 2025, 1:27 p.m. UTC
From: Alexander Graf <graf@amazon.com>

This patch adds the core infrastructure to generate Kexec HandOver
metadata. Kexec HandOver is a mechanism that allows Linux to preserve
state - arbitrary properties as well as memory locations - across kexec.

It does so using 2 concepts:

  1) Device Tree - Every KHO kexec carries a KHO specific flattened
     device tree blob that describes the state of the system. Device
     drivers can register to KHO to serialize their state before kexec.

  2) Scratch Regions - CMA regions that we allocate in the first kernel.
     CMA gives us the guarantee that no handover pages land in those
     regions, because handover pages must be at a static physical memory
     location. We use these regions as the place to load future kexec
     images so that they won't collide with any handover data.
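
For illustration, the generated tree initially contains little more
than a root node; the "some_driver" sub-node below is a hypothetical
example of what a driver's KEXEC_KHO_DUMP notifier could add:

/dts-v1/;
/ {
	compatible = "kho-v1";

	some_driver {
		compatible = "some_driver-v1";
		/* memory to preserve, as struct kho_mem { addr, size } */
		mem = <..>;
	};
};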

Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
---
 Documentation/ABI/testing/sysfs-kernel-kho    |  53 +++
 .../admin-guide/kernel-parameters.txt         |  24 +
 MAINTAINERS                                   |   1 +
 include/linux/cma.h                           |   2 +
 include/linux/kexec.h                         |  18 +
 include/linux/kexec_handover.h                |  10 +
 kernel/Makefile                               |   1 +
 kernel/kexec_handover.c                       | 450 ++++++++++++++++++
 mm/internal.h                                 |   3 -
 mm/mm_init.c                                  |   8 +
 10 files changed, 567 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
 create mode 100644 include/linux/kexec_handover.h
 create mode 100644 kernel/kexec_handover.c

Comments

Jason Gunthorpe Feb. 10, 2025, 8:22 p.m. UTC | #1
On Thu, Feb 06, 2025 at 03:27:45PM +0200, Mike Rapoport wrote:
> diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
> new file mode 100644
> index 000000000000..f13b252bc303
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-kho
> @@ -0,0 +1,53 @@
> +What:		/sys/kernel/kho/active
> +Date:		December 2023
> +Contact:	Alexander Graf <graf@amazon.com>
> +Description:
> +		Kexec HandOver (KHO) allows Linux to transition the state of
> +		compatible drivers into the next kexec'ed kernel. To do so,
> +		device drivers will serialize their current state into a DT.
> +		While the state is serialized, they are unable to perform
> +		any modifications to state that was serialized, such as
> +		handed over memory allocations.
> +
> +		When this file contains "1", the system is in the transition
> +		state. When it contains "0", it is not. To switch between the
> +		two states, echo the respective number into this file.

I don't think this is a great interface for the actual state machine..

> +What:		/sys/kernel/kho/dt_max
> +Date:		December 2023
> +Contact:	Alexander Graf <graf@amazon.com>
> +Description:
> +		KHO needs to allocate a buffer for the DT that gets
> +		generated before it knows the final size. By default, it
> +		will allocate 10 MiB for it. You can write to this file
> +		to modify the size of that allocation.

Seems gross, why can't it use a non-contiguous page list to generate
the FDT? :\

See below for a suggestion..

> +static int kho_serialize(void)
> +{
> +	void *fdt = NULL;
> +	int err = -ENOMEM;
> +
> +	fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
> +	if (!fdt)
> +		goto out;
> +
> +	if (fdt_create(fdt, kho_out.dt_max)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	err = fdt_finish_reservemap(fdt);
> +	if (err)
> +		goto out;
> +
> +	err = fdt_begin_node(fdt, "");
> +	if (err)
> +		goto out;
> +
> +	err = fdt_property_string(fdt, "compatible", "kho-v1");
> +	if (err)
> +		goto out;
> +
> +	/* Loop through all kho dump functions */
> +	err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
> +	err = notifier_to_errno(err);

I don't see this really working long term. I think we'd like each
component to be able to serialize at its own pace under userspace
control.

This design requires that the whole thing be wrapped in a notifier
callback just so we can make use of the fdt APIs.

It seems like a poor fit to me.

IMHO if you want to keep using FDT I suggest that each serializing
component (i.e. driver, ftrace, whatever) allocate its own FDT fragment
from scratch and the main KHO one just link to the memory that holds
those fragments.

I.e. the driver experience would be more like

 kho = kho_start_storage("my_compatible_string,v1", some_kind_of_instance_key);

 fdt...(kho->fdt..)

 kho_finish_storage(kho);

Where this ends up creating a stand alone FDT fragment:

/dts-v1/;
/ {
  compatible = "linux-kho,my_compatible_string,v1";
  instance = some_kind_of_instance_key;
  key-value-1 = <..>;
  key-value-1 = <..>;
};

And then kho_finish_storage() would remember the phys/length until the
kexec fdt is produced as the very last step.

This way we could do things like fdbox an iommufd and create the above
FDT fragment completely separately from any notifier chain and,
crucially, disconnected from the fdt_create() for the kexec payload.

Further, if you split things like this (it will waste some small
amount of memory) you can probably get to a point where no single FDT
is more than 4k. That looks like it would simplify/robustify a lot of
stuff?
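
In C, a fuller sketch of that helper pair could look roughly like the
following. Every name below is hypothetical; only the libfdt calls are
real, and error handling is abbreviated:

struct kho_storage {
	void *fdt;		/* standalone FDT fragment */
	phys_addr_t phys;	/* recorded for the final kexec tree */
};

/* Hypothetical: open a page-sized standalone fragment for one component. */
struct kho_storage *kho_start_storage(const char *compat, u64 instance)
{
	struct kho_storage *kho = kzalloc(sizeof(*kho), GFP_KERNEL);

	if (!kho)
		return ERR_PTR(-ENOMEM);

	kho->fdt = (void *)get_zeroed_page(GFP_KERNEL);
	if (!kho->fdt)
		goto err_free;

	if (fdt_create(kho->fdt, PAGE_SIZE) ||
	    fdt_finish_reservemap(kho->fdt) ||
	    fdt_begin_node(kho->fdt, "") ||
	    fdt_property_string(kho->fdt, "compatible", compat) ||
	    fdt_property_u64(kho->fdt, "instance", instance))
		goto err_free_page;

	return kho;

err_free_page:
	free_page((unsigned long)kho->fdt);
err_free:
	kfree(kho);
	return ERR_PTR(-ENOMEM);
}

/* Hypothetical: seal the fragment, remember its phys for the final tree. */
int kho_finish_storage(struct kho_storage *kho)
{
	int err = fdt_end_node(kho->fdt) ?: fdt_finish(kho->fdt);

	if (err)
		return err;

	kho->phys = virt_to_phys(kho->fdt);
	/* ...link kho->phys into the main KHO tree at kexec time... */
	return 0;
}

The driver would add its own key-value properties on kho->fdt between
the two calls, exactly as in the pseudo-code above.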

Jason
Pasha Tatashin Feb. 10, 2025, 8:58 p.m. UTC | #2
On Mon, Feb 10, 2025 at 3:22 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Feb 06, 2025 at 03:27:45PM +0200, Mike Rapoport wrote:
> > diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
> > new file mode 100644
> > index 000000000000..f13b252bc303
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-kernel-kho
> > @@ -0,0 +1,53 @@
> > +What:                /sys/kernel/kho/active
> > +Date:                December 2023
> > +Contact:     Alexander Graf <graf@amazon.com>
> > +Description:
> > +             Kexec HandOver (KHO) allows Linux to transition the state of
> > +             compatible drivers into the next kexec'ed kernel. To do so,
> > +             device drivers will serialize their current state into a DT.
> > +             While the state is serialized, they are unable to perform
> > +             any modifications to state that was serialized, such as
> > +             handed over memory allocations.
> > +
> > +             When this file contains "1", the system is in the transition
> > +             state. When it contains "0", it is not. To switch between the
> > +             two states, echo the respective number into this file.
>
> I don't think this is a great interface for the actual state machine..

In our next proposal we are going to remove this "activate" phase.

>
> > +What:                /sys/kernel/kho/dt_max
> > +Date:                December 2023
> > +Contact:     Alexander Graf <graf@amazon.com>
> > +Description:
> > +             KHO needs to allocate a buffer for the DT that gets
> > +             generated before it knows the final size. By default, it
> > +             will allocate 10 MiB for it. You can write to this file
> > +             to modify the size of that allocation.
>
> Seems gross, why can't it use a non-contiguous page list to generate
> the FDT? :\

We will consider some of these ideas in the future version. I like the
idea of using preserved memory to carry a sparse KHO tree: i.e. an FDT
over sparse memory; maybe use an anchor page to describe how it should
be vmapped into a virtually contiguous tree in the next kernel?
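
For instance, the anchor page could be little more than a physically
addressed scatter list the next kernel walks and vmap()s in order (a
purely hypothetical layout):

struct kho_anchor {
	u32 magic;		/* identifies a KHO tree */
	u32 nr_chunks;
	struct {
		phys_addr_t phys;	/* start of one tree piece */
		u32 nr_pages;
	} chunks[];		/* vmap() these back-to-back */
};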

>
> See below for a suggestion..
>
> > +static int kho_serialize(void)
> > +{
> > +     void *fdt = NULL;
> > +     int err = -ENOMEM;
> > +
> > +     fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
> > +     if (!fdt)
> > +             goto out;
> > +
> > +     if (fdt_create(fdt, kho_out.dt_max)) {
> > +             err = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     err = fdt_finish_reservemap(fdt);
> > +     if (err)
> > +             goto out;
> > +
> > +     err = fdt_begin_node(fdt, "");
> > +     if (err)
> > +             goto out;
> > +
> > +     err = fdt_property_string(fdt, "compatible", "kho-v1");
> > +     if (err)
> > +             goto out;
> > +
> > +     /* Loop through all kho dump functions */
> > +     err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
> > +     err = notifier_to_errno(err);
>
> I don't see this really working long term. I think we'd like each
> component to be able to serialize at its own pace under userspace
> control.
>
> This design requires that the whole thing be wrapped in a notifier
> callback just so we can make use of the fdt APIs.
>
> It seems like a poor fit to me.
>
> IMHO if you want to keep using FDT I suggest that each serializing
> component (i.e. driver, ftrace, whatever) allocate its own FDT fragment
> from scratch and the main KHO one just link to the memory that holds
> those fragments.
>
> I.e. the driver experience would be more like
>
>  kho = kho_start_storage("my_compatible_string,v1", some_kind_of_instance_key);
>
>  fdt...(kho->fdt..)
>
>  kho_finish_storage(kho);
>
> Where this ends up creating a stand alone FDT fragment:
>
> /dts-v1/;
> / {
>   compatible = "linux-kho,my_compatible_string,v1";
>   instance = some_kind_of_instance_key;
>   key-value-1 = <..>;
>   key-value-1 = <..>;
> };
>
> And then kho_finish_storage() would remember the phys/length until the
> kexec fdt is produced as the very last step.
>
> This way we could do things like fdbox an iommufd and create the above
> FDT fragment completely separately from any notifier chain and,
> crucially, disconnected from the fdt_create() for the kexec payload.
>
> Further, if you split things like this (it will waste some small
> amount of memory) you can probably get to a point where no single FDT
> is more than 4k. That looks like it would simplify/robustify a lot of
> stuff?
>
> Jason
>
Jason Gunthorpe Feb. 11, 2025, 12:49 p.m. UTC | #3
On Mon, Feb 10, 2025 at 03:58:00PM -0500, Pasha Tatashin wrote:
> >
> > > +What:                /sys/kernel/kho/dt_max
> > > +Date:                December 2023
> > > +Contact:     Alexander Graf <graf@amazon.com>
> > > +Description:
> > > +             KHO needs to allocate a buffer for the DT that gets
> > > +             generated before it knows the final size. By default, it
> > > +             will allocate 10 MiB for it. You can write to this file
> > > +             to modify the size of that allocation.
> >
> > Seems gross, why can't it use a non-contiguous page list to generate
> > the FDT? :\
> 
> We will consider some of these ideas in the future version. I like the
> idea of using preserved memory to carry a sparse KHO tree: i.e. an FDT
> over sparse memory; maybe use an anchor page to describe how it should
> be vmapped into a virtually contiguous tree in the next kernel?

Yeah, but this is now permanent uAPI that has to be kept forever. I
think you should not add this when there are enough ideas on how to
completely avoid it.

Jason
Pasha Tatashin Feb. 11, 2025, 4:14 p.m. UTC | #4
On Tue, Feb 11, 2025 at 7:49 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Mon, Feb 10, 2025 at 03:58:00PM -0500, Pasha Tatashin wrote:
> > >
> > > > +What:                /sys/kernel/kho/dt_max
> > > > +Date:                December 2023
> > > > +Contact:     Alexander Graf <graf@amazon.com>
> > > > +Description:
> > > > +             KHO needs to allocate a buffer for the DT that gets
> > > > +             generated before it knows the final size. By default, it
> > > > +             will allocate 10 MiB for it. You can write to this file
> > > > +             to modify the size of that allocation.
> > >
> > > Seems gross, why can't it use a non-contiguous page list to generate
> > > the FDT? :\
> >
> > We will consider some of these ideas in the future version. I like the
> > idea of using preserved memory to carry a sparse KHO tree: i.e. an FDT
> > over sparse memory; maybe use an anchor page to describe how it should
> > be vmapped into a virtually contiguous tree in the next kernel?
>
> Yeah, but this is now permanent uAPI that has to be kept forever. I

Agreed; what I meant by "in the future version" is before it gets
merged. I should have been clearer.

> think you should not add this when there are enough ideas on how to
> completely avoid it.

Thinking about it some more, I'm actually leaning towards keeping
things as they are, instead of going with a sparse FDT. With a sparse
KHO-tree, we'd be kinda trying to fix something that should be handled
higher up. All userspace preservable memory (like emulated pmem with
devdax/fsdax and also pstore for logging) can already survive cold
reboots with modified firmware; Google and Microsoft do this.
Similarly, the firmware could give the kernel the KHO-tree (generated
by firmware or from the previous kernel) to keep stuff like telemetry,
oops messages, time stamps etc. KHO should not be considered
explicitly as a mechanism to carry device serialization data; KHO
should be a standard and simple way to pass kernel data between
reboots. More complex state can be built on top of it; for example,
guestmemfs could preserve terabytes of data and have only one node in
the KHO tree.

>
> Jason
Jason Gunthorpe Feb. 11, 2025, 4:37 p.m. UTC | #5
On Tue, Feb 11, 2025 at 11:14:06AM -0500, Pasha Tatashin wrote:
> > think you should not add this when there are enough ideas on how to
> > completely avoid it.
> 
> Thinking about it some more, I'm actually leaning towards keeping
> things as they are, instead of going with a sparse FDT. 

What is a sparse FDT? My suggestion that each driver make its own FDT?

The reason for this was sequencing: we need a much more
flexible way to manage all this serialization than just a notifier
chain. The existing FDT construction process is too restrictive to
accommodate this, IMHO.

That it also resolves the weird dt_max stuff above is a nice side
effect.

> With a sparse KHO-tree, we'd be kinda trying to fix something that
> should be handled higher up. All userspace preservable memory (like
> emulated pmem with devdax/fsdax and also pstore for logging) can
> already survive cold reboots with modified firmware; Google and
> Microsoft do this.

I was hoping the VM memory wouldn't be in DAX. If you want some DAX
stuff to interact with FW, OK, but I think the design here should be
driving toward preserving memfd/guestmemfd/hugetlbfs FDs directly
and eliminating DAX-backed VMs. We won't get to CC guestmemfd with
DAX.

fdbox of a guestmemfd, for instance.

To do that you need to preserve folios as the basic primitive.

> Similarly, the firmware could give the kernel the KHO-tree (generated
> by firmware or from the previous kernel) to keep stuff like telemetry,
> oops messages, time stamps etc. 

Commingling things like this feels like a mistake. KHO is complex
enough; it should stay focused on its thing..

Jason
Thomas Weißschuh Feb. 12, 2025, 12:29 p.m. UTC | #6
On 2025-02-06 15:27:45+0200, Mike Rapoport wrote:
> From: Alexander Graf <graf@amazon.com>
> 
> This patch adds the core infrastructure to generate Kexec HandOver
> metadata. Kexec HandOver is a mechanism that allows Linux to preserve
> state - arbitrary properties as well as memory locations - across kexec.
> 
> It does so using 2 concepts:
> 
>   1) Device Tree - Every KHO kexec carries a KHO specific flattened
>      device tree blob that describes the state of the system. Device
>      drivers can register to KHO to serialize their state before kexec.
> 
>   2) Scratch Regions - CMA regions that we allocate in the first kernel.
>      CMA gives us the guarantee that no handover pages land in those
>      regions, because handover pages must be at a static physical memory
>      location. We use these regions as the place to load future kexec
>      images so that they won't collide with any handover data.
> 
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  Documentation/ABI/testing/sysfs-kernel-kho    |  53 +++
>  .../admin-guide/kernel-parameters.txt         |  24 +
>  MAINTAINERS                                   |   1 +
>  include/linux/cma.h                           |   2 +
>  include/linux/kexec.h                         |  18 +
>  include/linux/kexec_handover.h                |  10 +
>  kernel/Makefile                               |   1 +
>  kernel/kexec_handover.c                       | 450 ++++++++++++++++++
>  mm/internal.h                                 |   3 -
>  mm/mm_init.c                                  |   8 +
>  10 files changed, 567 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-kho
>  create mode 100644 include/linux/kexec_handover.h
>  create mode 100644 kernel/kexec_handover.c

<snip>

> --- /dev/null
> +++ b/include/linux/kexec_handover.h
> @@ -0,0 +1,10 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef LINUX_KEXEC_HANDOVER_H
> +#define LINUX_KEXEC_HANDOVER_H

#include <linux/types.h>

> +
> +struct kho_mem {
> +	phys_addr_t addr;
> +	phys_addr_t size;
> +};
> +
> +#endif /* LINUX_KEXEC_HANDOVER_H */

<snip>

> +static ssize_t dt_read(struct file *file, struct kobject *kobj,
> +		       struct bin_attribute *attr, char *buf,

Please make the bin_attribute argument const. Currently both work, but
the non-const variant will go away.
This way I can test my stuff on linux-next.

> +		       loff_t pos, size_t count)
> +{
> +	mutex_lock(&kho_out.lock);
> +	memcpy(buf, attr->private + pos, count);
> +	mutex_unlock(&kho_out.lock);
> +
> +	return count;
> +}
> +
> +struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);

The new __BIN_ATTR_ADMIN_RO() could make this slightly shorter.
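
i.e., roughly the following, assuming the constified sysfs API in
linux-next (the attribute itself stays non-const here because
kho_expose_dt() fills in .size and .private at runtime):

static ssize_t dt_read(struct file *file, struct kobject *kobj,
		       const struct bin_attribute *attr, char *buf,
		       loff_t pos, size_t count)
{
	mutex_lock(&kho_out.lock);
	memcpy(buf, attr->private + pos, count);
	mutex_unlock(&kho_out.lock);

	return count;
}

struct bin_attribute bin_attr_dt_kern = __BIN_ATTR_ADMIN_RO(dt, 0);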

<snip>
Jason Gunthorpe Feb. 12, 2025, 3:23 p.m. UTC | #7
On Tue, Feb 11, 2025 at 12:37:20PM -0400, Jason Gunthorpe wrote:

> To do that you need to preserve folios as the basic primitive.

I made a small sketch of what I suggest.

I imagine the FDT schema for this would look something like this:

/dts-v1/;
/ {
  compatible = "linux-kho,v1";
  phys-addr-size = 64;
  void-p-size = 64;
  preserved-folio-map = <phys_addr>;

  // The per "driver" storage
  instance@1 {..};
  instance@2 {..};
};

I think this is a lot better than what is in this series. It uses much
less memory when there are a lot of allocations, it supports any order
folios, it is efficient for 1G guestmemfd folios, and it only needs a
few bytes in the FDT. It could preserve and restore the high order
folio struct page folding (HVO).

The use cases I'm imagining for drivers would be pushing gigabytes of
memory into this preservation mechanism. It needs to be scalable!

This also illustrates my point that I don't think FDT is a good
representation to use exclusively. This in-memory structure is much
better and faster than trying to represent the same information
embedded directly into the FDT. I imagine this to be the general
pattern that drivers will want to use: a few bytes in the FDT pointing
at a scalable in-memory structure for the bulk of the data.
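
Concretely, the producing side would then emit a single u64 into the
kexec FDT; a sketch, pairing the kho_serialize() from the code below
with libfdt's sequential-write helper:

	phys_addr_t folio_map_phys;

	err = kho_serialize(&tracker, &folio_map_phys);
	if (!err)
		err = fdt_property_u64(fdt, "preserved-folio-map",
				       folio_map_phys);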

/*
 * Keep track of folio memory that is to be preserved across KHO.
 *
 * This is designed with the idea that the system will have a lot of memory,
 * e.g. 1TB, and the majority of it will be ~1G folios assigned to a
 * hugetlb/etc being used to back guest memory. This would leave a smaller
 * amount of memory, e.g. 16G, reserved for the hypervisor to use. The pages
 * to preserve across KHO would be randomly distributed over the hypervisor
 * memory. The hypervisor memory is not required to be contiguous.
 *
 * This approach is fully incremental; as the serialization progresses, folios
 * can continue to be aggregated to the tracker. The final step, immediately prior
 * to kexec would serialize the xarray information into a linked list for the
 * successor kernel to parse.
 *
 * The serializing side uses two levels of xarrays to manage chunks of per-order
 * 512 byte bitmaps. For instance the entire 1G order of a 1TB system would fit
 * inside a single 512 byte bitmap. For order 0 allocations each bitmap will
 * cover 16M of address space. Thus, for 16G of hypervisor memory at most 512K
 * of bitmap memory will be needed for order 0.
 */
struct kho_mem_track {
	/* Points to kho_mem_phys, each order gets its own bitmap tree */
	struct xarray orders;
};

struct kho_mem_phys {
	/*
	 * Points to kho_mem_phys_bits, a sparse bitmap array. Each bit is sized
	 * to order.
	 */
	struct xarray phys_bits;
};

#define PRESERVE_BITS (512 * 8)
struct kho_mem_phys_bits {
	DECLARE_BITMAP(preserve, PRESERVE_BITS);
};

static void *
xa_load_or_alloc(struct xarray *xa, unsigned long index, size_t elmsz)
{
	void *elm;
	void *res;

	elm = xa_load(xa, index);
	if (elm)
		return elm;

	elm = kzalloc(elmsz, GFP_KERNEL);
	if (!elm)
		return ERR_PTR(-ENOMEM);
	res = xa_cmpxchg(xa, index, NULL, elm, GFP_KERNEL);
	if (xa_is_err(res)) {
		kfree(elm);
		return ERR_PTR(xa_err(res));
	}
	if (res != NULL) {
		kfree(elm);
		return res;
	}
	return elm;
}

/*
 * Record that the entire folio under virt is preserved across KHO. virt must
 * have come from alloc_pages/folio_alloc or similar and point to the first page
 * of the folio. The order will be preserved as well.
 */
int kho_preserve_folio(struct kho_mem_track *tracker, void *virt)
{
	struct folio *folio = virt_to_folio(virt);
	unsigned int order = folio_order(folio);
	phys_addr_t phys = virt_to_phys(virt);
	struct kho_mem_phys_bits *bits;
	struct kho_mem_phys *physxa;

	might_sleep();

	physxa = xa_load_or_alloc(&tracker->orders, order, sizeof(*physxa));
	if (IS_ERR(physxa))
		return PTR_ERR(physxa);

	phys >>= PAGE_SHIFT + order;
	static_assert(sizeof(phys_addr_t) <= sizeof(unsigned long));
	bits = xa_load_or_alloc(&physxa->phys_bits, phys / PRESERVE_BITS,
				sizeof(*bits));
	if (IS_ERR(bits))
		return PTR_ERR(bits);

	set_bit(phys % PRESERVE_BITS, bits->preserve);
	return 0;
}
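
/*
 * Hypothetical usage, preserving one 2M (order-9 on 4k pages) allocation:
 *
 *	struct folio *folio = folio_alloc(GFP_KERNEL, 9);
 *
 *	if (!folio)
 *		return -ENOMEM;
 *	err = kho_preserve_folio(&tracker, folio_address(folio));
 */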

#define KHOSER_PTR(type)  union {phys_addr_t phys; type ptr;}
#define KHOSER_STORE_PTR(dest, val)                 \
	({                                          \
		typecheck(typeof((dest).ptr), val); \
		(dest).phys = virt_to_phys(val);    \
	})
#define KHOSER_LOAD_PTR(src) ((typeof((src).ptr))(phys_to_virt((src).phys)))

struct khoser_mem_bitmap_ptr {
	phys_addr_t phys_start;
	KHOSER_PTR(struct kho_mem_phys_bits *) bitmap;
};

struct khoser_mem_chunk {
	unsigned int order;
	unsigned int num_elms;
	KHOSER_PTR(struct khoser_mem_chunk *) next;
	struct khoser_mem_bitmap_ptr
		bitmaps[(PAGE_SIZE - 16) / sizeof(struct khoser_mem_bitmap_ptr)];
};
static_assert(sizeof(struct khoser_mem_chunk) == PAGE_SIZE);

static int new_chunk(struct khoser_mem_chunk **cur_chunk)
{
	struct khoser_mem_chunk *chunk;

	chunk = kzalloc(sizeof(*chunk), GFP_KERNEL);
	if (!chunk)
		return -ENOMEM;
	if (*cur_chunk)
		KHOSER_STORE_PTR((*cur_chunk)->next, chunk);
	*cur_chunk = chunk;
	return 0;
}

/*
 * Record all the bitmaps in a linked list of pages for the next kernel to
 * process. Each chunk holds bitmaps of the same order and each block of bitmaps
 * starts at a given physical address. This allows the bitmaps to be sparse. The
 * xarray is used to store them in a tree while building up the data structure,
 * but the KHO successor kernel only needs to process them once in order.
 *
 * All of this memory is normal kmalloc() memory and is not marked for
 * preservation. The successor kernel will remain isolated to the scratch space
 * until it completes processing this list. Once processed all the memory
 * storing these ranges will be marked as free.
 */
int kho_serialize(struct kho_mem_track *tracker, phys_addr_t *fdt_value)
{
	struct khoser_mem_chunk *first_chunk = NULL;
	struct khoser_mem_chunk *chunk = NULL;
	struct kho_mem_phys *physxa;
	unsigned long order;
	int ret;

	xa_for_each(&tracker->orders, order, physxa) {
		struct kho_mem_phys_bits *bits;
		unsigned long phys;

		ret = new_chunk(&chunk);
		if (ret)
			goto err_free;
		if (!first_chunk)
			first_chunk = chunk;
		chunk->order = order;

		xa_for_each(&physxa->phys_bits, phys, bits) {
			struct khoser_mem_bitmap_ptr *elm;

			if (chunk->num_elms == ARRAY_SIZE(chunk->bitmaps)) {
				ret = new_chunk(&chunk);
				if (ret)
					goto err_free;
			}

			elm = &chunk->bitmaps[chunk->num_elms];
			chunk->num_elms++;
			elm->phys_start = (phys * PRESERVE_BITS) << (order + PAGE_SHIFT);
			KHOSER_STORE_PTR(elm->bitmap, bits);
		}
	}
	*fdt_value = virt_to_phys(first_chunk);
	return 0;
err_free:
	chunk = first_chunk;
	while (chunk) {
		struct khoser_mem_chunk *tmp = chunk;
		chunk = KHOSER_LOAD_PTR(chunk->next);
		kfree(tmp);
	}
	return ret;
}

static void preserve_bitmap(unsigned int order,
			    struct khoser_mem_bitmap_ptr *elm)
{
	struct kho_mem_phys_bits *bitmap = KHOSER_LOAD_PTR(elm->bitmap);
	unsigned int bit;

	for_each_set_bit(bit, bitmap->preserve, PRESERVE_BITS) {
		phys_addr_t phys =
			elm->phys_start + (bit << (order + PAGE_SHIFT));

		// Do the struct page stuff..
	}
}

void kho_deserialize(phys_addr_t fdt_value)
{
	struct khoser_mem_chunk *chunk = phys_to_virt(fdt_value);

	while (chunk) {
		unsigned int i;

		for (i = 0; i != chunk->num_elms; i++)
			preserve_bitmap(chunk->order, &chunk->bitmaps[i]);
		chunk = KHOSER_LOAD_PTR(chunk->next);
	}
}
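
On the successor side, the only FDT parsing left is recovering that one
property before walking the chunk list (again a sketch, assuming the
"preserved-folio-map" schema above):

	int len;
	const fdt64_t *prop;

	prop = fdt_getprop(fdt, 0, "preserved-folio-map", &len);
	if (prop && len == sizeof(*prop))
		kho_deserialize(fdt64_to_cpu(*prop));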
Mike Rapoport Feb. 12, 2025, 4:39 p.m. UTC | #8
Hi Jason,

On Wed, Feb 12, 2025 at 11:23:36AM -0400, Jason Gunthorpe wrote:
> On Tue, Feb 11, 2025 at 12:37:20PM -0400, Jason Gunthorpe wrote:
> 
> > To do that you need to preserve folios as the basic primitive.
> 
> I made a small sketch of what I suggest.
> 
> I imagine the FDT schema for this would look something like this:
> 
> /dts-v1/;
> / {
>   compatible = "linux-kho,v1";
>   phys-addr-size = 64;
>   void-p-size = 64;
>   preserved-folio-map = <phys_addr>;
> 
>   // The per "driver" storage
>   instance@1 {..};
>   instance@2 {..};
> };
> 
> I think this is a lot better than what is in this series. It uses much
> less memory when there are a lot of allocations, it supports any order
> folios, it is efficient for 1G guestmemfd folios, and it only needs a
> few bytes in the FDT. It could preserve and restore the high order
> folio struct page folding (HVO).
> 
> The use cases I'm imagining for drivers would be pushing gigabytes of
> memory into this preservation mechanism. It needs to be scalable!
> 
> This also illustrates my point that I don't think FDT is a good
> representation to use exclusively. This in-memory structure is much
> better and faster than trying to represent the same information
> embedded directly into the FDT. I imagine this to be the general
> pattern that drivers will want to use: a few bytes in the FDT pointing
> at a scalable in-memory structure for the bulk of the data.

As I've mentioned off-list earlier, KHO in its current form is the lowest
level of abstraction for state preservation and it is by no means
intended to provide complex drivers with all the tools necessary.

Its sole purpose is to allow preserving simple properties and ensure that
memory ranges KHO clients need to preserve won't be overwritten.

What you propose is a great optimization of the memory preservation
mechanism, and an additional, very useful abstraction layer on top of
"basic KHO"!

But I think it will be easier to start with something *very simple* and
probably suboptimal and then extend it, rather than try to build a
complex, comprehensive solution from day one.
Jason Gunthorpe Feb. 12, 2025, 5:43 p.m. UTC | #9
On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:

> As I've mentioned off-list earlier, KHO in its current form is the lowest
> level of abstraction for state preservation and it is by no means is
> intended to provide complex drivers with all the tools necessary.

My point is: I think it is the wrong level of abstraction and the
wrong FDT schema. It does not and cannot solve the problems we know we
will have, so why invest anything into that schema?

I think the scratch system is great, and an amazing improvement over
past versions. Upgrade the memory preservation to match and it will be
really good.

> What you propose is a great optimization for memory preservation mechanism,
> and additional and very useful abstraction layer on top of "basic KHO"!

I do not see this as a layer on top; I see it as fundamentally
replacing the memory preservation mechanism with something more
scalable.

> But I think it will be easier to start with something *very simple* and
> probably suboptimal and then extend it rather than to try to build complex
> comprehensive solution from day one.

But why? Just do it right from the start? I spent like an hour
sketching that, the existing preservation code is also very simple,
why not just fix it right now?

Jason

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
new file mode 100644
index 000000000000..f13b252bc303
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-kho
@@ -0,0 +1,53 @@ 
+What:		/sys/kernel/kho/active
+Date:		December 2023
+Contact:	Alexander Graf <graf@amazon.com>
+Description:
+		Kexec HandOver (KHO) allows Linux to transition the state of
+		compatible drivers into the next kexec'ed kernel. To do so,
+		device drivers will serialize their current state into a DT.
+		While the state is serialized, they are unable to perform
+		any modifications to state that was serialized, such as
+		handed over memory allocations.
+
+		When this file contains "1", the system is in the transition
+		state. When it contains "0", it is not. To switch between the
+		two states, echo the respective number into this file.
+
+What:		/sys/kernel/kho/dt_max
+Date:		December 2023
+Contact:	Alexander Graf <graf@amazon.com>
+Description:
+		KHO needs to allocate a buffer for the DT that gets
+		generated before it knows the final size. By default, it
+		will allocate 10 MiB for it. You can write to this file
+		to modify the size of that allocation.
+
+What:		/sys/kernel/kho/dt
+Date:		December 2023
+Contact:	Alexander Graf <graf@amazon.com>
+Description:
+		When KHO is active, the kernel exposes the generated DT that
+		carries its current KHO state in this file. Kexec user space
+		tooling can use this as input file for the KHO payload image.
+
+What:		/sys/kernel/kho/scratch_len
+Date:		December 2023
+Contact:	Alexander Graf <graf@amazon.com>
+Description:
+		To support continuous KHO kexecs, we need to reserve
+		physically contiguous memory regions that will always stay
+		available for future kexec allocations. This file describes
+		the length of these memory regions. Kexec user space tooling
+		can use this to determine where it should place its payload
+		images.
+
+What:		/sys/kernel/kho/scratch_phys
+Date:		December 2023
+Contact:	Alexander Graf <graf@amazon.com>
+Description:
+		To support continuous KHO kexecs, we need to reserve
+		physically contiguous memory regions that will always stay
+		available for future kexec allocations. This file describes
+		the physical location of these memory regions. Kexec user space
+		tooling can use this to determine where it should place its
+		payload images.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb8752b42ec8..ed656e2fb05e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2698,6 +2698,30 @@ 
 	kgdbwait	[KGDB,EARLY] Stop kernel execution and enter the
 			kernel debugger at the earliest opportunity.
 
+	kho=		[KEXEC,EARLY]
+			Format: { "0" | "1" | "off" | "on" | "y" | "n" }
+			Enables or disables Kexec HandOver.
+			"0" | "off" | "n" - kexec handover is disabled
+			"1" | "on" | "y" - kexec handover is enabled
+
+	kho_scratch=	[KEXEC,EARLY]
+			Format: nn[KMG],mm[KMG] | nn%
+			Defines the size of the KHO scratch region. The KHO
+			scratch regions are physically contiguous memory
+			ranges that can only be used for non-kernel
+			allocations. That way, even when memory is heavily
+			fragmented with handed over memory, the kexeced
+			kernel will always have enough contiguous ranges to
+			bootstrap itself.
+
+			It is possible to specify the exact amount of
+			memory in the form of "nn[KMG],mm[KMG]" where the
+			first parameter defines the size of a global
+			scratch area and the second parameter defines the
+			size of additional per-node scratch areas.
+			The form "nn%" defines a scale factor (in percent)
+			of the memory that was used during boot.
+
 	kmac=		[MIPS] Korina ethernet MAC address.
 			Configure the RouterBoard 532 series on-chip
 			Ethernet adapter MAC address.
diff --git a/MAINTAINERS b/MAINTAINERS
index 896a307fa065..8327795e8899 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12826,6 +12826,7 @@  M:	Eric Biederman <ebiederm@xmission.com>
 L:	kexec@lists.infradead.org
 S:	Maintained
 W:	http://kernel.org/pub/linux/utils/kernel/kexec/
+F:	Documentation/ABI/testing/sysfs-kernel-kho
 F:	include/linux/kexec.h
 F:	include/uapi/linux/kexec.h
 F:	kernel/kexec*
diff --git a/include/linux/cma.h b/include/linux/cma.h
index d15b64f51336..828a3c17504b 100644
--- a/include/linux/cma.h
+++ b/include/linux/cma.h
@@ -56,6 +56,8 @@  extern void cma_reserve_pages_on_error(struct cma *cma);
 #ifdef CONFIG_CMA
 struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp);
 bool cma_free_folio(struct cma *cma, const struct folio *folio);
+/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
+void init_cma_reserved_pageblock(struct page *page);
 #else
 static inline struct folio *cma_alloc_folio(struct cma *cma, int order, gfp_t gfp)
 {
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index f0e9f8eda7a3..ef5c90abafd1 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -483,6 +483,24 @@  void set_kexec_sig_enforced(void);
 static inline void set_kexec_sig_enforced(void) {}
 #endif
 
+/* KHO Notifier index */
+enum kho_event {
+	KEXEC_KHO_DUMP = 0,
+	KEXEC_KHO_ABORT = 1,
+};
+
+struct notifier_block;
+
+#ifdef CONFIG_KEXEC_HANDOVER
+int register_kho_notifier(struct notifier_block *nb);
+int unregister_kho_notifier(struct notifier_block *nb);
+void kho_memory_init(void);
+#else
+static inline int register_kho_notifier(struct notifier_block *nb) { return 0; }
+static inline int unregister_kho_notifier(struct notifier_block *nb) { return 0; }
+static inline void kho_memory_init(void) {}
+#endif /* CONFIG_KEXEC_HANDOVER */
+
 #endif /* !defined(__ASSEBMLY__) */
 
 #endif /* LINUX_KEXEC_H */
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
new file mode 100644
index 000000000000..c4b0aab823dc
--- /dev/null
+++ b/include/linux/kexec_handover.h
@@ -0,0 +1,10 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_KEXEC_HANDOVER_H
+#define LINUX_KEXEC_HANDOVER_H
+
+struct kho_mem {
+	phys_addr_t addr;
+	phys_addr_t size;
+};
+
+#endif /* LINUX_KEXEC_HANDOVER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..cef5377c25cd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -75,6 +75,7 @@  obj-$(CONFIG_CRASH_DUMP) += crash_core.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
new file mode 100644
index 000000000000..eccfe3a25798
--- /dev/null
+++ b/kernel/kexec_handover.c
@@ -0,0 +1,450 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_handover.c - kexec handover metadata processing
+ * Copyright (C) 2023 Alexander Graf <graf@amazon.com>
+ * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org>
+ */
+
+#define pr_fmt(fmt) "KHO: " fmt
+
+#include <linux/cma.h>
+#include <linux/kexec.h>
+#include <linux/sysfs.h>
+#include <linux/libfdt.h>
+#include <linux/memblock.h>
+#include <linux/notifier.h>
+#include <linux/kexec_handover.h>
+#include <linux/page-isolation.h>
+
+static bool kho_enable __ro_after_init;
+
+static int __init kho_parse_enable(char *p)
+{
+	return kstrtobool(p, &kho_enable);
+}
+early_param("kho", kho_parse_enable);
+
+/*
+ * With KHO enabled, memory can become fragmented because KHO regions may
+ * be anywhere in physical address space. The scratch regions give us
+ * safe zones that we will never see KHO allocations from. This is where we
+ * can later safely load our new kexec images and then use the scratch
+ * area for early allocations that happen before the page allocator is
+ * initialized.
+ */
+static struct kho_mem *kho_scratch;
+static unsigned int kho_scratch_cnt;
+
+struct kho_out {
+	struct blocking_notifier_head chain_head;
+	struct kobject *kobj;
+	struct mutex lock;
+	void *dt;
+	u64 dt_len;
+	u64 dt_max;
+	bool active;
+};
+
+static struct kho_out kho_out = {
+	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
+	.lock = __MUTEX_INITIALIZER(kho_out.lock),
+	.dt_max = 10 * SZ_1M,
+};
+
+int register_kho_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(register_kho_notifier);
+
+int unregister_kho_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+
+static ssize_t dt_read(struct file *file, struct kobject *kobj,
+		       struct bin_attribute *attr, char *buf,
+		       loff_t pos, size_t count)
+{
+	mutex_lock(&kho_out.lock);
+	memcpy(buf, attr->private + pos, count);
+	mutex_unlock(&kho_out.lock);
+
+	return count;
+}
+
+struct bin_attribute bin_attr_dt_kern = __BIN_ATTR(dt, 0400, dt_read, NULL, 0);
+
+static int kho_expose_dt(void *fdt)
+{
+	long fdt_len = fdt_totalsize(fdt);
+	int err;
+
+	kho_out.dt = fdt;
+	kho_out.dt_len = fdt_len;
+
+	bin_attr_dt_kern.size = fdt_totalsize(fdt);
+	bin_attr_dt_kern.private = fdt;
+	err = sysfs_create_bin_file(kho_out.kobj, &bin_attr_dt_kern);
+
+	return err;
+}
+
+static void kho_abort(void)
+{
+	if (!kho_out.active)
+		return;
+
+	sysfs_remove_bin_file(kho_out.kobj, &bin_attr_dt_kern);
+
+	kvfree(kho_out.dt);
+	kho_out.dt = NULL;
+	kho_out.dt_len = 0;
+
+	blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_ABORT, NULL);
+
+	kho_out.active = false;
+}
+
+static int kho_serialize(void)
+{
+	void *fdt = NULL;
+	int err = -ENOMEM;
+
+	fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
+	if (!fdt)
+		goto out;
+
+	if (fdt_create(fdt, kho_out.dt_max)) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = fdt_finish_reservemap(fdt);
+	if (err)
+		goto out;
+
+	err = fdt_begin_node(fdt, "");
+	if (err)
+		goto out;
+
+	err = fdt_property_string(fdt, "compatible", "kho-v1");
+	if (err)
+		goto out;
+
+	/* Loop through all kho dump functions */
+	err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
+	err = notifier_to_errno(err);
+	if (err)
+		goto out;
+
+	/* Close / */
+	err =  fdt_end_node(fdt);
+	if (err)
+		goto out;
+
+	err = fdt_finish(fdt);
+	if (err)
+		goto out;
+
+	if (WARN_ON(fdt_check_header(fdt))) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	err = kho_expose_dt(fdt);
+
+out:
+	if (err) {
+		pr_err("failed to serialize state: %d\n", err);
+		kho_abort();
+	}
+	return err;
+}
+
+/* Handling for /sys/kernel/kho */
+
+#define KHO_ATTR_RO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_RO_MODE(_name, 0400)
+#define KHO_ATTR_RW(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_RW_MODE(_name, 0600)
+
+static ssize_t active_store(struct kobject *dev, struct kobj_attribute *attr,
+			    const char *buf, size_t size)
+{
+	ssize_t retsize = size;
+	bool val = false;
+	int ret;
+
+	if (kstrtobool(buf, &val) < 0)
+		return -EINVAL;
+
+	if (!kho_enable)
+		return -EOPNOTSUPP;
+	if (!kho_scratch_cnt)
+		return -ENOMEM;
+
+	mutex_lock(&kho_out.lock);
+	if (val != kho_out.active) {
+		if (val) {
+			ret = kho_serialize();
+			if (ret) {
+				retsize = -EINVAL;
+				goto out;
+			}
+			kho_out.active = true;
+		} else {
+			kho_abort();
+		}
+	}
+
+out:
+	mutex_unlock(&kho_out.lock);
+	return retsize;
+}
+
+static ssize_t active_show(struct kobject *dev, struct kobj_attribute *attr,
+			   char *buf)
+{
+	ssize_t ret;
+
+	mutex_lock(&kho_out.lock);
+	ret = sysfs_emit(buf, "%d\n", kho_out.active);
+	mutex_unlock(&kho_out.lock);
+
+	return ret;
+}
+KHO_ATTR_RW(active);
+
+static ssize_t dt_max_store(struct kobject *dev, struct kobj_attribute *attr,
+			    const char *buf, size_t size)
+{
+	u64 val;
+
+	if (kstrtoull(buf, 0, &val))
+		return -EINVAL;
+
+	/* FDT already exists, it's too late to change dt_max */
+	if (kho_out.dt_len)
+		return -EBUSY;
+
+	kho_out.dt_max = val;
+
+	return size;
+}
+
+static ssize_t dt_max_show(struct kobject *dev, struct kobj_attribute *attr,
+			   char *buf)
+{
+	return sysfs_emit(buf, "0x%llx\n", kho_out.dt_max);
+}
+KHO_ATTR_RW(dt_max);
+
+static ssize_t scratch_len_show(struct kobject *dev, struct kobj_attribute *attr,
+				char *buf)
+{
+	ssize_t count = 0;
+
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		count += sysfs_emit_at(buf, count, "0x%llx\n", kho_scratch[i].size);
+
+	return count;
+}
+KHO_ATTR_RO(scratch_len);
+
+static ssize_t scratch_phys_show(struct kobject *dev, struct kobj_attribute *attr,
+				 char *buf)
+{
+	ssize_t count = 0;
+
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		count += sysfs_emit_at(buf, count, "0x%llx\n", kho_scratch[i].addr);
+
+	return count;
+}
+KHO_ATTR_RO(scratch_phys);
+
+static const struct attribute *kho_out_attrs[] = {
+	&active_attr.attr,
+	&dt_max_attr.attr,
+	&scratch_phys_attr.attr,
+	&scratch_len_attr.attr,
+	NULL,
+};
+
+static __init int kho_out_sysfs_init(void)
+{
+	int err;
+
+	kho_out.kobj = kobject_create_and_add("kho", kernel_kobj);
+	if (!kho_out.kobj)
+		return -ENOMEM;
+
+	err = sysfs_create_files(kho_out.kobj, kho_out_attrs);
+	if (err)
+		goto err_put_kobj;
+
+	return 0;
+
+err_put_kobj:
+	kobject_put(kho_out.kobj);
+	return err;
+}
+
+static __init int kho_init(void)
+{
+	int err;
+
+	if (!kho_enable)
+		return -EINVAL;
+
+	err = kho_out_sysfs_init();
+	if (err)
+		return err;
+
+	for (int i = 0; i < kho_scratch_cnt; i++) {
+		unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr);
+		unsigned long count = kho_scratch[i].size >> PAGE_SHIFT;
+		unsigned long pfn;
+
+		for (pfn = base_pfn; pfn < base_pfn + count;
+		     pfn += pageblock_nr_pages)
+			init_cma_reserved_pageblock(pfn_to_page(pfn));
+	}
+
+	return 0;
+}
+late_initcall(kho_init);
+
+/*
+ * The scratch areas are scaled by default as percent of memory allocated from
+ * memblock. A user can override the scale with the command line parameter:
+ *
+ * kho_scratch=N%
+ *
+ * It is also possible to explicitly define sizes for the global and
+ * per-node scratch areas:
+ *
+ * kho_scratch=n[KMG],m[KMG]
+ *
+ * The explicit size definition takes precedence over the scale definition.
+ */
+static unsigned int scratch_scale __initdata = 200;
+static phys_addr_t scratch_size_global __initdata;
+static phys_addr_t scratch_size_pernode __initdata;
+
+static int __init kho_parse_scratch_size(char *p)
+{
+	unsigned long size, size_pernode;
+	char *endptr, *oldp = p;
+
+	if (!p)
+		return -EINVAL;
+
+	size = simple_strtoul(p, &endptr, 0);
+	if (*endptr == '%') {
+		scratch_scale = size;
+		pr_notice("scratch scale is %d percent\n", scratch_scale);
+	} else {
+		size = memparse(p, &p);
+		if (!size || p == oldp)
+			return -EINVAL;
+
+		if (*p != ',')
+			return -EINVAL;
+
+		size_pernode = memparse(p + 1, &p);
+		if (!size_pernode)
+			return -EINVAL;
+
+		scratch_size_global = size;
+		scratch_size_pernode = size_pernode;
+		scratch_scale = 0;
+
+		pr_notice("scratch areas: global: %lluMB pernode: %lldMB\n",
+			  (u64)(scratch_size_global >> 20),
+			  (u64)(scratch_size_pernode >> 20));
+	}
+
+	return 0;
+}
+early_param("kho_scratch", kho_parse_scratch_size);
+
+static phys_addr_t __init scratch_size(int nid)
+{
+	phys_addr_t size;
+
+	if (scratch_scale) {
+		size = memblock_reserved_kern_size(nid) * scratch_scale / 100;
+	} else {
+		if (numa_valid_node(nid))
+			size = scratch_size_pernode;
+		else
+			size = scratch_size_global;
+	}
+
+	return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+/**
+ * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec
+ *
+ * With KHO we can preserve arbitrary pages in the system. To ensure we still
+ * have a large contiguous region of memory when we search the physical address
+ * space for target memory, let's make sure we always have a large CMA region
+ * active. This CMA region will only be used for movable pages, which are not a
+ * problem for us during KHO because we can just move them somewhere else.
+ */
+static void kho_reserve_scratch(void)
+{
+	phys_addr_t addr, size;
+	int nid, i = 1;
+
+	if (!kho_enable)
+		return;
+
+	/* FIXME: deal with node hot-plug/remove */
+	kho_scratch_cnt = num_online_nodes() + 1;
+	size = kho_scratch_cnt * sizeof(*kho_scratch);
+	kho_scratch = memblock_alloc(size, PAGE_SIZE);
+	if (!kho_scratch)
+		goto err_disable_kho;
+
+	/* reserve large contiguous area for allocations without nid */
+	size = scratch_size(NUMA_NO_NODE);
+	addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+	if (!addr)
+		goto err_free_scratch_desc;
+
+	kho_scratch[0].addr = addr;
+	kho_scratch[0].size = size;
+
+	for_each_online_node(nid) {
+		size = scratch_size(nid);
+		addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
+						0, MEMBLOCK_ALLOC_ACCESSIBLE,
+						nid, true);
+		if (!addr)
+			goto err_free_scratch_areas;
+
+		kho_scratch[i].addr = addr;
+		kho_scratch[i].size = size;
+		i++;
+	}
+
+	return;
+
+err_free_scratch_areas:
+	for (i--; i >= 0; i--)
+		memblock_phys_free(kho_scratch[i].addr, kho_scratch[i].size);
+err_free_scratch_desc:
+	memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch));
+err_disable_kho:
+	kho_enable = false;
+}
+
+void __init kho_memory_init(void)
+{
+	kho_reserve_scratch();
+}
diff --git a/mm/internal.h b/mm/internal.h
index 986ad9c2a8b2..fdd379fddf6d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -841,9 +841,6 @@  int
 isolate_migratepages_range(struct compact_control *cc,
 			   unsigned long low_pfn, unsigned long end_pfn);
 
-/* Free whole pageblock and set its migration type to MIGRATE_CMA. */
-void init_cma_reserved_pageblock(struct page *page);
-
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 
 int find_suitable_fallback(struct free_area *area, unsigned int order,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 04441c258b05..60f08930e434 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -30,6 +30,7 @@ 
 #include <linux/crash_dump.h>
 #include <linux/execmem.h>
 #include <linux/vmstat.h>
+#include <linux/kexec.h>
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -2661,6 +2662,13 @@  void __init mm_core_init(void)
 	report_meminit();
 	kmsan_init_shadow();
 	stack_depot_early_init();
+
+	/*
+	 * KHO memory setup must happen while memblock is still active, but
+	 * as close as possible to buddy initialization
+	 */
+	kho_memory_init();
+
 	mem_init();
 	kmem_cache_init();
 	/*