
[v5,07/16] kexec: add Kexec HandOver (KHO) generation helpers

Message ID 20250320015551.2157511-8-changyuanl@google.com (mailing list archive)
State New
Series kexec: introduce Kexec HandOver (KHO)

Commit Message

Changyuan Lyu March 20, 2025, 1:55 a.m. UTC
From: Alexander Graf <graf@amazon.com>

Add the core infrastructure to generate Kexec HandOver metadata. Kexec
HandOver is a mechanism that allows Linux to preserve state - arbitrary
properties as well as memory locations - across kexec.

It does so using 2 concepts:

  1) State Tree - Every KHO kexec carries a state tree that describes the
     state of the system. The state tree is represented as hash-tables.
     Device drivers can add/remove their data into/from the state tree at
     system runtime. On kexec, the tree is converted to FDT (flattened
     device tree).

  2) Scratch Regions - CMA regions that we allocate in the first kernel.
     CMA gives us the guarantee that no handover pages land in those
     regions, because handover pages must be at a static physical memory
     location. We use these regions as the place to load future kexec
     images so that they won't collide with any handover data.
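
For example (illustrative values; both parameters are introduced by this
patch), the first kernel can be booted with:

    kho=1 kho_scratch=25%

or with explicit lowmem, global and per-node scratch sizes using
kho_scratch=l[KMG],n[KMG],m[KMG].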

Signed-off-by: Alexander Graf <graf@amazon.com>
Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Co-developed-by: Changyuan Lyu <changyuanl@google.com>
Signed-off-by: Changyuan Lyu <changyuanl@google.com>
---
 MAINTAINERS                    |   2 +-
 include/linux/kexec_handover.h | 109 +++++
 kernel/Makefile                |   1 +
 kernel/kexec_handover.c        | 865 +++++++++++++++++++++++++++++++++
 mm/mm_init.c                   |   8 +
 5 files changed, 984 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kexec_handover.h
 create mode 100644 kernel/kexec_handover.c

Comments

Jason Gunthorpe March 21, 2025, 1:34 p.m. UTC | #1
On Wed, Mar 19, 2025 at 06:55:42PM -0700, Changyuan Lyu wrote:
> From: Alexander Graf <graf@amazon.com>
> 
> Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> HandOver is a mechanism that allows Linux to preserve state - arbitrary
> properties as well as memory locations - across kexec.
> 
> It does so using 2 concepts:
> 
>   1) State Tree - Every KHO kexec carries a state tree that describes the
>      state of the system. The state tree is represented as hash-tables.
>      Device drivers can add/remove their data into/from the state tree at
>      system runtime. On kexec, the tree is converted to FDT (flattened
>      device tree).

Why are we changing this? I much preferred the idea of having recursive
FDTs to this notion of copying everything into tables and then out into
an FDT. Now that we have the preserved pages mechanism there is a pretty
direct path to doing recursive FDT.

I feel like this patch is premature; it should come later in the
project along with a stronger justification for this approach.

IMHO, keep things simple for this series, just the very basics.

> +int register_kho_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
> +}
> +EXPORT_SYMBOL_GPL(register_kho_notifier);

And another different set of notifiers? :(

> +static int kho_finalize(void)
> +{
> +	int err = 0;
> +	void *fdt;
> +
> +	fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL);
> +	if (!fdt)
> +		return -ENOMEM;

We go to all the trouble of keeping track of stuff in dynamic hashes
but still can't automatically size the FDT, and we keep the dumb uapi
that makes the user say what it should be? :( :(

Jason
Changyuan Lyu March 23, 2025, 7:02 p.m. UTC | #2
Hi Jason, thanks for reviewing the patchset!

On Fri, Mar 21, 2025 at 10:34:47 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Wed, Mar 19, 2025 at 06:55:42PM -0700, Changyuan Lyu wrote:
> > From: Alexander Graf <graf@amazon.com>
> >
> > Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> > HandOver is a mechanism that allows Linux to preserve state - arbitrary
> > properties as well as memory locations - across kexec.
> >
> > It does so using 2 concepts:
> >
> >   1) State Tree - Every KHO kexec carries a state tree that describes the
> >      state of the system. The state tree is represented as hash-tables.
> >      Device drivers can add/remove their data into/from the state tree at
> >      system runtime. On kexec, the tree is converted to FDT (flattened
> >      device tree).
>
> Why are we changing this? I much prefered the idea of having recursive
> FDTs than this notion copying eveything into tables then out into FDT?
> Now that we have the preserved pages mechanism there is a pretty
> direct path to doing recursive FDT.

We are not copying data into the hashtables; instead, the hashtables only
record the address and size of the data to be serialized into the FDT.
The idea is similar to recording preserved folios in an xarray and then
serializing them to linked pages.

> I feel like this patch is premature, it should come later in the
> project along with a stronger justification for this approach.
>
> IHMO keep things simple for this series, just the very basics.

The main purpose of using hashtables is to enable KHO users to save
data to KHO at any time, not just at the time of activating/finalizing
KHO through sysfs/debugfs. For example, FDBox can save its data into
the KHO tree as soon as a new fd is saved to KHO. Also, using hashtables
allows KHO users to add data to KHO concurrently, while with notifiers,
KHO users' callbacks are executed serially.
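
For illustration, a KHO user built on the helpers in this patch could
save its state at any point roughly like this (the fdbox_* names and the
"mem" property are made up for the example):

    #include <linux/kexec_handover.h>

    static struct kho_node fdbox_node = KHO_NODE_INIT;
    static u64 fdbox_mem[2];

    /* Can be called whenever an FD is registered, long before kexec. */
    static int fdbox_save_to_kho(phys_addr_t addr, u64 size)
    {
            int err;

            /* Property values must stay valid until the FDT is generated. */
            fdbox_mem[0] = addr;
            fdbox_mem[1] = size;

            err = kho_add_string_prop(&fdbox_node, "compatible", "fdbox-v1");
            if (err < 0)
                    return err;

            err = kho_add_prop(&fdbox_node, "mem", fdbox_mem, sizeof(fdbox_mem));
            if (err < 0)
                    return err;

            /* Hook the node into the root of the KHO state tree. */
            err = kho_add_node(NULL, "fdbox", &fdbox_node);
            return err < 0 ? err : 0;
    }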

Regarding the suggestion of recursive FDT, I feel like it is already
doable with this patchset, or even with Mike's V4 patch. A KHO user can
just allocate a buffer, serialize all of its state into the buffer using
libfdt (or even using another binary format), save the address of the
buffer to the KHO tree, and finally register the buffer's underlying
pages/folios with kho_preserve_folio().
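
As a rough sketch of that flow (the gpu_* names are illustrative, and
kho_preserve_folio() only appears later in this series):

    #include <linux/libfdt.h>
    #include <linux/kexec_handover.h>
    #include <linux/mm.h>

    static u64 gpu_sub_fdt_phys;

    static int gpu_save_to_kho(struct kho_node *gpu_node)
    {
            struct folio *folio = folio_alloc(GFP_KERNEL, 0);
            void *fdt;
            int err;

            if (!folio)
                    return -ENOMEM;
            fdt = folio_address(folio);

            /* Build a private sub-FDT in the preserved buffer. */
            err = fdt_create(fdt, PAGE_SIZE);
            err = err ?: fdt_finish_reservemap(fdt);
            err = err ?: fdt_begin_node(fdt, "");
            err = err ?: fdt_property_string(fdt, "compatible", "gpu-v1");
            /* ... serialize the rest of the device state here ... */
            err = err ?: fdt_end_node(fdt);
            err = err ?: fdt_finish(fdt);
            if (err) {
                    err = -EINVAL;
                    goto err_free;
            }

            /* Record where the sub-FDT lives in the KHO state tree ... */
            gpu_sub_fdt_phys = virt_to_phys(fdt);
            err = kho_add_prop(gpu_node, "fdt", &gpu_sub_fdt_phys,
                               sizeof(gpu_sub_fdt_phys));
            if (err < 0)
                    goto err_free;

            /* ... and preserve its backing page across kexec. */
            return kho_preserve_folio(folio);

    err_free:
            folio_put(folio);
            return err;
    }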

> > +int register_kho_notifier(struct notifier_block *nb)
> > +{
> > +	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
> > +}
> > +EXPORT_SYMBOL_GPL(register_kho_notifier);
>
> And another different set of notifiers? :(

I changed the semantics of the notifiers. In Mike's V4, the KHO notifier
is used to pass the FDT pointer to KHO users so they can push data into
the blob. In this patchset, it notifies KHO users of their last chance
to save data to KHO.

It is not necessary for every KHO user to register a
notifier, as they can use the helper functions to save data to the KHO
tree at any time (before the KHO tree is frozen and converted). For
example, FDBox would not need a notifier if it saves data to the KHO
tree as soon as an FD is registered with it.

However, some KHO users may still want to add data just before kexec,
so I kept the notifiers to allow KHO users to get notified when the
state tree hashtables are about to be frozen and converted to an FDT.
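
For illustration, such a last-minute notifier could look roughly like
this (the foo_* names and the "state" property are made up):

    #include <linux/kexec_handover.h>
    #include <linux/notifier.h>

    /* Assumed to have been kho_add_node()ed during driver init. */
    static struct kho_node foo_node = KHO_NODE_INIT;

    static int foo_kho_notifier(struct notifier_block *self,
                                unsigned long cmd, void *v)
    {
            int err = 0;

            switch (cmd) {
            case KEXEC_KHO_FINALIZE:
                    /* Last chance to add data before the tree is frozen. */
                    err = kho_add_string_prop(&foo_node, "state", "finalized");
                    break;
            case KEXEC_KHO_UNFREEZE:
                    /* Finalization was aborted or undone; roll back. */
                    kho_remove_prop(&foo_node, "state", NULL);
                    break;
            default:
                    return NOTIFY_BAD;
            }

            return notifier_from_errno(err < 0 ? err : 0);
    }

    static struct notifier_block foo_kho_nb = {
            .notifier_call = foo_kho_notifier,
    };

    /* somewhere in init code: register_kho_notifier(&foo_kho_nb); */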

> > +static int kho_finalize(void)
> > +{
> > +	int err = 0;
> > +	void *fdt;
> > +
> > +	fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL);
> > +	if (!fdt)
> > +		return -ENOMEM;
>
> We go to all the trouble of keeping track of stuff in dynamic hashes
> but still can't automatically size the fdt and keep the dumb uapi to
> have the user say? :( :(

The reason for keeping fdt_max in this patchset is to simplify support
for kexec_file_load().

We want to be able to do kexec_file_load() first and then do KHO
activation/finalization, to move kexec_file_load() out of the blackout
window. At the time of kexec_file_load(), we need to pass the KHO FDT
address to the new kernel's setup data (x86) or devicetree (arm), but
the KHO FDT is not generated yet. The simple solution used in this
patchset is to reserve a ksegment of size fdt_max and pass the address
of that ksegment to the new kernel. The final FDT is copied into that
ksegment in kernel_kexec().
An extra benefit of this solution is that the reserved ksegment is
physically contiguous.

To completely remove fdt_max, I am considering the idea in [1]. At the
time of kexec_file_load(), we pass the address of an anchor page to
the new kernel, and the anchor page will later be filled with the
physical addresses of the pages containing the FDT blob. Multiple
anchor pages can be linked together. The FDT blob pages can be
physically noncontiguous.

[1] https://lore.kernel.org/all/CA+CK2bBBX+HgD0HLj-AyTScM59F2wXq11BEPgejPMHoEwqj+_Q@mail.gmail.com/
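
To make the idea concrete, the anchor page could be laid out roughly
like this (purely a sketch of the idea, not part of this patchset):

    #include <linux/types.h>

    /* Reserved at kexec_file_load() time, filled in at KHO finalization. */
    struct kho_anchor_page {
            u64 next;          /* phys addr of the next anchor page, 0 if none */
            u32 nr_fdt_pages;  /* number of entries used in fdt_pages[] */
            u32 reserved;
            u64 fdt_pages[];   /* phys addrs of pages holding the FDT blob */
    };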

Best,
Changyuan
Jason Gunthorpe March 24, 2025, 4:28 p.m. UTC | #3
On Sun, Mar 23, 2025 at 12:02:04PM -0700, Changyuan Lyu wrote:

> > Why are we changing this? I much prefered the idea of having recursive
> > FDTs than this notion copying eveything into tables then out into FDT?
> > Now that we have the preserved pages mechanism there is a pretty
> > direct path to doing recursive FDT.
> 
> We are not copying data into the hashtables, instead the hashtables only
> record the address and size of the data to be serialized into FDT.
> The idea is similar to recording preserved folios in xarray
> and then serialize it to linked pages.

I understand that, I mean you are copying the keys/tree/etc. It
doesn't seem like a good idea to me.
 
> > I feel like this patch is premature, it should come later in the
> > project along with a stronger justification for this approach.
> >
> > IHMO keep things simple for this series, just the very basics.
> 
> The main purpose of using hashtables is to enable KHO users to save
> data to KHO at any time, not just at the time of activate/finalize KHO
> through sysfs/debugfs. For example, FDBox can save the data into KHO
> tree once a new fd is saved to KHO. Also, using hashtables allows KHO
> users to add data to KHO concurrently, while with notifiers, KHO users'
> callbacks are executed serially.

This is why I like the recursive FDT scheme. Each serialization
operation can open its own FDT, write to it, and then close it
sequentially within its operation without any worries about
concurrency.

The top level just aggregates the FDT blobs (which are in preserved
memory).

To me all this complexity here with the hash table and the copying
makes no sense compared to that. It is all around slower.

> Regarding the suggestion of recursive FDT, I feel like it is already
> doable with this patchset, or even with Mike's V4 patch. 

Of course it is doable; here we are really talking about what is the
right, recommended way to use this system. Recursive FDT is a better
methodology than hash tables.

> just allocates a buffer, serialize all its states to the buffer using
> libfdt (or even using other binary formats), save the address of the
> buffer to KHO's tree, and finally register the buffer's underlying
> pages/folios with kho_preserve_folio().

Yes, exactly! I think this is how we should operate this system as a
paradigm, not a giant FDT, hash table and so on...

> I changed the semantics of the notifiers. In Mike's V4, the KHO notifier
> is to pass the fdt pointer to KHO users to push data into the blob. In
> this patchset, it notifies KHO users about the last chance for saving
> data to KHO.

I think Mike's semantics make more sense. At least I'd want to see an
actual example of something that wants to do a last-minute adjustment
before adding the code.

> However, some KHO users may still want to add data just before kexec,
> so I kept the notifiers and allow KHO users to get notified when the
> state tree hashtables are about to be frozen and converted to FDT.

Let's try not to add API surface that has no present user as much as
possible, please. You can shove this into speculative patches that
someone can pick up if they need this semantic.
 
> To completely remove fdt_max, I am considering the idea in [1]. At the
> time of kexec_file_load(), we pass the address of an anchor page to
> the new kernel, and the anchor page will later be fulfilled with the
> physical addresses of the pages containing the FDT blob. Multiple
> anchor pages can be linked together. The FDT blob pages can be physically
> noncontiguous.

Yes, this is basically what I suggested too. I think this is much
preferred and doesn't require the wacky uapi.

Except I suggested that you really just need a single u64 to point to a
preserved page holding the top-level FDT.

With recursive FDT I think we can say that no FDT fragment should
exceed PAGE_SIZE, and things become much simpler, IMHO.

Jason
Frank van der Linden March 24, 2025, 6:40 p.m. UTC | #4
On Wed, Mar 19, 2025 at 6:56 PM Changyuan Lyu <changyuanl@google.com> wrote:
>
> From: Alexander Graf <graf@amazon.com>
>
> Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> HandOver is a mechanism that allows Linux to preserve state - arbitrary
> properties as well as memory locations - across kexec.
>
> It does so using 2 concepts:
>
>   1) State Tree - Every KHO kexec carries a state tree that describes the
>      state of the system. The state tree is represented as hash-tables.
>      Device drivers can add/remove their data into/from the state tree at
>      system runtime. On kexec, the tree is converted to FDT (flattened
>      device tree).
>
>   2) Scratch Regions - CMA regions that we allocate in the first kernel.
>      CMA gives us the guarantee that no handover pages land in those
>      regions, because handover pages must be at a static physical memory
>      location. We use these regions as the place to load future kexec
>      images so that they won't collide with any handover data.
>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> ---
>  MAINTAINERS                    |   2 +-
>  include/linux/kexec_handover.h | 109 +++++
>  kernel/Makefile                |   1 +
>  kernel/kexec_handover.c        | 865 +++++++++++++++++++++++++++++++++
>  mm/mm_init.c                   |   8 +
>  5 files changed, 984 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/kexec_handover.h
>  create mode 100644 kernel/kexec_handover.c
[...]
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index 04441c258b05..757659b7a26b 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -30,6 +30,7 @@
>  #include <linux/crash_dump.h>
>  #include <linux/execmem.h>
>  #include <linux/vmstat.h>
> +#include <linux/kexec_handover.h>
>  #include "internal.h"
>  #include "slab.h"
>  #include "shuffle.h"
> @@ -2661,6 +2662,13 @@ void __init mm_core_init(void)
>         report_meminit();
>         kmsan_init_shadow();
>         stack_depot_early_init();
> +
> +       /*
> +        * KHO memory setup must happen while memblock is still active, but
> +        * as close as possible to buddy initialization
> +        */
> +       kho_memory_init();
> +
>         mem_init();
>         kmem_cache_init();
>         /*


Thanks for the work on this.

Obviously it needs to happen while memblock is still active - but why
as close as possible to buddy initialization?

Ordering is always a sticky issue when it comes to doing things during
boot, of course. In this case, I can see scenarios where code that
runs a little earlier may want to use some preserved memory. The
current requirement in the patch set seems to be "after sparse/page
init", but I'm not sure why it needs to be as close as possibly to
buddy init.

- Frank
Changyuan Lyu March 25, 2025, 12:21 a.m. UTC | #5
Hi Jason,

On Mon, Mar 24, 2025 at 13:28:53 -0300, Jason Gunthorpe <jgg@nvidia.com> wrote:
> [...]
> > > I feel like this patch is premature, it should come later in the
> > > project along with a stronger justification for this approach.
> > >
> > > IHMO keep things simple for this series, just the very basics.
> >
> > The main purpose of using hashtables is to enable KHO users to save
> > data to KHO at any time, not just at the time of activate/finalize KHO
> > through sysfs/debugfs. For example, FDBox can save the data into KHO
> > tree once a new fd is saved to KHO. Also, using hashtables allows KHO
> > users to add data to KHO concurrently, while with notifiers, KHO users'
> > callbacks are executed serially.
>
> This is why I like the recursive FDT scheme. Each serialization
> operation can open its own FDT write to it and the close it
> sequenatially within its operation without any worries about
> concurrency.
>
> The top level just aggregates the FDT blobs (which are in preserved
> memory)
>
> To me all this complexity here with the hash table and the copying
> makes no sense compared to that. It is all around slower.
>
> > Regarding the suggestion of recursive FDT, I feel like it is already
> > doable with this patchset, or even with Mike's V4 patch.
>
> Of course it is doable, here we are really talk about what is the
> right, recommended way to use this system. recurisive FDT is a better
> methodology than hash tables
>
> > just allocates a buffer, serialize all its states to the buffer using
> > libfdt (or even using other binary formats), save the address of the
> > buffer to KHO's tree, and finally register the buffer's underlying
> > pages/folios with kho_preserve_folio().
>
> Yes, exactly! I think this is how we should operate this system as a
> paradig, not a giant FDT, hash table and so on...
>
> [...]
> > To completely remove fdt_max, I am considering the idea in [1]. At the
> > time of kexec_file_load(), we pass the address of an anchor page to
> > the new kernel, and the anchor page will later be fulfilled with the
> > physical addresses of the pages containing the FDT blob. Multiple
> > anchor pages can be linked together. The FDT blob pages can be physically
> > noncontiguous.
>
> Yes, this is basically what I suggested too. I think this is much
> prefered and doesn't require the wakky uapi.
>
> Except I suggested you just really need a single u64 to point to a
> preserved page holding the top level FDT.
>
> With recursive FDT I think we can say that no FDT fragement should
> exceed PAGE_SIZE, and things become much simpler, IMHO.

Thanks for the suggestions! I am a little bit concerned about assuming
every FDT fragment is smaller than PAGE_SIZE. In case a child FDT is
larger than PAGE_SIZE, I would like to turn the single u64 in the parent
FDT into a u64 list to record all the underlying pages of the child FDT.

To be concrete and make sure I understand your suggestions correctly,
I drafted the following design.

Suppose we have 2 KHO users, memblock and gpu@0x2000000000; the KHO
FDT (top-level FDT) would look like the following,

    /dts-v1/;
    / {
            compatible = "kho-v1";
            memblock {
                    kho,recursive-fdt = <0x00 0x40001000>;
            };
            gpu@0x2000000000 {
                    kho,recursive-fdt = <0x00 0x40002000>;
            };
    };

kho,recursive-fdt in "memblock" points to a page containing another
FDT,

    / {
            compatible = "memblock-v1";
            n1 {
                    compatible = "reserve-mem-v1";
                    size = <0x04 0x00>;
                    start = <0xc06b 0x4000000>;
            };
            n2 {
                    compatible = "reserve-mem-v1";
                    size = <0x04 0x00>;
                    start = <0xc067 0x4000000>;
            };
    };

Similarly, "kho,recursive-fdt" in "gpu@0x2000000000" points to a page
containing another FDT,

    / {
            compatible = "gpu-v1"
            key1 = "v1";
            key2 = "v2";

            node1 {
                    kho,recursive-fdt = <0x00 0x40003000 0x00 0x40005000>;
            }
            node2 {
                    key3 = "v3";
                    key4 = "v4";
            }
    }

and kho,recursive-fdt in "node1" contains 2 non-contiguous pages backing
the following large FDT fragment,

    / {
            compatible = "gpu-subnode1-v1";

            key5 = "v5";
            key6 = "v6";
            key7 = "v7";
            key8 = "v8";
            ... // many many keys and small values
    }

In this way we assume that most FDT fragments are smaller than 1 page,
so "kho,recursive-fdt" is usually just 1 u64, but we can also handle
larger fragments if that really happens.

I also allow KHO users to add sub-nodes in place, instead of forcing
them to create a new FDT fragment for every sub-node, if the KHO user is
confident that those sub-nodes are small enough to fit in the parent
node's page. In this way we do not need to waste a full page for a small
sub-node. An example is the "memblock" node above.

Finally, the KHO top-level FDT may also be larger than 1 page; this can
be handled using the anchor-page method discussed in the previous mails.

What do you think?

Best,
Changyuan
Jason Gunthorpe March 25, 2025, 2:20 a.m. UTC | #6
On Mon, Mar 24, 2025 at 05:21:45PM -0700, Changyuan Lyu wrote:

> Thanks for the suggestions! I am a little bit concerned about assuming
> every FDT fragment is smaller than PAGE_SIZE. In case a child FDT is
> larger than PAGE_SIZE, I would like to turn the single u64 in the parent
> FDT into a u64 list to record all the underlying pages of the child FDT.

Maybe, but I'd suggest leaving some accommodation for this in the API
but not implementing it until we see proof it is needed. 4k is a lot of
space for an FDT, and if you are doing per-object FDTs I don't see
exceeding it.

For instance, the vfio, memfd, and iommufd object FDTs would not get
close.

> In this way we assume that most FDT fragment is smaller than 1 page so
> "kho,recursive-fdt" is usually just 1 u64, but we can also handle
> larger fragments if that really happens.

Yes, this is close to what I imagine.

You have to decide if the child FDT top pointers will be stored
directly in parent FDTs like you sketched above, or if they should be
stored in some dedicated, allocated and preserved data structure, like
the memory preservation works. There are some tradeoffs in each
direction.

> I also allow KHO users to add sub nodes in-place, instead of forcing
> to create a new FDT fragment for every sub node, if the KHO user is
> confident that those subnodes are small enough to fit in the parent
> node's page. In this way we do not need to waste a full page for a small
> sub node. An example is the "memblock" node above.

Well, I think that sort of misses the bigger picture. What we want is
to run serialization of everything in parallel. So merging like you
say will complicate that.

Really, I think we will have on the order of tens of objects to
serialize, so I don't really care if they use partial pages if that
makes the serialization faster. As long as the memory is freed once
the live update is done, the waste doesn't matter.

> Finally, the KHO top level FDT may also be larger than 1 page, this can
> be handled using the anchor-page method discussed in the previous mails.

This is one of the trade-offs I mentioned. If you inline the objects
as FDT nodes, then you have to scale the FDT across multiple pages.

If you do a binary structure like the memory preservation, then you have
to serialize to something that is inherently scalable and 4k-granular.

The 4k FDT limit really only works if you make liberal use of pointers
to binary data. Anything that is not of a predictable size limit would
be in some related binary structure.

So I'd probably suggest thinking about how to make a multi-page FDT
work in the memory description, but not implementing it now. When we
reach the point where we know we need a multi-page FDT, then someone
would have to implement a growable FDT through vmap or something like
that to make it work.

Keep this initial step simple; we clearly don't need more than a 4k FDT
at this point and we aren't doing a stable kexec ABI either. So simplify,
simplify, simplify to get very thin, minimal functionality merged to
put the fdbox step on top of.

Jason
Mike Rapoport March 25, 2025, 7:19 p.m. UTC | #7
On Mon, Mar 24, 2025 at 11:40:43AM -0700, Frank van der Linden wrote:
> On Wed, Mar 19, 2025 at 6:56 PM Changyuan Lyu <changyuanl@google.com> wrote:
> >
> > From: Alexander Graf <graf@amazon.com>
> >
> > Add the core infrastructure to generate Kexec HandOver metadata. Kexec
> > HandOver is a mechanism that allows Linux to preserve state - arbitrary
> > properties as well as memory locations - across kexec.
> >
> > It does so using 2 concepts:
> >
> >   1) State Tree - Every KHO kexec carries a state tree that describes the
> >      state of the system. The state tree is represented as hash-tables.
> >      Device drivers can add/remove their data into/from the state tree at
> >      system runtime. On kexec, the tree is converted to FDT (flattened
> >      device tree).
> >
> >   2) Scratch Regions - CMA regions that we allocate in the first kernel.
> >      CMA gives us the guarantee that no handover pages land in those
> >      regions, because handover pages must be at a static physical memory
> >      location. We use these regions as the place to load future kexec
> >      images so that they won't collide with any handover data.
> >
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Co-developed-by: Pratyush Yadav <ptyadav@amazon.de>
> > Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> > Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > Co-developed-by: Changyuan Lyu <changyuanl@google.com>
> > Signed-off-by: Changyuan Lyu <changyuanl@google.com>
> > ---
> >  MAINTAINERS                    |   2 +-
> >  include/linux/kexec_handover.h | 109 +++++
> >  kernel/Makefile                |   1 +
> >  kernel/kexec_handover.c        | 865 +++++++++++++++++++++++++++++++++
> >  mm/mm_init.c                   |   8 +
> >  5 files changed, 984 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/kexec_handover.h
> >  create mode 100644 kernel/kexec_handover.c
> [...]
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index 04441c258b05..757659b7a26b 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/crash_dump.h>
> >  #include <linux/execmem.h>
> >  #include <linux/vmstat.h>
> > +#include <linux/kexec_handover.h>
> >  #include "internal.h"
> >  #include "slab.h"
> >  #include "shuffle.h"
> > @@ -2661,6 +2662,13 @@ void __init mm_core_init(void)
> >         report_meminit();
> >         kmsan_init_shadow();
> >         stack_depot_early_init();
> > +
> > +       /*
> > +        * KHO memory setup must happen while memblock is still active, but
> > +        * as close as possible to buddy initialization
> > +        */
> > +       kho_memory_init();
> > +
> >         mem_init();
> >         kmem_cache_init();
> >         /*
> 
> 
> Thanks for the work on this.
> 
> Obviously it needs to happen while memblock is still active - but why
> as close as possible to buddy initialization?

One reason is to have all memblock allocations already done when
autoscaling the scratch area. Another reason is to keep memblock
structures small as long as possible, as memblock_reserve()ing the
preserved memory would inflate them quite a bit.

And it's overall simpler if memblock only allocates from scratch, rather
than doing some early allocations from scratch and some elsewhere while
still making sure they avoid the preserved ranges.
 
> Ordering is always a sticky issue when it comes to doing things during
> boot, of course. In this case, I can see scenarios where code that
> runs a little earlier may want to use some preserved memory. The

Can you elaborate on such scenarios?

> current requirement in the patch set seems to be "after sparse/page
> init", but I'm not sure why it needs to be as close as possibly to
> buddy init.

Why would you say that sparse/page init would be a requirement here?
 
> - Frank
Frank van der Linden March 25, 2025, 9:56 p.m. UTC | #8
On Tue, Mar 25, 2025 at 12:19 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Mon, Mar 24, 2025 at 11:40:43AM -0700, Frank van der Linden wrote:
[...]
> > Thanks for the work on this.
> >
> > Obviously it needs to happen while memblock is still active - but why
> > as close as possible to buddy initialization?
>
> One reason is to have all memblock allocations done to autoscale the
> scratch area. Another reason is to keep memblock structures small as long
> as possible as memblock_reserve()ing the preserved memory would quite
> inflate them.
>
> And it's overall simpler if memblock only allocates from scratch rather
> than doing some of early allocations from scratch and some elsewhere and
> still making sure they avoid the preserved ranges.

Ah, thanks, I see the argument for the scratch area sizing.

>
> > Ordering is always a sticky issue when it comes to doing things during
> > boot, of course. In this case, I can see scenarios where code that
> > runs a little earlier may want to use some preserved memory. The
>
> Can you elaborate about such scenarios?

There has, for example, been some talk about making hugetlbfs
persistent. You could have hugetlb_cma active. The hugetlb CMA areas
are set up quite early, quite some time before KHO restores memory. So
that would have to be changed somehow if the location of the KHO init
call were to remain as close to buddy init as possible. I
suspect there may be other uses.

Although I suppose you could just look up the addresses and then
reserve them yourself; you would just need the KHO FDT to be
initialized. And you'd need to avoid the KHO bitmap deserialization
trying to redo the ranges you've already done.

>
> > current requirement in the patch set seems to be "after sparse/page
> > init", but I'm not sure why it needs to be as close as possibly to
> > buddy init.
>
> Why would you say that sparse/page init would be a requirement here?

At least in its current form, the KHO code expects vmemmap to be
initialized, as it does its restore based on page structures, which
deserialize_bitmap expects. I think the use of the page->private
field was discussed in a separate thread. If that is done
differently, it wouldn't rely on vmemmap being initialized.

A few more things I've noticed (not sure if these were discussed before):

* Should KHO depend on CONFIG_DEFERRED_STRUCT_PAGE_INIT? Essentially,
marking memblock ranges as NOINIT doesn't work without
DEFERRED_STRUCT_PAGE_INIT. Although, if the page->private use
disappears, this wouldn't be an issue anymore.
* As a future extension, it could be nice to store vmemmap init
information in the KHO FDT. Then you can use that to init ranges in an
optimized way (HVO hugetlb or DAX-style persisted ranges) straight
away.

- Frank




> > - Frank
>
> --
> Sincerely yours,
> Mike.
Mike Rapoport March 26, 2025, 11:59 a.m. UTC | #9
On Tue, Mar 25, 2025 at 02:56:52PM -0700, Frank van der Linden wrote:
> On Tue, Mar 25, 2025 at 12:19 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Mon, Mar 24, 2025 at 11:40:43AM -0700, Frank van der Linden wrote:
> [...]
> > > Thanks for the work on this.
> > >
> > > Obviously it needs to happen while memblock is still active - but why
> > > as close as possible to buddy initialization?
> >
> > One reason is to have all memblock allocations done to autoscale the
> > scratch area. Another reason is to keep memblock structures small as long
> > as possible as memblock_reserve()ing the preserved memory would quite
> > inflate them.
> >
> > And it's overall simpler if memblock only allocates from scratch rather
> > than doing some of early allocations from scratch and some elsewhere and
> > still making sure they avoid the preserved ranges.
> 
> Ah, thanks, I see the argument for the scratch area sizing.
> 
> >
> > > Ordering is always a sticky issue when it comes to doing things during
> > > boot, of course. In this case, I can see scenarios where code that
> > > runs a little earlier may want to use some preserved memory. The
> >
> > Can you elaborate about such scenarios?
> 
> There has, for example, been some talk about making hugetlbfs
> persistent. You could have hugetlb_cma active. The hugetlb CMA areas
> are set up quite early, quite some time before KHO restores memory. So
> that would have to be changed somehow if the location of the KHO init
> call would remain as close as possible to buddy init as possible. I
> suspect there may be other uses.

I think we can address this when/if implementing preservation for
hugetlbfs, and it will be tricky.
If hugetlb in the first kernel uses a lot of memory, we just won't have
enough scratch space for early hugetlb reservations in the second kernel
regardless of hugetlb_cma. On the other hand, we already have the
preserved hugetlbfs memory, so we'd probably need to reserve less memory
in the second kernel.

But anyway, it's a completely different discussion, about how to
preserve hugetlbfs.
 
> > > current requirement in the patch set seems to be "after sparse/page
> > > init", but I'm not sure why it needs to be as close as possibly to
> > > buddy init.
> >
> > Why would you say that sparse/page init would be a requirement here?
> 
> At least in its current form, the KHO code expects vmemmap to be
> initialized, as it does its restore base on page structures, as
> deserialize_bitmap expects them. I think the use of the page->private
> field was discussed in a separate thread, I think. If that is done
> differently, it wouldn't rely on vmemmap being initialized.

In its current form KHO does rely on vmemmap being allocated, but it does
not rely on it being initialized. Marking memblock ranges NOINIT ensures
nothing touches the corresponding struct pages, and KHO can use their
fields up to the point the memory is returned to KHO callers.
 
> A few more things I've noticed (not sure if these were discussed before):
> 
> * Should KHO depend on CONFIG_DEFERRED_STRUCT_PAGE_INIT? Essentially,
> marking memblock ranges as NOINIT doesn't work without
> DEFERRED_STRUCT_PAGE_INIT. Although, if the page->private use
> disappears, this wouldn't be an issue anymore.

It does.
memmap_init_reserved_pages() is always called, no matter whether
CONFIG_DEFERRED_STRUCT_PAGE_INIT is set or not, and it skips
initialization of NOINIT regions.

> * As a future extension, it could be nice to store vmemmap init
> information in the KHO FDT. Then you can use that to init ranges in an
> optimized way (HVO hugetlb or DAX-style persisted ranges) straight
> away.
 
These days the memmap contents are unstable because of the folio/memdesc
project, but in general carrying memory map data from kernel to kernel is
indeed something to consider.

> - Frank
Frank van der Linden March 26, 2025, 4:25 p.m. UTC | #10
On Wed, Mar 26, 2025 at 4:59 AM Mike Rapoport <rppt@kernel.org> wrote:
[...]
> > There has, for example, been some talk about making hugetlbfs
> > persistent. You could have hugetlb_cma active. The hugetlb CMA areas
> > are set up quite early, quite some time before KHO restores memory. So
> > that would have to be changed somehow if the location of the KHO init
> > call would remain as close as possible to buddy init as possible. I
> > suspect there may be other uses.
>
> I think we can address this when/if implementing preservation for hugetlbfs
> and it will be tricky.
> If hugetlb in the first kernel uses a lot of memory, we just won't have
> enough scratch space for early hugetlb reservations in the second kernel
> regardless of hugetlb_cma. On the other hand, we already have the preserved
> hugetlbfs memory, so we'd probably need to reserve less memory in the
> second kernel.
>
> But anyway, it's completely different discussion about how to preserve
> hugetlbfs.

Right, there would have to be a KHO interface to carry over the
early reserved memory and reinit it early too.

>
> > > > current requirement in the patch set seems to be "after sparse/page
> > > > init", but I'm not sure why it needs to be as close as possibly to
> > > > buddy init.
> > >
> > > Why would you say that sparse/page init would be a requirement here?
> >
> > At least in its current form, the KHO code expects vmemmap to be
> > initialized, as it does its restore base on page structures, as
> > deserialize_bitmap expects them. I think the use of the page->private
> > field was discussed in a separate thread, I think. If that is done
> > differently, it wouldn't rely on vmemmap being initialized.
>
> In the current form KHO does relies on vmemmap being allocated, but it does
> not rely on it being initialized. Marking memblock ranges NOINT ensures
> nothing touches the corresponding struct pages and KHO can use their fields
> up to the point the memory is returned to KHO callers.
>
> > A few more things I've noticed (not sure if these were discussed before):
> >
> > * Should KHO depend on CONFIG_DEFERRED_STRUCT_PAGE_INIT? Essentially,
> > marking memblock ranges as NOINIT doesn't work without
> > DEFERRED_STRUCT_PAGE_INIT. Although, if the page->private use
> > disappears, this wouldn't be an issue anymore.
>
> It does.
> memmap_init_reserved_pages() is called always, no matter of
> CONFIG_DEFERRED_STRUCT_PAGE_INIT is set or not and it skips initialization
> of NOINIT regions.

Yeah, I see - the ordering makes this work out.

MEMBLOCK_RSRV_NOINIT is a bit confusing in the sense that if you do a
memblock allocation in the !CONFIG_DEFERRED_STRUCT_PAGE_INIT case, and
that allocation is done before free_area_init(), the pages will always
get initialized regardless, since memmap_init_range() will do it. But
this is done before the KHO deserialization, so it works out.

>
> > * As a future extension, it could be nice to store vmemmap init
> > information in the KHO FDT. Then you can use that to init ranges in an
> > optimized way (HVO hugetlb or DAX-style persisted ranges) straight
> > away.
>
> These days memmap contents is unstable because of the folio/memdesc
> project, but in general carrying memory map data from kernel to kernel is
> indeed something to consider.

Yes, I think we might have a need for that, but we'll see.

Thanks,

- Frank

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 12852355bd66..a000a277ccf7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12828,7 +12828,7 @@  F:	include/linux/kernfs.h
 KEXEC
 L:	kexec@lists.infradead.org
 W:	http://kernel.org/pub/linux/utils/kernel/kexec/
-F:	include/linux/kexec.h
+F:	include/linux/kexec*.h
 F:	include/uapi/linux/kexec.h
 F:	kernel/kexec*
 
diff --git a/include/linux/kexec_handover.h b/include/linux/kexec_handover.h
new file mode 100644
index 000000000000..9cd9ad31e2d1
--- /dev/null
+++ b/include/linux/kexec_handover.h
@@ -0,0 +1,109 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_KEXEC_HANDOVER_H
+#define LINUX_KEXEC_HANDOVER_H
+
+#include <linux/types.h>
+#include <linux/hashtable.h>
+#include <linux/notifier.h>
+
+struct kho_scratch {
+	phys_addr_t addr;
+	phys_addr_t size;
+};
+
+/* KHO Notifier index */
+enum kho_event {
+	KEXEC_KHO_FINALIZE = 0,
+	KEXEC_KHO_UNFREEZE = 1,
+};
+
+#define KHO_HASHTABLE_BITS 3
+#define KHO_NODE_INIT                                        \
+	{                                                    \
+		.props = HASHTABLE_INIT(KHO_HASHTABLE_BITS), \
+		.nodes = HASHTABLE_INIT(KHO_HASHTABLE_BITS), \
+	}
+
+struct kho_node {
+	struct hlist_node hlist;
+
+	const char *name;
+	DECLARE_HASHTABLE(props, KHO_HASHTABLE_BITS);
+	DECLARE_HASHTABLE(nodes, KHO_HASHTABLE_BITS);
+
+	struct list_head list;
+	bool visited;
+};
+
+#ifdef CONFIG_KEXEC_HANDOVER
+bool kho_is_enabled(void);
+void kho_init_node(struct kho_node *node);
+int kho_add_node(struct kho_node *parent, const char *name,
+		 struct kho_node *child);
+struct kho_node *kho_remove_node(struct kho_node *parent, const char *name);
+int kho_add_prop(struct kho_node *node, const char *key, const void *val,
+		 u32 size);
+void *kho_remove_prop(struct kho_node *node, const char *key, u32 *size);
+int kho_add_string_prop(struct kho_node *node, const char *key,
+			const char *val);
+
+int register_kho_notifier(struct notifier_block *nb);
+int unregister_kho_notifier(struct notifier_block *nb);
+
+void kho_memory_init(void);
+#else
+static inline bool kho_is_enabled(void)
+{
+	return false;
+}
+
+static inline void kho_init_node(struct kho_node *node)
+{
+}
+
+static inline int kho_add_node(struct kho_node *parent, const char *name,
+			       struct kho_node *child)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline struct kho_node *kho_remove_node(struct kho_node *parent,
+					       const char *name)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline int kho_add_prop(struct kho_node *node, const char *key,
+			       const void *val, u32 size)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void *kho_remove_prop(struct kho_node *node, const char *key,
+				    u32 *size)
+{
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static inline int kho_add_string_prop(struct kho_node *node, const char *key,
+				      const char *val)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int register_kho_notifier(struct notifier_block *nb)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int unregister_kho_notifier(struct notifier_block *nb)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void kho_memory_init(void)
+{
+}
+#endif /* CONFIG_KEXEC_HANDOVER */
+
+#endif /* LINUX_KEXEC_HANDOVER_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index 87866b037fbe..cef5377c25cd 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -75,6 +75,7 @@  obj-$(CONFIG_CRASH_DUMP) += crash_core.o
 obj-$(CONFIG_KEXEC) += kexec.o
 obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
 obj-$(CONFIG_KEXEC_ELF) += kexec_elf.o
+obj-$(CONFIG_KEXEC_HANDOVER) += kexec_handover.o
 obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
 obj-$(CONFIG_COMPAT) += compat.o
 obj-$(CONFIG_CGROUPS) += cgroup/
diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
new file mode 100644
index 000000000000..df0d9debbb64
--- /dev/null
+++ b/kernel/kexec_handover.c
@@ -0,0 +1,865 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * kexec_handover.c - kexec handover metadata processing
+ * Copyright (C) 2023 Alexander Graf <graf@amazon.com>
+ * Copyright (C) 2025 Microsoft Corporation, Mike Rapoport <rppt@kernel.org>
+ * Copyright (C) 2024 Google LLC
+ */
+
+#define pr_fmt(fmt) "KHO: " fmt
+
+#include <linux/cma.h>
+#include <linux/kexec.h>
+#include <linux/libfdt.h>
+#include <linux/debugfs.h>
+#include <linux/memblock.h>
+#include <linux/notifier.h>
+#include <linux/kexec_handover.h>
+#include <linux/page-isolation.h>
+#include <linux/rwsem.h>
+#include <linux/xxhash.h>
+/*
+ * KHO is tightly coupled with mm init and needs access to some of mm
+ * internal APIs.
+ */
+#include "../mm/internal.h"
+#include "kexec_internal.h"
+
+static bool kho_enable __ro_after_init;
+
+bool kho_is_enabled(void)
+{
+	return kho_enable;
+}
+EXPORT_SYMBOL_GPL(kho_is_enabled);
+
+static int __init kho_parse_enable(char *p)
+{
+	return kstrtobool(p, &kho_enable);
+}
+early_param("kho", kho_parse_enable);
+
+/*
+ * With KHO enabled, memory can become fragmented because KHO regions may
+ * be anywhere in physical address space. The scratch regions give us a
+ * safe zones that we will never see KHO allocations from. This is where we
+ * can later safely load our new kexec images into and then use the scratch
+ * area for early allocations that happen before page allocator is
+ * initialized.
+ */
+static struct kho_scratch *kho_scratch;
+static unsigned int kho_scratch_cnt;
+
+static struct dentry *debugfs_root;
+
+struct kho_out {
+	struct blocking_notifier_head chain_head;
+
+	struct debugfs_blob_wrapper fdt_wrapper;
+	struct dentry *fdt_file;
+	struct dentry *dir;
+
+	struct rw_semaphore tree_lock;
+	struct kho_node root;
+
+	void *fdt;
+	u64 fdt_max;
+};
+
+static struct kho_out kho_out = {
+	.chain_head = BLOCKING_NOTIFIER_INIT(kho_out.chain_head),
+	.tree_lock = __RWSEM_INITIALIZER(kho_out.tree_lock),
+	.root = KHO_NODE_INIT,
+	.fdt_max = 10 * SZ_1M,
+};
+
+int register_kho_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(register_kho_notifier);
+
+int unregister_kho_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&kho_out.chain_head, nb);
+}
+EXPORT_SYMBOL_GPL(unregister_kho_notifier);
+
+/* Helper functions for KHO state tree */
+
+struct kho_prop {
+	struct hlist_node hlist;
+
+	const char *key;
+	const void *val;
+	u32 size;
+};
+
+static unsigned long strhash(const char *s)
+{
+	return xxhash(s, strlen(s), 1120);
+}
+
+void kho_init_node(struct kho_node *node)
+{
+	hash_init(node->props);
+	hash_init(node->nodes);
+}
+EXPORT_SYMBOL_GPL(kho_init_node);
+
+/**
+ * kho_add_node - add a child node to a parent node.
+ * @parent: parent node to add to.
+ * @name: name of the child node.
+ * @child: child node to be added to @parent with @name.
+ *
+ * If @parent is NULL, @child is added to KHO state tree root node.
+ *
+ * @child must be a valid pointer through KHO FDT finalization.
+ * @name is duplicated and thus can have a short lifetime.
+ *
+ * Callers must use their own locking if there are concurrent accesses to
+ * @parent or @child.
+ *
+ * Return: 0 on success, 1 if @child is already in @parent with @name, or
+ *   - -EOPNOTSUPP: KHO is not enabled in the kernel command line,
+ *   - -ENOMEM: failed to duplicate @name,
+ *   - -EBUSY: KHO state tree has been converted to FDT,
+ *   - -EEXIST: another node of the same name has been added to the parent.
+ */
+int kho_add_node(struct kho_node *parent, const char *name,
+		 struct kho_node *child)
+{
+	unsigned long name_hash;
+	int ret = 0;
+	struct kho_node *node;
+	char *child_name;
+
+	if (!kho_enable)
+		return -EOPNOTSUPP;
+
+	if (!parent)
+		parent = &kho_out.root;
+
+	child_name = kstrdup(name, GFP_KERNEL);
+	if (!child_name)
+		return -ENOMEM;
+
+	name_hash = strhash(child_name);
+
+	if (parent == &kho_out.root)
+		down_write(&kho_out.tree_lock);
+	else
+		down_read(&kho_out.tree_lock);
+
+	if (kho_out.fdt) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	hash_for_each_possible(parent->nodes, node, hlist, name_hash) {
+		if (!strcmp(node->name, child_name)) {
+			ret = node == child ? 1 : -EEXIST;
+			break;
+		}
+	}
+
+	if (ret == 0) {
+		child->name = child_name;
+		hash_add(parent->nodes, &child->hlist, name_hash);
+	}
+
+out:
+	if (parent == &kho_out.root)
+		up_write(&kho_out.tree_lock);
+	else
+		up_read(&kho_out.tree_lock);
+
+	if (ret)
+		kfree(child_name);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kho_add_node);
+
+/**
+ * kho_remove_node - remove a child node from a parent node.
+ * @parent: parent node to look up for.
+ * @name: name of the child node.
+ *
+ * If @parent is NULL, KHO state tree root node is looked up.
+ *
+ * Callers must use their own locking if there are concurrent accesses to
+ * @parent or @child.
+ *
+ * Return: the pointer to the child node on success, or an error pointer,
+ *   - -EOPNOTSUPP: KHO is not enabled in the kernel command line,
+ *   - -ENOENT: no node named @name is found.
+ *   - -EBUSY: KHO state tree has been converted to FDT.
+ */
+struct kho_node *kho_remove_node(struct kho_node *parent, const char *name)
+{
+	struct kho_node *child, *ret = ERR_PTR(-ENOENT);
+	unsigned long name_hash;
+
+	if (!kho_enable)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (!parent)
+		parent = &kho_out.root;
+
+	name_hash = strhash(name);
+
+	if (parent == &kho_out.root)
+		down_write(&kho_out.tree_lock);
+	else
+		down_read(&kho_out.tree_lock);
+
+	if (kho_out.fdt) {
+		ret = ERR_PTR(-EBUSY);
+		goto out;
+	}
+
+	hash_for_each_possible(parent->nodes, child, hlist, name_hash) {
+		if (!strcmp(child->name, name)) {
+			ret = child;
+			break;
+		}
+	}
+
+	if (!IS_ERR(ret)) {
+		hash_del(&ret->hlist);
+		kfree(ret->name);
+		ret->name = NULL;
+	}
+
+out:
+	if (parent == &kho_out.root)
+		up_write(&kho_out.tree_lock);
+	else
+		up_read(&kho_out.tree_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kho_remove_node);
+
+/**
+ * kho_add_prop - add a property to a node.
+ * @node: KHO node to add the property to.
+ * @key: key of the property.
+ * @val: pointer to the property value.
+ * @size: size of the property value in bytes.
+ *
+ * @val and @key must be valid pointers through KHO FDT finalization.
+ * Generally @key is a string literal with static lifetime.
+ *
+ * Callers must use their own locking if there are concurrent accesses to @node.
+ *
+ * Return: 0 on success, 1 if the value is already added with @key, or
+ *   - -ENOMEM: failed to allocate memory,
+ *   - -EBUSY: KHO state tree has been converted to FDT,
+ *   - -EEXIST: another property of the same key exists.
+ */
+int kho_add_prop(struct kho_node *node, const char *key, const void *val,
+		 u32 size)
+{
+	unsigned long key_hash;
+	int ret = 0;
+	struct kho_prop *prop, *p;
+
+	key_hash = strhash(key);
+	prop = kmalloc(sizeof(*prop), GFP_KERNEL);
+	if (!prop)
+		return -ENOMEM;
+
+	prop->key = key;
+	prop->val = val;
+	prop->size = size;
+
+	down_read(&kho_out.tree_lock);
+	if (kho_out.fdt) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	hash_for_each_possible(node->props, p, hlist, key_hash) {
+		if (!strcmp(p->key, key)) {
+			ret = (p->val == val && p->size == size) ? 1 : -EEXIST;
+			break;
+		}
+	}
+
+	if (!ret)
+		hash_add(node->props, &prop->hlist, key_hash);
+
+out:
+	up_read(&kho_out.tree_lock);
+
+	if (ret)
+		kfree(prop);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kho_add_prop);
+
+/**
+ * kho_add_string_prop - add a string property to a node.
+ *
+ * See kho_add_prop() for details.
+ */
+int kho_add_string_prop(struct kho_node *node, const char *key, const char *val)
+{
+	return kho_add_prop(node, key, val, strlen(val) + 1);
+}
+EXPORT_SYMBOL_GPL(kho_add_string_prop);
+
+/**
+ * kho_remove_prop - remove a property from a node.
+ * @node: KHO node to remove the property from.
+ * @key: key of the property.
+ * @size: if non-NULL, the property size is stored in it on success.
+ *
+ * Callers must use their own locking if there are concurrent accesses to @node.
+ *
+ * Return: the pointer to the property value, or
+ *   - -EBUSY: KHO state tree has been converted to FDT,
+ *   - -ENOENT: no property with @key is found.
+ */
+void *kho_remove_prop(struct kho_node *node, const char *key, u32 *size)
+{
+	struct kho_prop *p, *prop = NULL;
+	unsigned long key_hash;
+	void *ret = ERR_PTR(-ENOENT);
+
+	key_hash = strhash(key);
+
+	down_read(&kho_out.tree_lock);
+
+	if (kho_out.fdt) {
+		ret = ERR_PTR(-EBUSY);
+		goto out;
+	}
+
+	hash_for_each_possible(node->props, p, hlist, key_hash) {
+		if (!strcmp(p->key, key)) {
+			prop = p;
+			break;
+		}
+	}
+
+	if (prop) {
+		ret = (void *)prop->val;
+		if (size)
+			*size = prop->size;
+		hash_del(&prop->hlist);
+		kfree(prop);
+	}
+
+out:
+	up_read(&kho_out.tree_lock);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kho_remove_prop);
+
+static int kho_out_update_debugfs_fdt(void)
+{
+	int err = 0;
+
+	if (kho_out.fdt) {
+		kho_out.fdt_wrapper.data = kho_out.fdt;
+		kho_out.fdt_wrapper.size = fdt_totalsize(kho_out.fdt);
+		kho_out.fdt_file = debugfs_create_blob("fdt", 0400, kho_out.dir,
+						       &kho_out.fdt_wrapper);
+		if (IS_ERR(kho_out.fdt_file))
+			err = -ENOENT;
+	} else {
+		debugfs_remove(kho_out.fdt_file);
+	}
+
+	return err;
+}
+
+static int kho_unfreeze(void)
+{
+	int err;
+	void *fdt;
+
+	down_write(&kho_out.tree_lock);
+	fdt = kho_out.fdt;
+	kho_out.fdt = NULL;
+	up_write(&kho_out.tree_lock);
+
+	if (fdt)
+		kvfree(fdt);
+
+	err = blocking_notifier_call_chain(&kho_out.chain_head,
+					   KEXEC_KHO_UNFREEZE, NULL);
+	err = notifier_to_errno(err);
+
+	return notifier_to_errno(err);
+}
+
+static int kho_flatten_tree(void *fdt)
+{
+	int iter, err = 0;
+	struct kho_node *node, *sub_node;
+	struct list_head *ele;
+	struct kho_prop *prop;
+	LIST_HEAD(stack);
+
+	kho_out.root.visited = false;
+	list_add(&kho_out.root.list, &stack);
+
+	for (ele = stack.next; !list_is_head(ele, &stack); ele = stack.next) {
+		node = list_entry(ele, struct kho_node, list);
+
+		if (node->visited) {
+			err = fdt_end_node(fdt);
+			if (err)
+				return err;
+			list_del_init(ele);
+			continue;
+		}
+
+		err = fdt_begin_node(fdt, node->name);
+		if (err)
+			return err;
+
+		hash_for_each(node->props, iter, prop, hlist) {
+			err = fdt_property(fdt, prop->key, prop->val,
+					   prop->size);
+			if (err)
+				return err;
+		}
+
+		hash_for_each(node->nodes, iter, sub_node, hlist) {
+			sub_node->visited = false;
+			list_add(&sub_node->list, &stack);
+		}
+
+		node->visited = true;
+	}
+
+	return 0;
+}
+
+static int kho_convert_tree(void *buffer, int size)
+{
+	void *fdt = buffer;
+	int err = 0;
+
+	err = fdt_create(fdt, size);
+	if (err)
+		goto out;
+
+	err = fdt_finish_reservemap(fdt);
+	if (err)
+		goto out;
+
+	err = kho_flatten_tree(fdt);
+	if (err)
+		goto out;
+
+	err = fdt_finish(fdt);
+	if (err)
+		goto out;
+
+	err = fdt_check_header(fdt);
+	if (err)
+		goto out;
+
+out:
+	if (err) {
+		pr_err("failed to flatten state tree: %d\n", err);
+		return -EINVAL;
+	}
+	return 0;
+}
+
+static int kho_finalize(void)
+{
+	int err = 0;
+	void *fdt;
+
+	fdt = kvmalloc(kho_out.fdt_max, GFP_KERNEL);
+	if (!fdt)
+		return -ENOMEM;
+
+	err = blocking_notifier_call_chain(&kho_out.chain_head,
+					   KEXEC_KHO_FINALIZE, NULL);
+	err = notifier_to_errno(err);
+	if (err)
+		goto unfreeze;
+
+	down_write(&kho_out.tree_lock);
+	kho_out.fdt = fdt;
+	up_write(&kho_out.tree_lock);
+
+	err = kho_convert_tree(fdt, kho_out.fdt_max);
+
+unfreeze:
+	if (err) {
+		int abort_err;
+
+		pr_err("Failed to convert KHO state tree: %d\n", err);
+
+		abort_err = kho_unfreeze();
+		if (abort_err)
+			pr_err("Failed to abort KHO state tree: %d\n",
+			       abort_err);
+	}
+
+	return err;
+}
+
+/* Handling for debug/kho/out */
+static int kho_out_finalize_get(void *data, u64 *val)
+{
+	*val = !!kho_out.fdt;
+
+	return 0;
+}
+
+static int kho_out_finalize_set(void *data, u64 _val)
+{
+	int ret = 0;
+	bool val = !!_val;
+
+	if (!kexec_trylock())
+		return -EBUSY;
+
+	if (val == !!kho_out.fdt) {
+		if (kho_out.fdt)
+			ret = -EEXIST;
+		else
+			ret = -ENOENT;
+		goto unlock;
+	}
+
+	if (val)
+		ret = kho_finalize();
+	else
+		ret = kho_unfreeze();
+
+	if (ret)
+		goto unlock;
+
+	ret = kho_out_update_debugfs_fdt();
+
+unlock:
+	kexec_unlock();
+	return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_finalize, kho_out_finalize_get,
+			 kho_out_finalize_set, "%llu\n");
+
+static int kho_out_fdt_max_get(void *data, u64 *val)
+{
+	*val = kho_out.fdt_max;
+
+	return 0;
+}
+
+static int kho_out_fdt_max_set(void *data, u64 val)
+{
+	int ret = 0;
+
+	if (!kexec_trylock()) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	/* FDT already exists, it's too late to change fdt_max */
+	if (kho_out.fdt) {
+		ret = -EBUSY;
+		goto unlock;
+	}
+
+	kho_out.fdt_max = val;
+
+unlock:
+	kexec_unlock();
+	return ret;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(fops_kho_out_fdt_max, kho_out_fdt_max_get,
+			 kho_out_fdt_max_set, "%llu\n");
+
+static int scratch_phys_show(struct seq_file *m, void *v)
+{
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		seq_printf(m, "0x%llx\n", kho_scratch[i].addr);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_phys);
+
+static int scratch_len_show(struct seq_file *m, void *v)
+{
+	for (int i = 0; i < kho_scratch_cnt; i++)
+		seq_printf(m, "0x%llx\n", kho_scratch[i].size);
+
+	return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(scratch_len);
+
+static __init int kho_out_debugfs_init(void)
+{
+	struct dentry *dir, *f;
+
+	dir = debugfs_create_dir("out", debugfs_root);
+	if (IS_ERR(dir))
+		return -ENOMEM;
+
+	f = debugfs_create_file("scratch_phys", 0400, dir, NULL,
+				&scratch_phys_fops);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	f = debugfs_create_file("scratch_len", 0400, dir, NULL,
+				&scratch_len_fops);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	f = debugfs_create_file("fdt_max", 0600, dir, NULL,
+				&fops_kho_out_fdt_max);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	f = debugfs_create_file("finalize", 0600, dir, NULL,
+				&fops_kho_out_finalize);
+	if (IS_ERR(f))
+		goto err_rmdir;
+
+	kho_out.dir = dir;
+	return 0;
+
+err_rmdir:
+	debugfs_remove_recursive(dir);
+	return -ENOENT;
+}
+
+static __init int kho_init(void)
+{
+	int err;
+
+	if (!kho_enable)
+		return 0;
+
+	kho_out.root.name = "";
+	err = kho_add_string_prop(&kho_out.root, "compatible", "kho-v1");
+	if (err)
+		goto err_free_scratch;
+
+	debugfs_root = debugfs_create_dir("kho", NULL);
+	if (IS_ERR(debugfs_root)) {
+		err = -ENOENT;
+		goto err_free_scratch;
+	}
+
+	err = kho_out_debugfs_init();
+	if (err)
+		goto err_free_scratch;
+
+	for (int i = 0; i < kho_scratch_cnt; i++) {
+		unsigned long base_pfn = PHYS_PFN(kho_scratch[i].addr);
+		unsigned long count = kho_scratch[i].size >> PAGE_SHIFT;
+		unsigned long pfn;
+
+		for (pfn = base_pfn; pfn < base_pfn + count;
+		     pfn += pageblock_nr_pages)
+			init_cma_reserved_pageblock(pfn_to_page(pfn));
+	}
+
+	return 0;
+
+err_free_scratch:
+	for (int i = 0; i < kho_scratch_cnt; i++) {
+		void *start = __va(kho_scratch[i].addr);
+		void *end = start + kho_scratch[i].size;
+
+		free_reserved_area(start, end, -1, "");
+	}
+	kho_enable = false;
+	return err;
+}
+late_initcall(kho_init);
+
+/*
+ * The scratch areas are scaled by default as percent of memory allocated from
+ * memblock. A user can override the scale with command line parameter:
+ *
+ * kho_scratch=N%
+ *
+ * It is also possible to explicitly define size for a lowmem, a global and
+ * per-node scratch areas:
+ *
+ * kho_scratch=l[KMG],n[KMG],m[KMG]
+ *
+ * The explicit size definition takes precedence over scale definition.
+ */
+static unsigned int scratch_scale __initdata = 200;
+static phys_addr_t scratch_size_global __initdata;
+static phys_addr_t scratch_size_pernode __initdata;
+static phys_addr_t scratch_size_lowmem __initdata;
+
+static int __init kho_parse_scratch_size(char *p)
+{
+	unsigned long size, size_pernode, size_global;
+	char *endptr, *oldp = p;
+
+	if (!p)
+		return -EINVAL;
+
+	size = simple_strtoul(p, &endptr, 0);
+	if (*endptr == '%') {
+		scratch_scale = size;
+		pr_notice("scratch scale is %d percent\n", scratch_scale);
+	} else {
+		size = memparse(p, &p);
+		if (!size || p == oldp)
+			return -EINVAL;
+
+		if (*p != ',')
+			return -EINVAL;
+
+		oldp = p;
+		size_global = memparse(p + 1, &p);
+		if (!size_global || p == oldp)
+			return -EINVAL;
+
+		if (*p != ',')
+			return -EINVAL;
+
+		size_pernode = memparse(p + 1, &p);
+		if (!size_pernode)
+			return -EINVAL;
+
+		scratch_size_lowmem = size;
+		scratch_size_global = size_global;
+		scratch_size_pernode = size_pernode;
+		scratch_scale = 0;
+
+		pr_notice("scratch areas: lowmem: %lluMB global: %lluMB pernode: %lldMB\n",
+			  (u64)(scratch_size_lowmem >> 20),
+			  (u64)(scratch_size_global >> 20),
+			  (u64)(scratch_size_pernode >> 20));
+	}
+
+	return 0;
+}
+early_param("kho_scratch", kho_parse_scratch_size);
+
+static void __init scratch_size_update(void)
+{
+	phys_addr_t size;
+
+	if (!scratch_scale)
+		return;
+
+	size = memblock_reserved_kern_size(ARCH_LOW_ADDRESS_LIMIT,
+					   NUMA_NO_NODE);
+	size = size * scratch_scale / 100;
+	scratch_size_lowmem = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+
+	size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
+					   NUMA_NO_NODE);
+	size = size * scratch_scale / 100 - scratch_size_lowmem;
+	scratch_size_global = round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+static phys_addr_t __init scratch_size_node(int nid)
+{
+	phys_addr_t size;
+
+	if (scratch_scale) {
+		size = memblock_reserved_kern_size(MEMBLOCK_ALLOC_ANYWHERE,
+						   nid);
+		size = size * scratch_scale / 100;
+	} else {
+		size = scratch_size_pernode;
+	}
+
+	return round_up(size, CMA_MIN_ALIGNMENT_BYTES);
+}
+
+/**
+ * kho_reserve_scratch - Reserve a contiguous chunk of memory for kexec
+ *
+ * With KHO we can preserve arbitrary pages in the system. To ensure we still
+ * have a large contiguous region of memory when we search the physical address
+ * space for target memory, let's make sure we always have a large CMA region
+ * active. This CMA region will only be used for movable pages which are not a
+ * problem for us during KHO because we can just move them somewhere else.
+ */
+static void __init kho_reserve_scratch(void)
+{
+	phys_addr_t addr, size;
+	int nid, i = 0;
+
+	if (!kho_enable)
+		return;
+
+	scratch_size_update();
+
+	/* FIXME: deal with node hot-plug/remove */
+	kho_scratch_cnt = num_online_nodes() + 2;
+	size = kho_scratch_cnt * sizeof(*kho_scratch);
+	kho_scratch = memblock_alloc(size, PAGE_SIZE);
+	if (!kho_scratch)
+		goto err_disable_kho;
+
+	/*
+	 * reserve scratch area in low memory for lowmem allocations in the
+	 * next kernel
+	 */
+	size = scratch_size_lowmem;
+	addr = memblock_phys_alloc_range(size, CMA_MIN_ALIGNMENT_BYTES, 0,
+					 ARCH_LOW_ADDRESS_LIMIT);
+	if (!addr)
+		goto err_free_scratch_desc;
+
+	kho_scratch[i].addr = addr;
+	kho_scratch[i].size = size;
+	i++;
+
+	/* reserve large contiguous area for allocations without nid */
+	size = scratch_size_global;
+	addr = memblock_phys_alloc(size, CMA_MIN_ALIGNMENT_BYTES);
+	if (!addr)
+		goto err_free_scratch_areas;
+
+	kho_scratch[i].addr = addr;
+	kho_scratch[i].size = size;
+	i++;
+
+	for_each_online_node(nid) {
+		size = scratch_size_node(nid);
+		addr = memblock_alloc_range_nid(size, CMA_MIN_ALIGNMENT_BYTES,
+						0, MEMBLOCK_ALLOC_ACCESSIBLE,
+						nid, true);
+		if (!addr)
+			goto err_free_scratch_areas;
+
+		kho_scratch[i].addr = addr;
+		kho_scratch[i].size = size;
+		i++;
+	}
+
+	return;
+
+err_free_scratch_areas:
+	for (i--; i >= 0; i--)
+		memblock_phys_free(kho_scratch[i].addr, kho_scratch[i].size);
+err_free_scratch_desc:
+	memblock_free(kho_scratch, kho_scratch_cnt * sizeof(*kho_scratch));
+err_disable_kho:
+	kho_enable = false;
+}
+
+void __init kho_memory_init(void)
+{
+	kho_reserve_scratch();
+}
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 04441c258b05..757659b7a26b 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -30,6 +30,7 @@ 
 #include <linux/crash_dump.h>
 #include <linux/execmem.h>
 #include <linux/vmstat.h>
+#include <linux/kexec_handover.h>
 #include "internal.h"
 #include "slab.h"
 #include "shuffle.h"
@@ -2661,6 +2662,13 @@  void __init mm_core_init(void)
 	report_meminit();
 	kmsan_init_shadow();
 	stack_depot_early_init();
+
+	/*
+	 * KHO memory setup must happen while memblock is still active, but
+	 * as close as possible to buddy initialization
+	 */
+	kho_memory_init();
+
 	mem_init();
 	kmem_cache_init();
 	/*