[v2,0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Message ID 20250310120318.2124-1-arbn@yandex-team.com (mailing list archive)

Message

Andrey Ryabinin March 10, 2025, 12:03 p.m. UTC
Main changes from v1 [1]:
  - Stop abusing crashkernel and implement a proper way to pass memory to the new kernel.
  - Lots of misc cleanups/refactorings.

kstate (kernel state) is a mechanism to describe some part of the internal
kernel state, save it into memory, and restore that state after kexec
in the new kernel.

The end goal and the main use case for this is to be able to
update the host kernel under VMs with VFIO pass-through devices running
on that host. Since we are still pretty far from that end goal, this series
only establishes some basic infrastructure to describe and migrate complex
in-kernel states.

The idea behind KSTATE resembles QEMU's migration framework [2], which
solves a quite similar problem: migrating the state of VM/emulated devices
across different versions of QEMU.

This is an alternative to Kexec Hand Over (KHO [3]).

So, why not KHO?

 - The main reason is that KHO doesn't provide a simple and convenient internal
    API for drivers/subsystems to preserve internal data.
    E.g. let's say we have some variable of type 'struct a'
    that needs to be preserved:
	struct a {
		int i;
		unsigned long *p_ulong;
		char s[10];
		struct page *page;
	};

     The KHO way requires the driver/subsystem to have a bunch of code
     dealing with FDT stuff, something like

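     a_kho_write()
     {
	     ...
	     /* fdt_property() returns 0 on success or a negative libfdt error */
	     err |= fdt_property(fdt, "i", &a.i, sizeof(a.i));
	     err |= fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
	     err |= fdt_property(fdt, "s", &a.s, sizeof(a.s));
	     if (err)
	     ...
     }
     a_kho_restore()
     {
	     ...
	     /* fdt_getprop() returns a pointer to the property data, or NULL */
	     const int *i = fdt_getprop(fdt, offset, "i", &len);
	     if (!i || len != sizeof(a.i))
		     goto err;
	     a.i = *i;
	     *a.p_ulong = fdt_getprop....
     }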

    Each driver/subsystem has to solve this problem in its own way.
    Also, if we use FDT properties for individual fields, that might be wasteful
    in terms of memory, as these properties use strings as keys.

   KSTATE solves the same problem in a more elegant way, with this:
	struct kstate_description a_state = {
		.name = "a_struct",
		.version_id = 1,
		.id = KSTATE_TEST_ID,
		.state_list = LIST_HEAD_INIT(a_state.state_list),
		.fields = (const struct kstate_field[]) {
			KSTATE_BASE_TYPE(i, struct a, int),
			KSTATE_BASE_TYPE(s, struct a, char [10]),
			KSTATE_POINTER(p_ulong, struct a),
			KSTATE_PAGE(page, struct a),
			KSTATE_END_OF_LIST()
		},
	};


	{
		static unsigned long ulong;
		static struct a a_data = { .p_ulong = &ulong };

		kstate_register(&a_state, &a_data);
	}

       The driver only needs a proper 'kstate_description' and a kstate_register() call
       to save/restore a_data.
       Basically, 'struct kstate_description' provides instructions on how to save/restore 'struct a',
       and kstate_register() does all the save/restore work under the hood.

 - Another bonus point: kstate can preserve migratable memory, which is required
    to preserve guest memory.


Now to how this works.

State of kernel data (usually some struct) is described by a
'struct kstate_description' containing an array of individual
field descriptions - 'struct kstate_field'. Each field
has a set of bits in ->flags which instruct how to save/restore
that field of the struct. E.g.:
  - KS_BASE_TYPE - the field can simply be copied by value.

  - KS_POINTER - the struct member is a pointer to the actual
     data, so it needs to be dereferenced before saving/restoring data
     to/from the kstate data stream.

  - KS_STRUCT - the field contains another struct; field->ksd must point to
      another 'struct kstate_description'.

  - KS_CUSTOM - some non-trivial field that requires custom kstate_field->save()
               and ->restore() callbacks to save/restore data.

  - KS_ARRAY_OF_POINTER - array of pointers; the size of the array is determined by the
                         field->count() callback.

  - KS_ADDRESS - the field is a pointer to either the vmemmap area (struct page) or
     a linear address. An offset is stored rather than the raw address.

  - KS_END - special flag indicating the end of migration stream data.
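
As an illustration of the flag-driven walk described above, here is a minimal
user-space sketch. It is not the code from this series: the 'ks_field' layout,
the flag values, and the ks_save()/ks_restore() helpers are all assumptions
made up for the example, and only the KS_BASE_TYPE and KS_POINTER cases are
modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values; the real ones are defined by the series. */
enum ks_flags {
	KS_BASE_TYPE = 1 << 0,	/* copy the member by value */
	KS_POINTER   = 1 << 1,	/* member is a pointer; copy what it points to */
	KS_END       = 1 << 2,	/* terminates the field array */
};

/* Simplified stand-in for 'struct kstate_field' (assumed layout). */
struct ks_field {
	size_t offset;		/* offset of the member inside the object */
	size_t size;		/* number of bytes to save for this field */
	unsigned int flags;
};

struct a {
	int i;
	unsigned long *p_ulong;
	char s[10];
};

static const struct ks_field a_fields[] = {
	{ offsetof(struct a, i), sizeof(int), KS_BASE_TYPE },
	{ offsetof(struct a, s), sizeof(char[10]), KS_BASE_TYPE },
	{ offsetof(struct a, p_ulong), sizeof(unsigned long), KS_POINTER },
	{ 0, 0, KS_END },
};

/* Walk the field table and append each field's bytes to buf. */
static size_t ks_save(const struct ks_field *f, const void *obj,
		      unsigned char *buf)
{
	size_t pos = 0;

	for (; !(f->flags & KS_END); f++) {
		const unsigned char *src = (const unsigned char *)obj + f->offset;

		if (f->flags & KS_POINTER) {
			/* Dereference: save the pointed-to data, not the pointer. */
			const void *target;

			memcpy(&target, src, sizeof(target));
			src = target;
		}
		memcpy(buf + pos, src, f->size);
		pos += f->size;
	}
	return pos;	/* number of bytes written */
}

/* Mirror of ks_save(): copy bytes from buf back into the object. */
static void ks_restore(const struct ks_field *f, void *obj,
		       const unsigned char *buf)
{
	size_t pos = 0;

	for (; !(f->flags & KS_END); f++) {
		unsigned char *dst = (unsigned char *)obj + f->offset;

		if (f->flags & KS_POINTER) {
			/* The new object must already point at valid storage. */
			void *target;

			memcpy(&target, dst, sizeof(target));
			dst = target;
		}
		memcpy(dst, buf + pos, f->size);
		pos += f->size;
	}
}
```

Note that in this sketch the KS_POINTER restore writes through the pointer
already present in the new object, which matches the earlier example where
a_data.p_ulong is initialized before kstate_register() is called.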

kstate_register() accepts a kstate_description along with an instance
of an object and registers it in the global 'states' list.

During the kexec reboot phase we go through the list of 'kstate_description's,
and each instance of kstate_description forms a 'struct kstate_entry'
which is saved into the kstate data stream.

The 'kstate_entry' contains information like the ID of the kstate_description, its
version, the size of the migration data and the data itself. The ->data is formed in
accordance with the kstate_fields of the corresponding kstate_description.

After the reboot, when kstate_register() is called, it parses the migration
stream, finds the appropriate 'kstate_entry' and restores the contents of
the object in accordance with kstate_description and ->fields.
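
To make the entry lookup concrete, here is a small user-space sketch of a
stream of (header, data) entries and a search by ID. The 'entry_hdr' layout
and the stream_put()/stream_find() helpers are hypothetical and only mirror
the fields named above (ID, version, size, data); the real 'struct
kstate_entry' may look different. For simplicity the sketch bounds the scan
by the stream length instead of a KS_END marker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry header (assumed layout, not taken from the series). */
struct entry_hdr {
	uint32_t id;		/* ID of the description that produced the entry */
	uint32_t version;	/* version_id of that description */
	uint32_t size;		/* number of data bytes following the header */
};

/* Append one entry at 'pos'; returns the new write position. */
static size_t stream_put(unsigned char *stream, size_t pos, uint32_t id,
			 uint32_t version, const void *data, uint32_t size)
{
	struct entry_hdr hdr = { id, version, size };

	memcpy(stream + pos, &hdr, sizeof(hdr));
	memcpy(stream + pos + sizeof(hdr), data, size);
	return pos + sizeof(hdr) + size;
}

/*
 * Scan 'len' bytes of stream for an entry with the given ID.
 * On success, fills *out with the header and returns a pointer to the data.
 * Returns NULL if no entry matches or the stream is truncated.
 */
static const unsigned char *stream_find(const unsigned char *stream, size_t len,
					uint32_t id, struct entry_hdr *out)
{
	size_t pos = 0;

	while (pos + sizeof(*out) <= len) {
		memcpy(out, stream + pos, sizeof(*out));
		pos += sizeof(*out);
		if (out->size > len - pos)
			return NULL;	/* truncated entry */
		if (out->id == id)
			return stream + pos;
		pos += out->size;	/* skip this entry's data */
	}
	return NULL;
}
```

A restore path built on this would compare the version in the header with the
version_id of the registered description before copying the data back into
the object.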

 [1] https://lkml.kernel.org/r/20241002160722.20025-1-arbn@yandex-team.com
 [2] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate
 [3] https://lkml.kernel.org/r/20250206132754.2596694-1-rppt@kernel.org

Andrey Ryabinin (7):
  kstate: Add kstate - a mechanism to describe and migrate kernel state
    across kexec
  kstate, kexec, x86: transfer kstate data across kexec
  kexec: exclude control pages from the destination addresses
  kexec, kstate: delay loading of kexec segments
  x86, kstate: Add the ability to preserve memory pages across kexec.
  kexec, kstate: save kstate data before kexec'ing
  kstate, test: add test module for testing kstate subsystem.

 arch/x86/Kconfig                  |   1 +
 arch/x86/kernel/kexec-bzimage64.c |   4 +
 arch/x86/kernel/setup.c           |   2 +
 include/linux/kexec.h             |   3 +
 include/linux/kstate.h            | 216 ++++++++++++++
 kernel/Kconfig.kexec              |  13 +
 kernel/Makefile                   |   1 +
 kernel/kexec_core.c               |  30 ++
 kernel/kexec_file.c               | 159 +++++++----
 kernel/kexec_internal.h           |   9 +
 kernel/kstate.c                   | 458 ++++++++++++++++++++++++++++++
 lib/Makefile                      |   2 +
 lib/test_kstate.c                 |  86 ++++++
 13 files changed, 925 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/kstate.h
 create mode 100644 kernel/kstate.c
 create mode 100644 lib/test_kstate.c

Comments

Cong Wang March 11, 2025, 2:27 a.m. UTC | #1
Hi Andrey,

On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
>     Each driver/subsystem has to solve this problem in their own way.
>     Also if we use fdt properties for individual fields, that might be wastefull
>     in terms of used memory, as these properties use strings as keys.
>
>    While with KSTATE solves the same problem in more elegant way, with this:
>         struct kstate_description a_state = {
>                 .name = "a_struct",
>                 .version_id = 1,
>                 .id = KSTATE_TEST_ID,
>                 .state_list = LIST_HEAD_INIT(test_state.state_list),
>                 .fields = (const struct kstate_field[]) {
>                         KSTATE_BASE_TYPE(i, struct a, int),
>                         KSTATE_BASE_TYPE(s, struct a, char [10]),
>                         KSTATE_POINTER(p_ulong, struct a),
>                         KSTATE_PAGE(page, struct a),
>                         KSTATE_END_OF_LIST()
>                 },
>         };

Hmm, this still requires manual effort to implement, so potentially
a lot of work given how many drivers we have in-tree.

And those KSTATE_* macros look a lot like BTF:
https://docs.kernel.org/bpf/btf.html

So, any possibility to reuse BTF here? Note, BTF is automatically
generated by pahole, no manual effort is required.

Regards,
Cong Wang

Andrey Ryabinin March 11, 2025, 12:19 p.m. UTC | #2
On Tue, Mar 11, 2025 at 3:28 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Hi Andrey,
>
> On Mon, Mar 10, 2025 at 5:04 AM Andrey Ryabinin <arbn@yandex-team.com> wrote:
> >     Each driver/subsystem has to solve this problem in their own way.
> >     Also if we use fdt properties for individual fields, that might be wastefull
> >     in terms of used memory, as these properties use strings as keys.
> >
> >    While with KSTATE solves the same problem in more elegant way, with this:
> >         struct kstate_description a_state = {
> >                 .name = "a_struct",
> >                 .version_id = 1,
> >                 .id = KSTATE_TEST_ID,
> >                 .state_list = LIST_HEAD_INIT(test_state.state_list),
> >                 .fields = (const struct kstate_field[]) {
> >                         KSTATE_BASE_TYPE(i, struct a, int),
> >                         KSTATE_BASE_TYPE(s, struct a, char [10]),
> >                         KSTATE_POINTER(p_ulong, struct a),
> >                         KSTATE_PAGE(page, struct a),
> >                         KSTATE_END_OF_LIST()
> >                 },
> >         };
>
> Hmm, this still requires manual efforts to implement this, so potentially
> a lot of work given how many drivers we have in-tree.
>

We are not going to make every possible driver able to persist its state.
I think the main target is the VFIO driver, which also implies PCI/IOMMU.

Besides, we'll need to persist only some fields of the struct, not the
entire thing. There is no way to automate such decisions, so there will
be some manual effort anyway.


> And those KSTATE_* stuffs look a lot similar to BTF:
> https://docs.kernel.org/bpf/btf.html
>
> So, any possibility to reuse BTF here?

Perhaps, but I don't see it right away. I'll think about it.

> Note, BTF is automatically generated by pahole, no manual effort is required.

Nothing will save us from the manual effort of deciding which parts of the
data we want to save, so there has to be some way to mark that data.
Also, the same C type may represent different kinds of data, e.g.
we may have an address of some persistent data (in the linear mapping)
stored as an 'unsigned long address'.
Because of KASLR we can't copy 'address' by value; we'll need to save
it as an offset from PAGE_OFFSET and add the PAGE_OFFSET of the new
kernel on restore.
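
The offset arithmetic can be sketched in user space as follows. The helper
names and the example base addresses are made up for illustration; in the
kernel the base would be the running kernel's PAGE_OFFSET.

```c
#include <assert.h>
#include <stdint.h>

/* Save side: turn a linear-mapping address into an offset from the old base. */
static uint64_t addr_to_offset(uint64_t addr, uint64_t old_page_offset)
{
	return addr - old_page_offset;
}

/* Restore side: rebuild the address against the new kernel's base. */
static uint64_t offset_to_addr(uint64_t offset, uint64_t new_page_offset)
{
	return new_page_offset + offset;
}
```

The saved value is independent of where either kernel placed its linear
mapping, which is why the type alone ('unsigned long') is not enough to
decide how a field must be saved.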