From: Andrey Ryabinin <arbn@yandex-team.com>
To: linux-kernel@vger.kernel.org
Cc: Alexander Graf, James Gowans, Mike Rapoport, Andrew Morton,
    linux-mm@kvack.org, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    Dave Hansen, x86@kernel.org, "H. Peter Anvin", Eric Biederman,
    kexec@lists.infradead.org, Pratyush Yadav, Jason Gunthorpe,
    Pasha Tatashin, David Rientjes, Andrey Ryabinin
Subject: [PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the
    kernel state across kexec
Date: Mon, 10 Mar 2025 13:03:11 +0100
Message-ID: <20250310120318.2124-1-arbn@yandex-team.com>
X-Patchwork-Id: 14009703

Main changes from v1 [1]:
 - Stop abusing crashkernel and implement a proper way to pass memory
   to the new kernel.
 - Lots of misc cleanups/refactorings.

kstate (kernel state) is a mechanism to describe some part of the
internal kernel state, save it into memory and restore it after kexec
in the new kernel.

The end goal and the main use case is to be able to update the host
kernel under VMs with VFIO pass-through devices running on that host.
Since we are still pretty far from that end goal, this series only
establishes some basic infrastructure to describe and migrate complex
in-kernel states.

The idea behind KSTATE resembles QEMU's migration framework [2], which
solves a quite similar problem: migrating the state of a VM/emulated
devices across different versions of QEMU.

This is an alternative to Kexec Hand Over (KHO [3]).

So, why not KHO?

 - The main reason is that KHO doesn't provide a simple and convenient
   internal API for drivers/subsystems to preserve internal data.
   E.g. let's consider we have some variable of type 'struct a' that
   needs to be preserved:

	struct a {
		int i;
		unsigned long *p_ulong;
		char s[10];
		struct page *page;
	};

   The KHO way requires the driver/subsystem to carry a bunch of code
   dealing with FDT stuff, something like:

	a_kho_write()
	{
		...
		fdt_property(fdt, "i", &a.i, sizeof(a.i));
		fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
		fdt_property(fdt, "s", &a.s, sizeof(a.s));
		if (err)
			...
	}

	a_kho_restore()
	{
		...
		a.i = fdt_getprop(fdt, offset, "i", &len);
		if (!a.i || len != sizeof(a.i))
			goto err;
		*a.p_ulong = fdt_getprop....
	}

   Each driver/subsystem has to solve this problem in its own way.
   Also, using FDT properties for individual fields might be wasteful
   in terms of memory, as these properties use strings as keys.

   KSTATE solves the same problem in a more elegant way, with this:

	struct kstate_description a_state = {
		.name = "a_struct",
		.version_id = 1,
		.id = KSTATE_TEST_ID,
		.state_list = LIST_HEAD_INIT(a_state.state_list),
		.fields = (const struct kstate_field[]) {
			KSTATE_BASE_TYPE(i, struct a, int),
			KSTATE_BASE_TYPE(s, struct a, char [10]),
			KSTATE_POINTER(p_ulong, struct a),
			KSTATE_PAGE(page, struct a),
			KSTATE_END_OF_LIST()
		},
	};

	{
		static unsigned long ulong;
		static struct a a_data = { .p_ulong = &ulong };

		kstate_register(&a_state, &a_data);
	}

   The driver only needs to have a proper 'kstate_description' and call
   kstate_register() to save/restore a_data.
   Basically, 'struct kstate_description' provides instructions on how
   to save/restore 'struct a', and kstate_register() does all the
   save/restore work under the hood.

 - Another bonus point: kstate can preserve migratable memory, which is
   required to preserve guest memory.

So now to the part about how this works.

The state of kernel data (usually some struct) is described by a
'struct kstate_description' containing an array of individual field
descriptions - 'struct kstate_field'. Each field has a set of bits in
->flags which instruct how to save/restore that field of the struct.
E.g.:
 - KS_BASE_TYPE means that the field can simply be copied by value.
 - KS_POINTER means that the struct member is a pointer to the actual
   data, so it needs to be dereferenced before saving/restoring data
   to/from the kstate data stream.
 - KS_STRUCT means that the field contains another struct;
   field->ksd must point to another 'struct kstate_description'.
 - KS_CUSTOM marks a non-trivial field that requires custom
   kstate_field->save()/->restore() callbacks to save/restore data.
 - KS_ARRAY_OF_POINTER marks an array of pointers; the size of the
   array is determined by the field->count() callback.
 - KS_ADDRESS means that the field is a pointer to either the vmemmap
   area (struct page) or a linear address. The offset is stored.
 - KS_END is a special flag indicating the end of the migration
   stream data.

kstate_register() accepts a kstate_description along with an instance
of an object and registers it in the global 'states' list.

During the kexec reboot phase we walk the list of
'kstate_description's, and each instance of kstate_description forms a
'struct kstate_entry' which is saved into kstate's data stream.

The 'kstate_entry' contains information such as the ID of the
kstate_description, its version, the size of the migration data and
the data itself. The ->data is formed in accordance with the
kstate_field's of the corresponding kstate_description.

After the reboot, when kstate_register() is called, it parses the
migration stream, finds the appropriate 'kstate_entry' and restores
the contents of the object in accordance with the kstate_description
and its ->fields.

[1] https://lkml.kernel.org/r/20241002160722.20025-1-arbn@yandex-team.com
[2] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate
[3] https://lkml.kernel.org/r/20250206132754.2596694-1-rppt@kernel.org

Andrey Ryabinin (7):
  kstate: Add kstate - a mechanism to describe and migrate kernel state
    across kexec
  kstate, kexec, x86: transfer kstate data across kexec
  kexec: exclude control pages from the destination addresses
  kexec, kstate: delay loading of kexec segments
  x86, kstate: Add the ability to preserve memory pages across kexec.
  kexec, kstate: save kstate data before kexec'ing
  kstate, test: add test module for testing kstate subsystem.

 arch/x86/Kconfig                  |   1 +
 arch/x86/kernel/kexec-bzimage64.c |   4 +
 arch/x86/kernel/setup.c           |   2 +
 include/linux/kexec.h             |   3 +
 include/linux/kstate.h            | 216 ++++++++++++++
 kernel/Kconfig.kexec              |  13 +
 kernel/Makefile                   |   1 +
 kernel/kexec_core.c               |  30 ++
 kernel/kexec_file.c               | 159 +++++++----
 kernel/kexec_internal.h           |   9 +
 kernel/kstate.c                   | 458 ++++++++++++++++++++++++++++++
 lib/Makefile                      |   2 +
 lib/test_kstate.c                 |  86 ++++++
 13 files changed, 925 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/kstate.h
 create mode 100644 kernel/kstate.c
 create mode 100644 lib/test_kstate.c
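
For readers who want to play with the idea outside the kernel, the
description-driven save/restore flow from the cover letter can be
modelled in user space roughly as below. This is only a sketch under
assumptions: every model_* name, the entry header layout and the strict
ID/version match are made up for illustration and are not the kstate
API from this series. 'struct a' is the example struct from the text,
minus the 'struct page' member, and only the "copy by value" and
"dereference a pointer" field kinds are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space model; none of these names are the kstate API. */
enum model_flag { MODEL_BASE_TYPE = 1, MODEL_POINTER = 2, MODEL_END = 4 };

struct model_field {
	unsigned int flags;
	size_t offset;	/* offset of the member inside the object */
	size_t size;	/* number of payload bytes to copy */
};

struct model_entry_hdr {	/* assumed layout: id, version, payload size */
	uint32_t id;
	uint32_t version;
	uint32_t size;
};

struct a {
	int i;
	unsigned long *p_ulong;
	char s[10];
};

static const struct model_field a_fields[] = {
	{ MODEL_BASE_TYPE, offsetof(struct a, i), sizeof(int) },
	{ MODEL_BASE_TYPE, offsetof(struct a, s), sizeof(char[10]) },
	{ MODEL_POINTER, offsetof(struct a, p_ulong), sizeof(unsigned long) },
	{ MODEL_END, 0, 0 },
};

/* Address of a field's payload: the member itself, or what it points to. */
static void *field_data(void *obj, const struct model_field *f)
{
	char *member = (char *)obj + f->offset;
	void *target;

	if (!(f->flags & MODEL_POINTER))
		return member;
	/* Assumes all object pointers share one representation. */
	memcpy(&target, member, sizeof(target));
	return target;
}

/* Write header + field payloads into buf; returns bytes used, 0 on error. */
static size_t model_save(uint32_t id, uint32_t version,
			 const struct model_field *f, void *obj,
			 unsigned char *buf, size_t buf_len)
{
	struct model_entry_hdr hdr = { id, version, 0 };
	size_t pos = sizeof(hdr);

	if (buf_len < sizeof(hdr))
		return 0;
	for (; !(f->flags & MODEL_END); f++) {
		if (buf_len - pos < f->size)
			return 0;
		memcpy(buf + pos, field_data(obj, f), f->size);
		pos += f->size;
	}
	hdr.size = (uint32_t)(pos - sizeof(hdr));
	memcpy(buf, &hdr, sizeof(hdr));
	return pos;
}

/* Check the header, then copy payloads back into obj; 0 on success. */
static int model_restore(uint32_t id, uint32_t version,
			 const struct model_field *f, void *obj,
			 const unsigned char *buf, size_t len)
{
	struct model_entry_hdr hdr;
	size_t pos = sizeof(hdr);

	if (len < sizeof(hdr))
		return -1;
	memcpy(&hdr, buf, sizeof(hdr));
	if (hdr.id != id || hdr.version != version ||
	    hdr.size > len - sizeof(hdr))
		return -1;
	for (; !(f->flags & MODEL_END); f++) {
		if (hdr.size - (pos - sizeof(hdr)) < f->size)
			return -1;
		memcpy(field_data(obj, f), buf + pos, f->size);
		pos += f->size;
	}
	return 0;
}

/* "Old kernel" saves old_a, "new kernel" restores into new_a. */
static int model_roundtrip(void)
{
	unsigned char stream[64];
	unsigned long old_val = 42, new_val = 0;
	struct a old_a = { .i = 7, .p_ulong = &old_val, .s = "kstate" };
	struct a new_a = { .p_ulong = &new_val };
	size_t len;

	len = model_save(1, 1, a_fields, &old_a, stream, sizeof(stream));
	assert(len == sizeof(struct model_entry_hdr) + sizeof(int) +
		      sizeof(char[10]) + sizeof(unsigned long));
	if (!len || model_restore(1, 1, a_fields, &new_a, stream, len))
		return -1;
	return (new_a.i == 7 && new_val == 42 &&
		!strcmp(new_a.s, "kstate")) ? 0 : -1;
}
```

Calling model_roundtrip() from a small main() exercises both halves.
The point of the model is that 'struct a' never needs its own
serialization code: the field table alone drives both directions, which
is the property the cover letter argues for.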