From patchwork Fri Mar 7 00:57:39 2025
X-Patchwork-Submitter: Pratyush Yadav
X-Patchwork-Id: 14005623
From: Pratyush Yadav
CC: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
 Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
 Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt, David Woodhouse,
 James Gowans, Mike Rapoport, Paolo Bonzini, Pasha Tatashin,
 Anthony Yznaga, Dave Hansen, David Hildenbrand, Jason Gunthorpe,
 Matthew Wilcox, Wei Yang, Andrew Morton
Subject: [RFC PATCH 5/5] mm/memfd: allow preserving FD over FDBOX + KHO
Date: Fri, 7 Mar 2025 00:57:39 +0000
Message-ID: <20250307005830.65293-6-ptyadav@amazon.de>
In-Reply-To: <20250307005830.65293-1-ptyadav@amazon.de>
References: <20250307005830.65293-1-ptyadav@amazon.de>

For applications with a large amount of memory that takes time to rebuild,
reboots to consume kernel upgrades can be very expensive.
FDBox allows preserving file descriptors over kexec using KHO. Combining
that with memfd gives those applications reboot-persistent memory that
they can use to quickly save and reconstruct that state.

While memfd is backed by either hugetlbfs or shmem, currently only
support on shmem is added for this. Allow saving and restoring shmem FDs
over FDBOX + KHO.

The memfd FDT node itself does not contain much information. It just
creates a subnode and passes it over to shmem to do its thing. Similar
behaviour is followed on the restore side.

Since there are now two paths of getting a shmem file, refactor the file
setup into its own function called memfd_setup_file(). It sets up the
file flags, mode, etc., and sets fdbox ops if enabled.

Signed-off-by: Pratyush Yadav
---
 mm/memfd.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 116 insertions(+), 12 deletions(-)

diff --git a/mm/memfd.c b/mm/memfd.c
index 37f7be57c2f50..1c32e66197f6d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -7,6 +7,8 @@
  * This file is released under the GPL.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include
 #include
 #include
@@ -19,8 +21,12 @@
 #include
 #include
 #include
+#include
+#include
 #include
 
+static const struct fdbox_file_ops memfd_fdbox_fops;
+
 /*
  * We need a tag: a new tag would expand every xa_node by 8 bytes,
  * so reuse a tag which we firmly believe is never set or cleared on tmpfs
@@ -418,21 +424,10 @@ static char *alloc_name(const char __user *uname)
 	return ERR_PTR(error);
 }
 
-static struct file *alloc_file(const char *name, unsigned int flags)
+static void memfd_setup_file(struct file *file, unsigned int flags)
 {
 	unsigned int *file_seals;
-	struct file *file;
 
-	if (flags & MFD_HUGETLB) {
-		file = hugetlb_file_setup(name, 0, VM_NORESERVE,
-					HUGETLB_ANONHUGE_INODE,
-					(flags >> MFD_HUGE_SHIFT) &
-					MFD_HUGE_MASK);
-	} else {
-		file = shmem_file_setup(name, 0, VM_NORESERVE);
-	}
-	if (IS_ERR(file))
-		return file;
 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	file->f_flags |= O_LARGEFILE;
 
@@ -452,6 +447,27 @@ static struct file *alloc_file(const char *name, unsigned int flags)
 		*file_seals &= ~F_SEAL_SEAL;
 	}
 
+#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER)
+	file->f_fdbox_op = &memfd_fdbox_fops;
+#endif
+}
+
+static struct file *alloc_file(const char *name, unsigned int flags)
+{
+	struct file *file;
+
+	if (flags & MFD_HUGETLB) {
+		file = hugetlb_file_setup(name, 0, VM_NORESERVE,
+					HUGETLB_ANONHUGE_INODE,
+					(flags >> MFD_HUGE_SHIFT) &
+					MFD_HUGE_MASK);
+	} else {
+		file = shmem_file_setup(name, 0, VM_NORESERVE);
+	}
+	if (IS_ERR(file))
+		return file;
+
+	memfd_setup_file(file, flags);
 	return file;
 }
 
@@ -493,3 +509,91 @@ SYSCALL_DEFINE2(memfd_create,
 	kfree(name);
 	return error;
 }
+
+#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER)
+static const char memfd_fdbox_compatible[] = "fdbox,memfd-v1";
+
+static struct file *memfd_fdbox_kho_recover(const void *fdt, int offset)
+{
+	struct file *file;
+	int ret, subnode;
+
+	ret = fdt_node_check_compatible(fdt, offset, memfd_fdbox_compatible);
+	if (ret) {
+		pr_err("kho: invalid compatible\n");
+		return NULL;
+	}
+
+	/* Make sure there is exactly one subnode. */
+	subnode = fdt_first_subnode(fdt, offset);
+	if (subnode < 0) {
+		pr_err("kho: no subnode for underlying storage found!\n");
+		return NULL;
+	}
+	if (fdt_next_subnode(fdt, subnode) >= 0) {
+		pr_err("kho: too many subnodes. Expected only 1.\n");
+		return NULL;
+	}
+
+	if (is_node_shmem(fdt, subnode)) {
+		file = shmem_fdbox_kho_recover(fdt, subnode);
+		if (!file)
+			return NULL;
+
+		memfd_setup_file(file, 0);
+		return file;
+	}
+
+	return NULL;
+}
+
+static int memfd_fdbox_kho_write(struct fdbox_fd *box_fd, void *fdt)
+{
+	int ret = 0;
+
+	ret |= fdt_property(fdt, "compatible", memfd_fdbox_compatible,
+			    sizeof(memfd_fdbox_compatible));
+
+	/* TODO: Track seals on the file as well. */
+
+	ret |= fdt_begin_node(fdt, "");
+	if (ret) {
+		pr_err("kho: failed to set up memfd node\n");
+		return -EINVAL;
+	}
+
+	if (shmem_file(box_fd->file))
+		ret = shmem_fdbox_kho_write(box_fd, fdt);
+	else
+		/* TODO: HugeTLB support. */
+		ret = -EOPNOTSUPP;
+
+	if (ret)
+		return ret;
+
+	ret = fdt_end_node(fdt);
+	if (ret) {
+		pr_err("kho: failed to end memfd node!\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static const struct fdbox_file_ops memfd_fdbox_fops = {
+	.kho_write = memfd_fdbox_kho_write,
+};
+
+static int __init memfd_fdbox_init(void)
+{
+	int error;
+
+	error = fdbox_register_handler(memfd_fdbox_compatible,
+				       memfd_fdbox_kho_recover);
+	if (error)
+		pr_err("Could not register fdbox handler: %d\n", error);
+
+	return 0;
+}
+late_initcall(memfd_fdbox_init);
+#endif /* CONFIG_FDBOX && CONFIG_KEXEC_HANDOVER */
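
Note for readers: the commit message describes applications saving state
into a memfd so it can be reconstructed after kexec. A minimal userspace
sketch of the "save into a shmem-backed memfd" half might look like the
following. It is only an illustration under stated assumptions: the
state_memfd_*() helper names are hypothetical and not part of this
series, and the step of placing the fd into an FDBox is left out because
that interface is not shown in this patch. MFD_HUGETLB is deliberately
not passed, since only shmem-backed memfds are handled here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallback for libcs that do not expose the flag (value from the uapi). */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif

/*
 * Hypothetical helper: create a shmem-backed memfd of `size` bytes.
 * The raw syscall is used so the sketch does not depend on the libc
 * providing a memfd_create() wrapper; the wrapper works just as well.
 * Returns the fd, or -1 on failure.
 */
static int state_memfd_create(const char *name, size_t size)
{
	int fd = (int)syscall(SYS_memfd_create, name, MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Hypothetical helper: copy `len` bytes of state into the memfd. */
static int state_memfd_save(int fd, const void *buf, size_t len)
{
	void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (map == MAP_FAILED)
		return -1;
	memcpy(map, buf, len);
	return munmap(map, len);
}

/* Hypothetical helper: read `len` bytes of state back out of the memfd. */
static int state_memfd_load(int fd, void *buf, size_t len)
{
	void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	if (map == MAP_FAILED)
		return -1;
	memcpy(buf, map, len);
	return munmap(map, len);
}
```

An application would call state_memfd_create() once, write its state
with state_memfd_save(), and, assuming the fd has been preserved and
handed back after kexec, rebuild from it with state_memfd_load(). Note
that memfd_fdbox_kho_recover() calls memfd_setup_file(file, 0) and seals
are still a TODO, so this sketch does not rely on sealing.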