From patchwork Fri Mar 7 00:57:35 2025
X-Patchwork-Submitter: Pratyush Yadav
X-Patchwork-Id: 14005618
From: Pratyush Yadav
To:
CC: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
 Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
 Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt, David Woodhouse,
 James Gowans, Mike Rapoport, Paolo Bonzini, Pasha Tatashin,
 Anthony Yznaga, Dave Hansen, David Hildenbrand, Jason Gunthorpe,
 Matthew Wilcox, Wei Yang, Andrew Morton
Subject: [RFC PATCH 1/5] misc: introduce FDBox
Date: Fri, 7 Mar 2025 00:57:35 +0000
Message-ID: <20250307005830.65293-2-ptyadav@amazon.de>
In-Reply-To: <20250307005830.65293-1-ptyadav@amazon.de>
References: <20250307005830.65293-1-ptyadav@amazon.de>

The File Descriptor Box (FDBox) is a mechanism for userspace to name
file descriptors and give them over to the kernel to hold. They can
later be retrieved by passing in the same name.

The primary purpose of FDBox is to be used with Kexec Handover (KHO).
There are many kinds of anonymous file descriptors in the kernel, like
memfd, guest_memfd, iommufd, etc., that would be useful to preserve
using KHO. To be able to do that, there needs to be a mechanism to
label FDs that allows userspace to set the label before doing KHO and
to use the label to map them back after KHO. FDBox achieves that
purpose by providing a miscdevice that exposes ioctls to label and
transfer FDs between the kernel and userspace.

FDBox is not intended to work with any generic file descriptor. Support
for each kind of FD must be explicitly enabled.

While the primary purpose of FDBox is to be used with KHO, it does not
strictly require CONFIG_KEXEC_HANDOVER, since it can be used without
KHO, simply as a way to preserve or transfer FDs when userspace exits.

Co-developed-by: Alexander Graf
Signed-off-by: Alexander Graf
Signed-off-by: Pratyush Yadav
---

Notes:
    In a real live-update environment, it would likely make more sense
    to have a way of passing a hint to the kernel that KHO is about to
    happen and that it should start preparing. Having as much state
    serialized as possible before the KHO freeze would help reduce
    downtime. An FDBox operation, say FDBOX_PREPARE_FD, could give that
    signal so an FD can prepare before actually being put in the box
    and sealed. I have not added something like that yet for
    simplicity's sake.
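    For illustration only (not part of the patch), a rough sketch of
    how userspace might drive this UAPI, assuming the header below is
    installed as <linux/fdbox.h> and the caller has CAP_SYS_ADMIN;
    error handling is omitted:

        #include <fcntl.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/fdbox.h>

        /* Stash a memfd under the name "guest-ram" in box "vm1". */
        static void stash(int memfd)
        {
                struct fdbox_create_box create = { .name = "vm1" };
                struct fdbox_put_fd put = { .fd = memfd };
                int ctl = open("/dev/fdbox/fdbox", O_RDWR);
                int box;

                ioctl(ctl, FDBOX_CREATE_BOX, &create);
                box = open("/dev/fdbox/vm1", O_RDWR);
                strcpy((char *)put.name, "guest-ram");
                ioctl(box, FDBOX_PUT_FD, &put); /* memfd is gone from this task */
                ioctl(box, FDBOX_SEAL, 0);      /* box is now ready for KHO */
        }

        /* After restart (or a kexec with KHO), get the memfd back. */
        static int restore(void)
        {
                struct fdbox_get_fd get = {};
                int box = open("/dev/fdbox/vm1", O_RDWR);

                strcpy((char *)get.name, "guest-ram");
                ioctl(box, FDBOX_UNSEAL, 0);    /* GET_FD requires an unsealed box */
                return ioctl(box, FDBOX_GET_FD, &get);  /* returns the new fd */
        }

    Note that, as implemented below, both PUT_FD and GET_FD require the
    box to be unsealed, which is why restore() unseals first.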
 MAINTAINERS                |   8 +
 drivers/misc/Kconfig       |   7 +
 drivers/misc/Makefile      |   1 +
 drivers/misc/fdbox.c       | 758 +++++++++++++++++++++++++++++++++++++
 include/linux/fdbox.h      | 119 ++++++
 include/linux/fs.h         |   7 +
 include/linux/miscdevice.h |   1 +
 include/uapi/linux/fdbox.h |  61 +++
 8 files changed, 962 insertions(+)
 create mode 100644 drivers/misc/fdbox.c
 create mode 100644 include/linux/fdbox.h
 create mode 100644 include/uapi/linux/fdbox.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 82c2ef421c000..d329d3e5514c5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8862,6 +8862,14 @@ F:	include/scsi/libfc.h
 F:	include/scsi/libfcoe.h
 F:	include/uapi/scsi/fc/

+FDBOX
+M:	Pratyush Yadav
+L:	linux-fsdevel@vger.kernel.org
+S:	Maintained
+F:	drivers/misc/fdbox.c
+F:	include/linux/fdbox.h
+F:	include/uapi/linux/fdbox.h
+
 FILE LOCKING (flock() and fcntl()/lockf())
 M:	Jeff Layton
 M:	Chuck Lever
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 56bc72c7ce4a9..6fee70c9479c4 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -632,6 +632,13 @@ config MCHP_LAN966X_PCI
 	  - lan966x-miim (MDIO_MSCC_MIIM)
 	  - lan966x-switch (LAN966X_SWITCH)

+config FDBOX
+	bool "File Descriptor Box device to persist fds"
+	help
+	  Add a new /dev/fdbox directory that allows user space to preserve
+	  specific types of file descriptors when user space exits. Also
+	  preserves the same types of file descriptors across kexec when KHO
+	  is enabled.
+
 source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
 source "drivers/misc/cb710/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 545aad06d0885..59a398dcfcd64 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -75,3 +75,4 @@ lan966x-pci-objs	:= lan966x_pci.o
 lan966x-pci-objs	+= lan966x_pci.dtbo.o
 obj-$(CONFIG_MCHP_LAN966X_PCI)	+= lan966x-pci.o
 obj-y			+= keba/
+obj-$(CONFIG_FDBOX)	+= fdbox.o
diff --git a/drivers/misc/fdbox.c b/drivers/misc/fdbox.c
new file mode 100644
index 0000000000000..a8f6574e2c25f
--- /dev/null
+++ b/drivers/misc/fdbox.c
@@ -0,0 +1,758 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fdbox.c - framework to preserve file descriptors across
+ *           process lifetime and kexec
+ *
+ * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
+ *
+ * Author: Pratyush Yadav
+ * Author: Alexander Graf
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static struct miscdevice fdbox_dev;
+
+static struct {
+	struct class *class;
+	dev_t box_devt;
+	struct xarray box_list;
+	struct xarray handlers;
+	struct rw_semaphore recover_sem;
+	bool recover_done;
+} priv = {
+	.box_list = XARRAY_INIT(fdbox.box_list, XA_FLAGS_ALLOC),
+	.handlers = XARRAY_INIT(fdbox.handlers, XA_FLAGS_ALLOC),
+	.recover_sem = __RWSEM_INITIALIZER(priv.recover_sem),
+};
+
+struct fdbox_handler {
+	const char *compatible;
+	struct file *(*fn)(const void *fdt, int offset);
+};
+
+static struct fdbox *fdbox_remove_box(char *name)
+{
+	struct xarray *boxlist = &priv.box_list;
+	unsigned long box_idx;
+	struct fdbox *box;
+
+	xa_lock(boxlist);
+	xa_for_each(boxlist, box_idx, box) {
+		if (!strcmp(box->name, name)) {
+			__xa_erase(boxlist, box_idx);
+			break;
+		}
+	}
+	xa_unlock(boxlist);
+
+	return box;
+}
+
+static struct fdbox_fd *fdbox_remove_fd(struct fdbox *box, char *name)
+{
+	struct xarray *fdlist = &box->fd_list;
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+
+	xa_lock(fdlist);
+	xa_for_each(fdlist, idx, box_fd) {
+		if (!strncmp(box_fd->name, name, sizeof(box_fd->name))) {
+			__xa_erase(fdlist, idx);
+			break;
+		}
+	}
+	xa_unlock(fdlist);
+
+	return box_fd;
+}
+
+/* Must be called with box->rwsem held. */
+static struct fdbox_fd *fdbox_put_file(struct fdbox *box, const char *name,
+				       struct file *file)
+{
+	struct fdbox_fd *box_fd __free(kfree) = NULL, *cmp;
+	struct xarray *fdlist = &box->fd_list;
+	unsigned long idx;
+	u32 newid;
+	int ret;
+
+	/* Only files that set f_fdbox_op are allowed in the box. */
+	if (!file->f_fdbox_op)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	box_fd = kzalloc(sizeof(*box_fd), GFP_KERNEL);
+	if (!box_fd)
+		return ERR_PTR(-ENOMEM);
+
+	if (strscpy_pad(box_fd->name, name, sizeof(box_fd->name)) < 0)
+		/* Name got truncated. This means the name is not NUL-terminated. */
+		return ERR_PTR(-EINVAL);
+
+	box_fd->file = file;
+	box_fd->box = box;
+
+	xa_lock(fdlist);
+	xa_for_each(fdlist, idx, cmp) {
+		/* Look for name collisions. */
+		if (!strcmp(box_fd->name, cmp->name)) {
+			xa_unlock(fdlist);
+			return ERR_PTR(-EEXIST);
+		}
+	}
+
+	ret = __xa_alloc(fdlist, &newid, box_fd, xa_limit_32b, GFP_KERNEL);
+	xa_unlock(fdlist);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return_ptr(box_fd);
+}
+
+static long fdbox_put_fd(struct fdbox *box, unsigned long arg)
+{
+	struct fdbox_put_fd put_fd;
+	struct fdbox_fd *box_fd;
+	struct file *file;
+	int ret;
+
+	if (copy_from_user(&put_fd, (void __user *)arg, sizeof(put_fd)))
+		return -EFAULT;
+
+	guard(rwsem_read)(&box->rwsem);
+
+	if (box->sealed)
+		return -EBUSY;
+
+	file = fget_raw(put_fd.fd);
+	if (!file)
+		return -EINVAL;
+
+	box_fd = fdbox_put_file(box, put_fd.name, file);
+	if (IS_ERR(box_fd)) {
+		fput(file);
+		return PTR_ERR(box_fd);
+	}
+
+	ret = close_fd(put_fd.fd);
+	if (ret) {
+		struct fdbox_fd *del;
+
+		del = fdbox_remove_fd(box, put_fd.name);
+		/*
+		 * If we fail to remove from list, it means someone else took
+		 * the FD out. In that case, they own the refcount of the file
+		 * now.
+		 */
+		if (del == box_fd)
+			fput(file);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+static long fdbox_seal(struct fdbox *box)
+{
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+	int ret;
+
+	guard(rwsem_write)(&box->rwsem);
+
+	if (box->sealed)
+		return -EBUSY;
+
+	xa_for_each(&box->fd_list, idx, box_fd) {
+		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
+
+		if (fdbox_ops && fdbox_ops->seal) {
+			ret = fdbox_ops->seal(box);
+			if (ret)
+				return ret;
+		}
+	}
+
+	box->sealed = true;
+
+	return 0;
+}
+
+static long fdbox_unseal(struct fdbox *box)
+{
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+	int ret;
+
+	guard(rwsem_write)(&box->rwsem);
+
+	if (!box->sealed)
+		return -EBUSY;
+
+	xa_for_each(&box->fd_list, idx, box_fd) {
+		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
+
+		if (fdbox_ops && fdbox_ops->unseal) {
+			ret = fdbox_ops->unseal(box);
+			if (ret)
+				return ret;
+		}
+	}
+
+	box->sealed = false;
+
+	return 0;
+}
+
+static long fdbox_get_fd(struct fdbox *box, unsigned long arg)
+{
+	struct fdbox_get_fd get_fd;
+	struct fdbox_fd *box_fd;
+	int fd;
+
+	guard(rwsem_read)(&box->rwsem);
+
+	if (box->sealed)
+		return -EBUSY;
+
+	if (copy_from_user(&get_fd, (void __user *)arg, sizeof(get_fd)))
+		return -EFAULT;
+
+	if (get_fd.flags)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0)
+		return fd;
+
+	box_fd = fdbox_remove_fd(box, get_fd.name);
+	if (!box_fd) {
+		put_unused_fd(fd);
+		return -ENOENT;
+	}
+
+	fd_install(fd, box_fd->file);
+	kfree(box_fd);
+	return fd;
+}
+
+static long box_fops_unl_ioctl(struct file *filep,
+			       unsigned int cmd, unsigned long arg)
+{
+	struct fdbox *box = filep->private_data;
+	long ret = -EINVAL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	switch (cmd) {
+	case FDBOX_PUT_FD:
+		ret = fdbox_put_fd(box, arg);
+		break;
+	case FDBOX_UNSEAL:
+		ret = fdbox_unseal(box);
+		break;
+	case FDBOX_SEAL:
+		ret = fdbox_seal(box);
+		break;
+	case FDBOX_GET_FD:
+		ret = fdbox_get_fd(box, arg);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int box_fops_open(struct inode *inode, struct file *filep)
+{
+	struct fdbox *box = container_of(inode->i_cdev, struct fdbox, cdev);
+
+	filep->private_data = box;
+
+	return 0;
+}
+
+static const struct file_operations box_fops = {
+	.owner = THIS_MODULE,
+	.unlocked_ioctl = box_fops_unl_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
+	.open = box_fops_open,
+};
+
+static void fdbox_device_release(struct device *dev)
+{
+	struct fdbox *box = container_of(dev, struct fdbox, dev);
+	struct xarray *fdlist = &box->fd_list;
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+
+	unregister_chrdev_region(box->dev.devt, 1);
+
+	xa_for_each(fdlist, idx, box_fd) {
+		xa_erase(fdlist, idx);
+		fput(box_fd->file);
+		kfree(box_fd);
+	}
+
+	xa_destroy(fdlist);
+	kfree(box);
+}
+
+static struct fdbox *_fdbox_create_box(const char *name)
+{
+	struct fdbox *box;
+	int ret = 0;
+	u32 id;
+
+	box = kzalloc(sizeof(*box), GFP_KERNEL);
+	if (!box)
+		return ERR_PTR(-ENOMEM);
+
+	xa_init_flags(&box->fd_list, XA_FLAGS_ALLOC);
+	xa_init_flags(&box->pending_fds, XA_FLAGS_ALLOC);
+	init_rwsem(&box->rwsem);
+
+	if (strscpy_pad(box->name, name, sizeof(box->name)) < 0) {
+		/* Name got truncated. This means the name is not NUL-terminated. */
+		kfree(box);
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev_set_name(&box->dev, "fdbox/%s", name);
+
+	ret = alloc_chrdev_region(&box->dev.devt, 0, 1, name);
+	if (ret) {
+		kfree(box);
+		return ERR_PTR(ret);
+	}
+
+	box->dev.release = fdbox_device_release;
+	device_initialize(&box->dev);
+
+	cdev_init(&box->cdev, &box_fops);
+	box->cdev.owner = THIS_MODULE;
+	kobject_set_name(&box->cdev.kobj, "fdbox/%s", name);
+
+	ret = cdev_device_add(&box->cdev, &box->dev);
+	if (ret)
+		goto err_dev;
+
+	ret = xa_alloc(&priv.box_list, &id, box, xa_limit_32b, GFP_KERNEL);
+	if (ret)
+		goto err_cdev;
+
+	return box;
+
+err_cdev:
+	cdev_device_del(&box->cdev, &box->dev);
+err_dev:
+	/*
+	 * This should free the box and chrdev region via
+	 * fdbox_device_release().
+	 */
+	put_device(&box->dev);
+
+	return ERR_PTR(ret);
+}
+
+static long fdbox_create_box(unsigned long arg)
+{
+	struct fdbox_create_box create_box;
+
+	if (copy_from_user(&create_box, (void __user *)arg, sizeof(create_box)))
+		return -EFAULT;
+
+	if (create_box.flags)
+		return -EINVAL;
+
+	return PTR_ERR_OR_ZERO(_fdbox_create_box(create_box.name));
+}
+
+static void _fdbox_delete_box(struct fdbox *box)
+{
+	cdev_device_del(&box->cdev, &box->dev);
+	unregister_chrdev_region(box->dev.devt, 1);
+	put_device(&box->dev);
+}
+
+static long fdbox_delete_box(unsigned long arg)
+{
+	struct fdbox_delete_box delete_box;
+	struct fdbox *box;
+
+	if (copy_from_user(&delete_box, (void __user *)arg, sizeof(delete_box)))
+		return -EFAULT;
+
+	if (delete_box.flags)
+		return -EINVAL;
+
+	box = fdbox_remove_box(delete_box.name);
+	if (!box)
+		return -ENOENT;
+
+	_fdbox_delete_box(box);
+	return 0;
+}
+
+static long fdbox_fops_unl_ioctl(struct file *filep,
+				 unsigned int cmd, unsigned long arg)
+{
+	long ret = -EINVAL;
+
+	switch (cmd) {
+	case FDBOX_CREATE_BOX:
+		ret = fdbox_create_box(arg);
+		break;
+	case FDBOX_DELETE_BOX:
+		ret = fdbox_delete_box(arg);
+		break;
+	}
+
+	return ret;
+}
+
+static const struct file_operations fdbox_fops = {
+	.owner = THIS_MODULE,
+	.unlocked_ioctl = fdbox_fops_unl_ioctl,
+	.compat_ioctl = compat_ptr_ioctl,
+};
+
+static struct miscdevice fdbox_dev = {
+	.minor = FDBOX_MINOR,
+	.name = "fdbox",
+	.fops = &fdbox_fops,
+	.nodename = "fdbox/fdbox",
+	.mode = 0600,
+};
+
+static char *fdbox_devnode(const struct device *dev, umode_t *mode)
+{
+	char *ret = kasprintf(GFP_KERNEL, "fdbox/%s", dev_name(dev));
+	return ret;
+}
+
+static int fdbox_kho_write_fds(void *fdt, struct fdbox *box)
+{
+	struct fdbox_fd *box_fd;
+	struct file *file;
+	unsigned long idx;
+	int err = 0;
+
+	xa_for_each(&box->fd_list, idx, box_fd) {
+		file = box_fd->file;
+
+		if (!file->f_fdbox_op->kho_write) {
+			pr_info("box '%s' FD '%s' has no KHO method. It won't be saved across kexec\n",
+				box->name, box_fd->name);
+			continue;
+		}
+
+		err = fdt_begin_node(fdt, box_fd->name);
+		if (err) {
+			pr_err("failed to begin node for box '%s' FD '%s'\n",
+			       box->name, box_fd->name);
+			return err;
+		}
+
+		inode_lock(file_inode(file));
+		err = file->f_fdbox_op->kho_write(box_fd, fdt);
+		inode_unlock(file_inode(file));
+		if (err) {
+			pr_err("kho_write failed for box '%s' FD '%s': %d\n",
+			       box->name, box_fd->name, err);
+			return err;
+		}
+
+		err = fdt_end_node(fdt);
+		if (err) {
+			/* TODO: This leaks all pages reserved by kho_write(). */
*/ + pr_err("failed to end node for box '%s' FD '%s'\n", + box->name, box_fd->name); + return err; + } + } + + return err; +} + +static int fdbox_kho_write_boxes(void *fdt) +{ + static const char compatible[] = "fdbox,box-v1"; + struct fdbox *box; + unsigned long idx; + int err = 0; + + xa_for_each(&priv.box_list, idx, box) { + if (!box->sealed) + continue; + + err |= fdt_begin_node(fdt, box->name); + err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible)); + err |= fdbox_kho_write_fds(fdt, box); + err |= fdt_end_node(fdt); + } + + return err; +} + +static int fdbox_kho_notifier(struct notifier_block *self, + unsigned long cmd, + void *v) +{ + static const char compatible[] = "fdbox-v1"; + void *fdt = v; + int err = 0; + + switch (cmd) { + case KEXEC_KHO_ABORT: + return NOTIFY_DONE; + case KEXEC_KHO_DUMP: + /* Handled below */ + break; + default: + return NOTIFY_BAD; + } + + err |= fdt_begin_node(fdt, "fdbox"); + err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible)); + err |= fdbox_kho_write_boxes(fdt); + err |= fdt_end_node(fdt); + + return err ? NOTIFY_BAD : NOTIFY_DONE; +} + +static struct notifier_block fdbox_kho_nb = { + .notifier_call = fdbox_kho_notifier, +}; + +static void fdbox_recover_fd(const void *fdt, int offset, struct fdbox *box, + struct file *(*fn)(const void *fdt, int offset)) +{ + struct fdbox_fd *box_fd; + struct file *file; + const char *name; + + name = fdt_get_name(fdt, offset, NULL); + if (!name) { + pr_err("no name in FDT for FD at offset %d\n", offset); + return; + } + + file = fn(fdt, offset); + if (!file) + return; + + scoped_guard(rwsem_read, &box->rwsem) { + box_fd = fdbox_put_file(box, name, file); + if (IS_ERR(box_fd)) { + pr_err("failed to put fd '%s' into box '%s': %ld\n", + box->name, name, PTR_ERR(box_fd)); + fput(file); + return; + } + } +} + +static void fdbox_kho_recover(void) +{ + const void *fdt = kho_get_fdt(); + const char *path = "/fdbox"; + int off, box, fd; + int err; + + /* Not a KHO boot */ + if (!fdt) + return; + + /* + * When adding handlers this is taken as read. Taking it as write here + * ensures no handlers get added while nodes are being processed, + * eliminating the race of a handler getting added after its node is + * processed, but before the whole recover is done. + */ + guard(rwsem_write)(&priv.recover_sem); + + off = fdt_path_offset(fdt, path); + if (off < 0) { + pr_debug("could not find '%s' in DT", path); + return; + } + + err = fdt_node_check_compatible(fdt, off, "fdbox-v1"); + if (err) { + pr_err("invalid top level compatible\n"); + return; + } + + fdt_for_each_subnode(box, fdt, off) { + struct fdbox *new_box; + + err = fdt_node_check_compatible(fdt, box, "fdbox,box-v1"); + if (err) { + pr_err("invalid compatible for box '%s'\n", + fdt_get_name(fdt, box, NULL)); + continue; + } + + new_box = _fdbox_create_box(fdt_get_name(fdt, box, NULL)); + if (IS_ERR(new_box)) { + pr_warn("could not create box '%s'\n", + fdt_get_name(fdt, box, NULL)); + continue; + } + + fdt_for_each_subnode(fd, fdt, box) { + struct fdbox_handler *handler; + const char *compatible; + unsigned long idx; + + compatible = fdt_getprop(fdt, fd, "compatible", NULL); + if (!compatible) { + pr_warn("failed to get compatible for FD '%s'. 
+					fdt_get_name(fdt, fd, NULL));
+				continue;
+			}
+
+			xa_for_each(&priv.handlers, idx, handler) {
+				if (!strcmp(handler->compatible, compatible))
+					break;
+			}
+
+			if (handler) {
+				fdbox_recover_fd(fdt, fd, new_box, handler->fn);
+			} else {
+				u32 id;
+
+				pr_debug("found no handler for compatible %s. Queueing for later.\n",
+					 compatible);
+
+				if (xa_alloc(&new_box->pending_fds, &id,
+					     xa_mk_value(fd), xa_limit_32b,
+					     GFP_KERNEL)) {
+					pr_warn("failed to queue pending FD '%s' to list\n",
+						fdt_get_name(fdt, fd, NULL));
+				}
+			}
+		}
+
+		new_box->sealed = true;
+	}
+
+	priv.recover_done = true;
+}
+
+static void fdbox_recover_pending(struct fdbox_handler *handler)
+{
+	const void *fdt = kho_get_fdt();
+	unsigned long bid, pid;
+	struct fdbox *box;
+	void *pending;
+
+	if (WARN_ON(!fdt))
+		return;
+
+	xa_for_each(&priv.box_list, bid, box) {
+		xa_for_each(&box->pending_fds, pid, pending) {
+			int off = xa_to_value(pending);
+
+			if (fdt_node_check_compatible(fdt, off, handler->compatible) == 0) {
+				fdbox_recover_fd(fdt, off, box, handler->fn);
+				xa_erase(&box->pending_fds, pid);
+			}
+		}
+	}
+}
+
+int fdbox_register_handler(const char *compatible,
+			   struct file *(*fn)(const void *fdt, int offset))
+{
+	struct xarray *handlers = &priv.handlers;
+	struct fdbox_handler *handler, *cmp;
+	unsigned long idx;
+	int ret;
+	u32 id;
+
+	/* See comment in fdbox_kho_recover(). */
+	guard(rwsem_read)(&priv.recover_sem);
+
+	handler = kmalloc(sizeof(*handler), GFP_KERNEL);
+	if (!handler)
+		return -ENOMEM;
+
+	handler->compatible = compatible;
+	handler->fn = fn;
+
+	xa_lock(handlers);
+	xa_for_each(handlers, idx, cmp) {
+		if (!strcmp(cmp->compatible, compatible)) {
+			xa_unlock(handlers);
+			kfree(handler);
+			return -EEXIST;
+		}
+	}
+
+	ret = __xa_alloc(handlers, &id, handler, xa_limit_32b, GFP_KERNEL);
+	xa_unlock(handlers);
+	if (ret) {
+		kfree(handler);
+		return ret;
+	}
+
+	if (priv.recover_done)
+		fdbox_recover_pending(handler);
+
+	return 0;
+}
+
+static int __init fdbox_init(void)
+{
+	int ret = 0;
+
+	/* /dev/fdbox/$NAME */
+	priv.class = class_create("fdbox");
+	if (IS_ERR(priv.class))
+		return PTR_ERR(priv.class);
+
+	priv.class->devnode = fdbox_devnode;
+
+	ret = alloc_chrdev_region(&priv.box_devt, 0, 1, "fdbox");
+	if (ret)
+		goto err_class;
+
+	ret = misc_register(&fdbox_dev);
+	if (ret) {
+		pr_err("fdbox: misc device register failed\n");
+		goto err_chrdev;
+	}
+
+	if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
+		register_kho_notifier(&fdbox_kho_nb);
+		fdbox_kho_recover();
+	}
+
+	return 0;
+
+err_chrdev:
+	unregister_chrdev_region(priv.box_devt, 1);
+	priv.box_devt = 0;
+err_class:
+	class_destroy(priv.class);
+	priv.class = NULL;
+	return ret;
+}
+module_init(fdbox_init);
diff --git a/include/linux/fdbox.h b/include/linux/fdbox.h
new file mode 100644
index 0000000000000..0bc18742940f5
--- /dev/null
+++ b/include/linux/fdbox.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
+ *
+ * Author: Pratyush Yadav
+ * Author: Alexander Graf
+ */
+#ifndef _LINUX_FDBOX_H
+#define _LINUX_FDBOX_H
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/**
+ * struct fdbox - A box of FDs.
+ * @name: Name of the box. Must be unique.
+ * @rwsem: Used to ensure exclusive access to the box during SEAL/UNSEAL
+ *         operations.
+ * @dev: Backing device for the character device.
+ * @cdev: Character device which accepts ioctls from userspace.
+ * @fd_list: List of FDs in the box.
+ * @pending_fds: FDs from the KHO FDT that still await a registered handler.
+ * @sealed: Whether the box is sealed or not.
+ */
+struct fdbox {
+	char name[FDBOX_NAME_LEN];
+	/*
+	 * Taken as read when non-exclusive access is needed and the box can be
+	 * in mutable state. For example, the GET_FD and PUT_FD operations use
+	 * it as read when adding or removing FDs from the box.
+	 *
+	 * Taken as write when exclusive access is needed and the box should be
+	 * in a stable, non-mutable state. For example, the SEAL and UNSEAL
+	 * operations use it as write because they need the list of FDs to be
+	 * stable.
+	 */
+	struct rw_semaphore rwsem;
+	struct device dev;
+	struct cdev cdev;
+	struct xarray fd_list;
+	struct xarray pending_fds;
+	bool sealed;
+};
+
+/**
+ * struct fdbox_fd - An FD in a box.
+ * @name: Name of the FD. Must be unique in the box.
+ * @file: Underlying file for the FD.
+ * @flags: Box flags. Currently, no flags are allowed.
+ * @box: The box to which this FD belongs.
+ */
+struct fdbox_fd {
+	char name[FDBOX_NAME_LEN];
+	struct file *file;
+	int flags;
+	struct fdbox *box;
+};
+
+/**
+ * struct fdbox_file_ops - operations for files that can be put into a fdbox.
+ */
+struct fdbox_file_ops {
+	/**
+	 * @kho_write: write fd to KHO FDT.
+	 *
+	 * box_fd: Box FD to be serialized.
+	 *
+	 * fdt: KHO FDT
+	 *
+	 * This is called during KHO activation phase to serialize all data
+	 * needed for a FD to be preserved across a KHO.
+	 *
+	 * Returns: 0 on success, -errno on failure. Error here causes KHO
+	 * activation failure.
+	 */
+	int (*kho_write)(struct fdbox_fd *box_fd, void *fdt);
+	/**
+	 * @seal: seal the box
+	 *
+	 * box: Box which is going to be sealed.
+	 *
+	 * This can be set if a file has a dependency on other files. At seal
+	 * time, all the FDs in the box can be inspected to ensure all the
+	 * dependencies are met.
+	 */
+	int (*seal)(struct fdbox *box);
+	/**
+	 * @unseal: unseal the box
+	 *
+	 * box: Box which is going to be unsealed.
+	 *
+	 * The opposite of seal. This can be set if a file has a dependency on
+	 * other files. At unseal time, all the FDs in the box can be inspected
+	 * to ensure all the dependencies are met. This can help ensure all
+	 * necessary FDs made it through after a KHO, for example.
+	 */
+	int (*unseal)(struct fdbox *box);
+};
+
+/**
+ * fdbox_register_handler - register a handler for recovering Box FDs after KHO.
+ * @compatible: compatible string in the KHO FDT node.
+ * @handler: function to parse the FDT at offset 'offset'.
+ *
+ * After KHO, the FDs in the KHO FDT must be deserialized by the underlying
+ * modules or file systems. Since module initialization can be in any order,
+ * including after FDBox has been initialized, handler registration allows
+ * modules to queue their parsing functions, and FDBox will execute them when
+ * it can.
+ *
+ * Returns: 0 on success, -errno otherwise.
+ */
+int fdbox_register_handler(const char *compatible,
+			   struct file *(*handler)(const void *fdt, int offset));
+#endif /* _LINUX_FDBOX_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index be3ad155ec9f7..7d710a5e09b5b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -81,6 +81,9 @@ struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
 struct iomap_ops;
+struct fdbox;
+struct fdbox_fd;
+struct fdbox_file_ops;

 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1078,6 +1081,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
  * @f_llist: work queue entrypoint
  * @f_ra: file's readahead state
  * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
+ * @f_fdbox_op: FDBOX operations
  */
 struct file {
 	file_ref_t			f_ref;
@@ -1116,6 +1120,9 @@ struct file {
 		freeptr_t		f_freeptr;
 	};
 	/* --- cacheline 3 boundary (192 bytes) --- */
+#ifdef CONFIG_FDBOX
+	const struct fdbox_file_ops	*f_fdbox_op;
+#endif
 } __randomize_layout
   __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 69e110c2b86a9..fedb873c04453 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -71,6 +71,7 @@
 #define USERIO_MINOR		240
 #define VHOST_VSOCK_MINOR	241
 #define RFKILL_MINOR		242
+#define FDBOX_MINOR		243
 #define MISC_DYNAMIC_MINOR	255

 struct device;
diff --git a/include/uapi/linux/fdbox.h b/include/uapi/linux/fdbox.h
new file mode 100644
index 0000000000000..577ba33b908fd
--- /dev/null
+++ b/include/uapi/linux/fdbox.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This file contains definitions and structures for fdbox ioctls.
+ *
+ * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
+ *
+ * Author: Pratyush Yadav
+ * Author: Alexander Graf
+ */
+#ifndef _UAPI_LINUX_FDBOX_H
+#define _UAPI_LINUX_FDBOX_H
+
+#include
+#include
+
+#define FDBOX_NAME_LEN 256
+
+#define FDBOX_TYPE ('.')
+#define FDBOX_BASE 0
+
+/* Ioctls on /dev/fdbox/fdbox */
+
+/* Create a box. */
+#define FDBOX_CREATE_BOX _IO(FDBOX_TYPE, FDBOX_BASE + 0)
+struct fdbox_create_box {
+	__u64 flags;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Delete a box. */
+#define FDBOX_DELETE_BOX _IO(FDBOX_TYPE, FDBOX_BASE + 1)
+struct fdbox_delete_box {
+	__u64 flags;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Ioctls on /dev/fdbox/$BOXNAME */
+
+/* Put FD into box. This unmaps the FD from the calling process. */
+#define FDBOX_PUT_FD _IO(FDBOX_TYPE, FDBOX_BASE + 2)
+struct fdbox_put_fd {
+	__u64 flags;
+	__u32 fd;
+	__u32 pad;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Get the FD from box. This maps the FD into the calling process. */
+#define FDBOX_GET_FD _IO(FDBOX_TYPE, FDBOX_BASE + 3)
+struct fdbox_get_fd {
+	__u64 flags;
+	__u32 pad;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Seal the box. After this, no FDs can be put in or taken out of the box. */
+#define FDBOX_SEAL _IO(FDBOX_TYPE, FDBOX_BASE + 4)
+/* Unseal the box. Opposite of seal. */
+#define FDBOX_UNSEAL _IO(FDBOX_TYPE, FDBOX_BASE + 5)
+
+#endif /* _UAPI_LINUX_FDBOX_H */

From patchwork Fri Mar 7 00:57:36 2025
X-Patchwork-Submitter: Pratyush Yadav
X-Patchwork-Id: 14005616
From: Pratyush Yadav
Subject: [RFC PATCH 2/5] misc: add documentation for FDBox
Date: Fri, 7 Mar 2025 00:57:36 +0000
Message-ID: <20250307005830.65293-3-ptyadav@amazon.de>
In-Reply-To: <20250307005830.65293-1-ptyadav@amazon.de>
References: <20250307005830.65293-1-ptyadav@amazon.de>

With FDBox in place, add documentation that describes what it is and
how it is used, along with its UAPI and in-kernel API. Since the
document refers to KHO, add a reference tag in kho/index.rst.

Signed-off-by: Pratyush Yadav
---
 Documentation/filesystems/locking.rst |  21 +++
 Documentation/kho/fdbox.rst           | 224 ++++++++++++++++++++++++++
 Documentation/kho/index.rst           |   3 +
 MAINTAINERS                           |   1 +
 4 files changed, 249 insertions(+)
 create mode 100644 Documentation/kho/fdbox.rst

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d20a32b77b60f..5526833faf79a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -607,6 +607,27 @@ used. To block changes to file contents via a memory mapping during the
 operation, the filesystem must take mapping->invalidate_lock to coordinate
 with ->page_mkwrite.

+fdbox_file_ops
+==============
+
+prototypes::
+
+	int (*kho_write)(struct fdbox_fd *box_fd, void *fdt);
+	int (*seal)(struct fdbox *box);
+	int (*unseal)(struct fdbox *box);
+
+locking rules:
+	all may block
+
+==============	==================================================
+ops		i_rwsem(box_fd->file->f_inode)
+==============	==================================================
+kho_write:	exclusive
+seal:		no
+unseal:		no
+==============	==================================================
+
 dquot_operations
 ================
diff --git a/Documentation/kho/fdbox.rst b/Documentation/kho/fdbox.rst
new file mode 100644
index 0000000000000..44a3f5cdf1efb
--- /dev/null
+++ b/Documentation/kho/fdbox.rst
@@ -0,0 +1,224 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+===========================
+File Descriptor Box (FDBox)
+===========================
+
+:Author: Pratyush Yadav
+
+Introduction
+============
+
+The File Descriptor Box (FDBox) is a mechanism for userspace to name file
+descriptors and give them over to the kernel to hold. They can later be
+retrieved by passing in the same name.
+
+The primary purpose of FDBox is to be used with :ref:`kho`. There are many
+kinds of anonymous file descriptors in the kernel, like memfd, guest_memfd,
+iommufd, etc., that would be useful to preserve using KHO. To be able to do
+that, there needs to be a mechanism to label FDs that allows userspace to set
+the label before doing KHO and to use the label to map them back after KHO.
+FDBox achieves that purpose by providing a miscdevice that exposes ioctls to
+label and transfer FDs between the kernel and userspace. FDBox is not
+intended to work with any generic file descriptor. Support for each kind of
+FD must be explicitly enabled.
+
+FDBox can be enabled by setting the ``CONFIG_FDBOX`` option to ``y``. While
+the primary purpose of FDBox is to be used with KHO, it does not strictly
+require ``CONFIG_KEXEC_HANDOVER``, since it can be used without KHO, simply
+as a way to preserve or transfer FDs when userspace exits.
+
+Concepts
+========
+
+Box
+---
+
+The box is a container for FDs. Boxes are identified by their name, which
+must be unique. Userspace can put FDs in the box using the ``FDBOX_PUT_FD``
+operation, and take them out of the box using the ``FDBOX_GET_FD`` operation.
+Once all the required FDs are put into the box, it can be sealed to make it
+ready for shipping. This can be done by the ``FDBOX_SEAL`` operation. The
+seal operation notifies each FD in the box. If any of the FDs have a
+dependency on another, this gives them an opportunity to ensure all
+dependencies are met, or to fail the seal if not. Once a box is sealed, no
+FDs can be added or removed from the box until it is unsealed. Only sealed
+boxes are transported to a new kernel via KHO. The box can be unsealed by the
+``FDBOX_UNSEAL`` operation. This is the opposite of seal. It also notifies
+each FD in the box to ensure all dependencies are met. This can be useful in
+case some FDs fail to be restored after KHO.
+
+Box FD
+------
+
+The Box FD is an FD that is currently in a box. It is identified by its name,
+which must be unique in the box it belongs to. The Box FD is created when an
+FD is put into a box by using the ``FDBOX_PUT_FD`` operation. This operation
+removes the FD from the calling task. The FD can be restored by passing the
+unique name to the ``FDBOX_GET_FD`` operation.
+
+FDBox control device
+--------------------
+
+This is the ``/dev/fdbox/fdbox`` device. A box can be created using the
+``FDBOX_CREATE_BOX`` operation on the device. A box can be removed using the
+``FDBOX_DELETE_BOX`` operation.
+
+UAPI
+====
+
+FDBOX_NAME_LEN
+--------------
+
+.. code-block:: c
+
+    #define FDBOX_NAME_LEN 256
+
+Maximum length of the name of a Box or Box FD.
+
+Ioctls on /dev/fdbox/fdbox
+--------------------------
+
+FDBOX_CREATE_BOX
+~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_CREATE_BOX _IO(FDBOX_TYPE, FDBOX_BASE + 0)
+    struct fdbox_create_box {
+        __u64 flags;
+        __u8 name[FDBOX_NAME_LEN];
+    };
+
+Create a box.
+
+After this returns, the box is available at ``/dev/fdbox/<name>``.
+
+``name``
+    The name of the box to be created. Must be unique.
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+FDBOX_DELETE_BOX
+~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_DELETE_BOX _IO(FDBOX_TYPE, FDBOX_BASE + 1)
+    struct fdbox_delete_box {
+        __u64 flags;
+        __u8 name[FDBOX_NAME_LEN];
+    };
+
+Delete a box.
+
+After this returns, the box is no longer available at ``/dev/fdbox/<name>``.
+
+``name``
+    The name of the box to be deleted.
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+Ioctls on /dev/fdbox/<boxname>
+------------------------------
+
+These must be performed on the ``/dev/fdbox/<boxname>`` device.
+
+FDBOX_PUT_FD
+~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_PUT_FD _IO(FDBOX_TYPE, FDBOX_BASE + 2)
+    struct fdbox_put_fd {
+        __u64 flags;
+        __u32 fd;
+        __u32 pad;
+        __u8 name[FDBOX_NAME_LEN];
+    };
+
+Put an FD into the box.
+
+After this returns, ``fd`` is removed from the task and can no longer be used
+by it.
+
+``name``
+    The name of the FD.
+
+``fd``
+    The file descriptor number to be put into the box.
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+FDBOX_GET_FD
+~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_GET_FD _IO(FDBOX_TYPE, FDBOX_BASE + 3)
+    struct fdbox_get_fd {
+        __u64 flags;
+        __u8 name[FDBOX_NAME_LEN];
+    };
+
+Get an FD from the box.
+
+After this returns, the FD identified by ``name`` is mapped into the task and
+is available for use.
+
+``name``
+    The name of the FD to get.
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    FD number on success, -1 on error, with errno set.
+
+FDBOX_SEAL
+~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_SEAL _IO(FDBOX_TYPE, FDBOX_BASE + 4)
+
+Seal the box.
+
+Gives the kernel an opportunity to ensure all dependencies are met in the
+box. After this returns, the box is sealed and FDs can no longer be added or
+removed from it. A box must be sealed for it to be transported across KHO.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+FDBOX_UNSEAL
+~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_UNSEAL _IO(FDBOX_TYPE, FDBOX_BASE + 5)
+
+Unseal the box.
+
+Gives the kernel an opportunity to ensure all dependencies are met in the
+box, and in case of KHO, that no FDs have been lost in transit.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+Kernel functions and structures
+===============================
+
+.. kernel-doc:: include/linux/fdbox.h
diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
index 5e7eeeca8520f..051513b956075 100644
--- a/Documentation/kho/index.rst
+++ b/Documentation/kho/index.rst
@@ -1,5 +1,7 @@
 .. SPDX-License-Identifier: GPL-2.0-or-later

+.. _kho:
+
 ========================
 Kexec Handover Subsystem
 ========================
@@ -9,6 +11,7 @@ Kexec Handover Subsystem

    concepts
    usage
+   fdbox

 .. only:: subproject and html
diff --git a/MAINTAINERS b/MAINTAINERS
index d329d3e5514c5..135427582e60f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8866,6 +8866,7 @@ FDBOX
 M:	Pratyush Yadav
 L:	linux-fsdevel@vger.kernel.org
 S:	Maintained
+F:	Documentation/kho/fdbox.rst
 F:	drivers/misc/fdbox.c
 F:	include/linux/fdbox.h
 F:	include/uapi/linux/fdbox.h

From patchwork Fri Mar 7 00:57:37 2025
X-Patchwork-Submitter: Pratyush Yadav
X-Patchwork-Id: 14005619
From: Pratyush Yadav
Subject: [RFC PATCH 3/5] mm: shmem: allow callers to specify operations to
 shmem_undo_range
Date: Fri, 7 Mar 2025 00:57:37 +0000
Message-ID: <20250307005830.65293-4-ptyadav@amazon.de>
In-Reply-To: <20250307005830.65293-1-ptyadav@amazon.de>
References: <20250307005830.65293-1-ptyadav@amazon.de>

In a following patch, support for preserving a shmem file over kexec
handover (KHO) will be added. When a shmem file is to be preserved over
KHO, its pages must be removed from the inode's page cache and kept
reserved. That work is very similar to what shmem_undo_range() does.
The only extra thing that needs to be done is to track the PFN and
index of each page, and to take an extra refcount on the page to make
sure it does not get freed.

Refactor shmem_undo_range() to accept the ops it should execute for
each folio, along with a cookie to pass along. During undo, three
distinct kinds of operations are made: truncate a folio, truncate a
partial folio, and truncate a folio in swap. Add a callback for each of
the operations. Add shmem_default_undo_ops that maintains the old
behaviour, and make callers use it.

Since the ops for KHO might fail (needing to allocate memory, or being
unable to bring a page back from swap, for example), there needs to be
a way for them to report errors and stop the undo. Because of this, the
function now returns an int instead of void. This has the unfortunate
side effect of implying the function can fail, though during normal
usage it should never fail. Add some WARNs to ensure that if that
assumption ever changes, it gets caught.

Signed-off-by: Pratyush Yadav
---

Notes:
    I did it this way since it seemed to duplicate the least amount of
    code. The undo logic is fairly complicated, and I was not too keen
    on replicating it elsewhere. On thinking about this again, I am not
    so sure that was a good idea, since the end result looks a bit
    complicated.
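    To make the new interface concrete, here is a hypothetical ops
    table (illustrative only, not part of this series) that truncates
    whole, in-core folios while counting them through the opaque
    cookie, and bails out on anything else. It follows the callback
    signatures introduced below, and would have to live in mm/shmem.c
    since shmem_undo_range() is static there:

        /* Hypothetical example: count whole folios while truncating. */
        static long count_undo_swap(struct address_space *mapping, pgoff_t index,
                                    void *old, void *arg)
        {
                return -EOPNOTSUPP;     /* stop the undo on swapped-out pages */
        }

        static int count_undo_folio(struct address_space *mapping,
                                    struct folio *folio, void *arg)
        {
                (*(unsigned long *)arg)++;      /* cookie counts folios freed */
                truncate_inode_folio(mapping, folio);
                return 0;
        }

        static int count_undo_partial_folio(struct folio *folio, pgoff_t lstart,
                                            pgoff_t lend, void *arg)
        {
                return -EOPNOTSUPP;     /* refuse to split large folios */
        }

        static const struct shmem_undo_range_ops count_undo_ops = {
                .undo_swap = count_undo_swap,
                .undo_folio = count_undo_folio,
                .undo_partial_folio = count_undo_partial_folio,
        };

        /*
         * Usage sketch:
         *   unsigned long nr = 0;
         *   int err = shmem_undo_range(inode, 0, -1, false,
         *                              &count_undo_ops, &nr);
         */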
mm/shmem.c | 165 +++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 134 insertions(+), 31 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index 4ea6109a80431..d6d9266b27b75 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1064,12 +1064,56 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index) return folio; } +struct shmem_undo_range_ops { + /* Return -ve on error, or number of entries freed. */ + long (*undo_swap)(struct address_space *mapping, pgoff_t index, + void *old, void *arg); + /* Return -ve on error, 0 on success. */ + int (*undo_folio)(struct address_space *mapping, struct folio *folio, + void *arg); + /* + * Return -ve on error, 0 if splitting failed, 1 if splitting succeeded. + */ + int (*undo_partial_folio)(struct folio *folio, pgoff_t lstart, + pgoff_t lend, void *arg); +}; + +static long shmem_default_undo_swap(struct address_space *mapping, pgoff_t index, + void *old, void *arg) +{ + return shmem_free_swap(mapping, index, old); +} + +static int shmem_default_undo_folio(struct address_space *mapping, + struct folio *folio, void *arg) +{ + truncate_inode_folio(mapping, folio); + return 0; +} + +static int shmem_default_undo_partial_folio(struct folio *folio, pgoff_t lstart, + pgoff_t lend, void *arg) +{ + /* + * Function returns bool. Convert it to int and return. No error + * returns needed here. + */ + return truncate_inode_partial_folio(folio, lstart, lend); +} + +static const struct shmem_undo_range_ops shmem_default_undo_ops = { + .undo_swap = shmem_default_undo_swap, + .undo_folio = shmem_default_undo_folio, + .undo_partial_folio = shmem_default_undo_partial_folio, +}; + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. 
*/ -static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, - bool unfalloc) +static int shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, + bool unfalloc, + const struct shmem_undo_range_ops *ops, void *arg) { struct address_space *mapping = inode->i_mapping; struct shmem_inode_info *info = SHMEM_I(inode); @@ -1081,7 +1125,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, bool same_folio; long nr_swaps_freed = 0; pgoff_t index; - int i; + int i, ret = 0; if (lend == -1) end = -1; /* unsigned, so actually very big */ @@ -1099,17 +1143,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, if (xa_is_value(folio)) { if (unfalloc) continue; - nr_swaps_freed += shmem_free_swap(mapping, - indices[i], folio); + + ret = ops->undo_swap(mapping, indices[i], folio, + arg); + if (ret < 0) { + folio_unlock(folio); + break; + } + + nr_swaps_freed += ret; continue; } - if (!unfalloc || !folio_test_uptodate(folio)) - truncate_inode_folio(mapping, folio); + if (!unfalloc || !folio_test_uptodate(folio)) { + ret = ops->undo_folio(mapping, folio, arg); + if (ret < 0) { + folio_unlock(folio); + break; + } + } folio_unlock(folio); } folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); + if (ret < 0) + goto out; cond_resched(); } @@ -1127,7 +1185,13 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, if (folio) { same_folio = lend < folio_pos(folio) + folio_size(folio); folio_mark_dirty(folio); - if (!truncate_inode_partial_folio(folio, lstart, lend)) { + ret = ops->undo_partial_folio(folio, lstart, lend, arg); + if (ret < 0) { + folio_unlock(folio); + folio_put(folio); + goto out; + } + if (ret == 0) { start = folio_next_index(folio); if (same_folio) end = folio->index; @@ -1141,7 +1205,14 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, folio = shmem_get_partial_folio(inode, lend >> PAGE_SHIFT); if (folio) { folio_mark_dirty(folio); - if (!truncate_inode_partial_folio(folio, lstart, lend)) + ret = ops->undo_partial_folio(folio, lstart, lend, arg); + if (ret < 0) { + folio_unlock(folio); + folio_put(folio); + goto out; + } + + if (ret == 0) end = folio->index; folio_unlock(folio); folio_put(folio); @@ -1166,18 +1237,21 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, folio = fbatch.folios[i]; if (xa_is_value(folio)) { - long swaps_freed; - if (unfalloc) continue; - swaps_freed = shmem_free_swap(mapping, indices[i], folio); - if (!swaps_freed) { + + ret = ops->undo_swap(mapping, indices[i], folio, + arg); + if (ret < 0) { + break; + } else if (ret == 0) { /* Swap was replaced by page: retry */ index = indices[i]; break; + } else { + nr_swaps_freed += ret; + continue; } - nr_swaps_freed += swaps_freed; - continue; } folio_lock(folio); @@ -1193,35 +1267,58 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, folio); if (!folio_test_large(folio)) { - truncate_inode_folio(mapping, folio); - } else if (truncate_inode_partial_folio(folio, lstart, lend)) { - /* - * If we split a page, reset the loop so - * that we pick up the new sub pages. - * Otherwise the THP was entirely - * dropped or the target range was - * zeroed, so just continue the loop as - * is. 
- */ - if (!folio_test_large(folio)) { + ret = ops->undo_folio(mapping, folio, + arg); + if (ret < 0) { folio_unlock(folio); - index = start; break; } + } else { + ret = ops->undo_partial_folio(folio, lstart, lend, arg); + if (ret < 0) { + folio_unlock(folio); + break; + } + + if (ret) { + /* + * If we split a page, reset the loop so + * that we pick up the new sub pages. + * Otherwise the THP was entirely + * dropped or the target range was + * zeroed, so just continue the loop as + * is. + */ + if (!folio_test_large(folio)) { + folio_unlock(folio); + index = start; + break; + } + } } } folio_unlock(folio); } folio_batch_remove_exceptionals(&fbatch); folio_batch_release(&fbatch); + if (ret < 0) + goto out; } + ret = 0; +out: shmem_recalc_inode(inode, 0, -nr_swaps_freed); + return ret; } void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend) { - shmem_undo_range(inode, lstart, lend, false); + int ret; + + ret = shmem_undo_range(inode, lstart, lend, false, + &shmem_default_undo_ops, NULL); + + WARN(ret < 0, "shmem_undo_range() should never fail with default ops"); inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode)); inode_inc_iversion(inode); } @@ -3740,9 +3837,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, info->fallocend = undo_fallocend; /* Remove the !uptodate folios we added */ if (index > start) { - shmem_undo_range(inode, - (loff_t)start << PAGE_SHIFT, - ((loff_t)index << PAGE_SHIFT) - 1, true); + int ret; + + ret = shmem_undo_range(inode, + (loff_t)start << PAGE_SHIFT, + ((loff_t)index << PAGE_SHIFT) - 1, + true, + &shmem_default_undo_ops, + NULL); + WARN(ret < 0, "shmem_undo_range() should never fail with default ops"); } goto undone; } }

From patchwork Fri Mar 7 00:57:38 2025
X-Patchwork-Submitter: Pratyush Yadav
X-Patchwork-Id: 14005615
From: Pratyush Yadav To: CC: Pratyush Yadav , Jonathan Corbet , "Eric Biederman" , Arnd Bergmann , "Greg Kroah-Hartman" , Alexander Viro , Christian Brauner , Jan Kara , Hugh Dickins , Alexander Graf , Benjamin Herrenschmidt , "David Woodhouse" , James Gowans , "Mike Rapoport" , Paolo Bonzini , "Pasha Tatashin" , Anthony Yznaga , Dave Hansen , David Hildenbrand , Jason Gunthorpe , Matthew Wilcox , "Wei Yang" , Andrew Morton , , , , Subject: [RFC PATCH 4/5] mm: shmem: allow preserving file over FDBOX + KHO Date: Fri, 7 Mar 2025 00:57:38 +0000 Message-ID: <20250307005830.65293-5-ptyadav@amazon.de> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250307005830.65293-1-ptyadav@amazon.de> References: <20250307005830.65293-1-ptyadav@amazon.de> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 For applications with a large amount of memory that takes time to rebuild, rebooting to consume kernel upgrades can be very expensive. FDBox allows preserving file descriptors over kexec using KHO. Combining that with memfd gives those applications reboot-persistent memory that they can use to quickly save and reconstruct their state.
Since memfd is backed by either hugetlbfs or shmem, use shmem as the first FDBOX + KHO capable backend for memfd. To preserve the file's contents across KHO activation, walk the file's page cache, remove all entries, and record their indices. Use the newly introduced shmem_undo_range_ops to achieve this. For each entry, take a refcount on the folio before truncating it so it does not get freed, and store its physical address and index in the kho_mem and indices arrays. Swap pages, partial folios, and huge folios are not supported yet; encountering one of those results in an error. On the restore side, an empty file is created and the mems array is walked to insert the pages back into the page cache, roughly following the logic of shmem_alloc_and_add_folio(). Signed-off-by: Pratyush Yadav --- include/linux/shmem_fs.h | 6 + mm/shmem.c | 333 +++++++++++++++++++++++++++++++++++++++ 2 files changed, 339 insertions(+) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 0b273a7b9f01d..263416f357fe1 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -205,6 +205,12 @@ extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd, #endif /* CONFIG_SHMEM */ #endif /* CONFIG_USERFAULTFD */ +#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER) +bool is_node_shmem(const void *fdt, int offset); +int shmem_fdbox_kho_write(struct fdbox_fd *ffd, void *fdt); +struct file *shmem_fdbox_kho_recover(const void *fdt, int offset); +#endif + /* * Used space is stored as unsigned 64-bit value in bytes but * quota core supports only signed 64-bit values so use that diff --git a/mm/shmem.c b/mm/shmem.c index d6d9266b27b75..c2efdb34a1a18 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -41,6 +41,12 @@ #include #include #include +#include +#include +#include +#include +#include +#include #include "swap.h" static struct vfsmount *shm_mnt __ro_after_init; @@ -5283,6 +5289,333 @@ static int shmem_error_remove_folio(struct address_space *mapping, return 0; } +#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER) +static const char fdbox_kho_compatible[] = "fdbox,shmem-v1"; + +bool is_node_shmem(const void *fdt, int offset) +{ + return fdt_node_check_compatible(fdt, offset, fdbox_kho_compatible) == 0; +} + +struct shmem_fdbox_put_arg { + struct kho_mem *mems; + unsigned long *indices; + unsigned long nr_mems; + unsigned long idx; +}; + +static long shmem_fdbox_undo_swap(struct address_space *mapping, pgoff_t index, + void *old, void *arg) +{ + return -EOPNOTSUPP; +} + +static int shmem_fdbox_undo_folio(struct address_space *mapping, + struct folio *folio, void *__arg) +{ + struct shmem_fdbox_put_arg *arg = __arg; + struct kho_mem *mem; + + if (arg->idx == arg->nr_mems) + return -ENOSPC; + + if (folio_nr_pages(folio) != 1) + return -EOPNOTSUPP; + + /* + * Grab an extra refcount to the folio so it sticks around after + * truncation.
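+ * The extra reference is dropped again on the error paths in + * shmem_fdbox_kho_get_mems() and shmem_fdbox_kho_write(), or on the + * restore side once the page has been re-inserted into the page cache.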
+ */ + folio_get(folio); + + mem = arg->mems + arg->idx; + + mem->addr = PFN_PHYS(folio_pfn(folio)); + mem->size = PAGE_SIZE; + arg->indices[arg->idx] = folio_index(folio); + arg->idx++; + + truncate_inode_folio(mapping, folio); + return 0; +} + +static int shmem_fdbox_undo_partial_folio(struct folio *folio, pgoff_t lstart, + pgoff_t lend, void *arg) +{ + return -EOPNOTSUPP; +} + +static const struct shmem_undo_range_ops shmem_fdbox_undo_ops = { + .undo_swap = shmem_fdbox_undo_swap, + .undo_folio = shmem_fdbox_undo_folio, + .undo_partial_folio = shmem_fdbox_undo_partial_folio, +}; + +static struct kho_mem *shmem_fdbox_kho_get_mems(struct inode *inode, + unsigned long **indicesp, + unsigned long *nr) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + unsigned long *indices __free(kvfree) = NULL; + struct kho_mem *mems __free(kvfree) = NULL; + struct shmem_fdbox_put_arg arg; + unsigned long nr_mems; + int ret, i; + + scoped_guard(spinlock, &info->lock) { + /* TODO: Support swapped pages. Perhaps swap them back in? */ + if (info->swapped) + return ERR_PTR(-EOPNOTSUPP); + + /* + * Estimate the size of the array using the size of the inode, + * assuming there are no contiguous pages. + */ + nr_mems = info->alloced; + } + + mems = kvmalloc_array(nr_mems, sizeof(*mems), GFP_KERNEL); + if (!mems) + return ERR_PTR(-ENOMEM); + + indices = kvmalloc_array(nr_mems, sizeof(*indices), GFP_KERNEL); + if (!indices) + return ERR_PTR(-ENOMEM); + + arg.mems = mems; + arg.indices = indices; + arg.nr_mems = nr_mems; + arg.idx = 0; + + ret = shmem_undo_range(inode, 0, -1, false, &shmem_fdbox_undo_ops, &arg); + if (ret < 0) { + pr_err("shmem: failed to undo fdbox range: %d\n", ret); + goto err; + } + + *nr = arg.idx; + *indicesp = no_free_ptr(indices); + return_ptr(mems); + +err: + /* + * TODO: This kills the whole file on failure to KHO. We should keep the + * contents around for another try later. The problem is, if re-adding + * pages fails, there would be no recovery at that point. Ideally, we + * should first serialize the whole file, and only then remove things + * from page cache so we are sure to never fail. + */ + for (i = 0; i < arg.idx; i++) { + struct folio *folio = page_folio(phys_to_page(mems[i].addr)); + + folio_put(folio); + } + + /* Undo the rest of the file. This should not fail. */ + WARN_ON(shmem_undo_range(inode, 0, -1, false, &shmem_default_undo_ops, NULL)); + return ERR_PTR(ret); +} + +int shmem_fdbox_kho_write(struct fdbox_fd *box_fd, void *fdt) +{ + struct inode *inode = box_fd->file->f_inode; + unsigned long *indices __free(kvfree) = NULL; + struct kho_mem *mems __free(kvfree) = NULL; + u64 pos = box_fd->file->f_pos, size = inode->i_size; + unsigned long nr_mems, i; + int ret = 0; + + /* + * mems can be larger than sizeof(*mems) * nr_mems, but we should only + * look at things in the range of 0 to nr_mems. + */ + mems = shmem_fdbox_kho_get_mems(inode, &indices, &nr_mems); + if (IS_ERR(mems)) + return PTR_ERR(mems); + + /* + * fdbox should have already started the node. We can start adding + * properties directly. 
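+ * "compatible" selects the restore handler, "pos" and "size" capture + * the file position and length, and "mem" plus "indices" record each + * page's physical address and its page cache offset.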
+ */ + ret |= fdt_property(fdt, "compatible", fdbox_kho_compatible, + sizeof(fdbox_kho_compatible)); + ret |= fdt_property(fdt, "pos", &pos, sizeof(u64)); + ret |= fdt_property(fdt, "size", &size, sizeof(u64)); + ret |= fdt_property(fdt, "mem", mems, sizeof(*mems) * nr_mems); + ret |= fdt_property(fdt, "indices", indices, sizeof(*indices) * nr_mems); + + if (ret) { + pr_err("shmem: failed to add properties to FDT!\n"); + ret = -EINVAL; + goto err; + } + + return 0; + +err: + /* + * TODO: This kills the whole file on failure to KHO. We should keep the + * contents around for another try later. The problem is, if re-adding + * pages fails, there would be no recovery at that point. Ideally, we + * should first serialize the whole file, and only then remove things + * from page cache so we are sure to never fail. + */ + for (i = 0; i < nr_mems; i++) { + struct folio *folio = page_folio(phys_to_page(mems[i].addr)); + + folio_put(folio); + } + return ret; +} + +struct file *shmem_fdbox_kho_recover(const void *fdt, int offset) +{ + struct address_space *mapping; + char pathbuf[1024] = "", *path; + const unsigned long *indices; + const struct kho_mem *mems; + unsigned long nr_mems, i = 0; + const u64 *pos, *size; + struct inode *inode; + struct file *file; + int len, ret; + + ret = fdt_node_check_compatible(fdt, offset, fdbox_kho_compatible); + if (ret) { + pr_err("shmem: invalid compatible\n"); + goto err; + } + + mems = fdt_getprop(fdt, offset, "mem", &len); + if (!mems || len % sizeof(*mems)) { + pr_err("shmem: invalid mems property\n"); + goto err; + } + nr_mems = len / sizeof(*mems); + + indices = fdt_getprop(fdt, offset, "indices", &len); + if (!indices || len % sizeof(unsigned long)) { + pr_err("shmem: invalid indices property\n"); + goto err_return; + } + if (len / sizeof(unsigned long) != nr_mems) { + pr_err("shmem: number of indices and mems do not match\n"); + goto err_return; + } + + size = fdt_getprop(fdt, offset, "size", &len); + if (!size || len != sizeof(u64)) { + pr_err("shmem: invalid size property\n"); + goto err_return; + } + + pos = fdt_getprop(fdt, offset, "pos", &len); + if (!pos || len != sizeof(u64)) { + pr_err("shmem: invalid pos property\n"); + goto err_return; + } + + /* + * TODO: This sets UID/GID, cgroup accounting to root. Should this + * be given to the first user that maps the FD instead? + */ + file = shmem_file_setup(fdt_get_name(fdt, offset, NULL), 0, + VM_NORESERVE); + if (IS_ERR(file)) { + pr_err("shmem: failed to setup file\n"); + goto err_return; + } + + inode = file->f_inode; + mapping = inode->i_mapping; + vfs_setpos(file, *pos, MAX_LFS_FILESIZE); + + for (; i < nr_mems; i++) { + struct folio *folio; + void *va; + + if (mems[i].size != PAGE_SIZE) { + pr_err("shmem: unknown kho_mem size %llx. Expected %lx\n", + mems[i].size, PAGE_SIZE); + goto err_return; + } + + va = kho_claim_mem(&mems[i]); + folio = virt_to_folio(va); + + /* Set up the folio for insertion. */ + + /* + * TODO: This breaks falloc-ed folios since now they get marked + * uptodate when they might not actually be zeroed out yet. Need + * a way to distinguish falloc-ed folios. + */ + folio_mark_uptodate(folio); + folio_mark_dirty(folio); + + /* + * TODO: Should find a way to unify this and + * shmem_alloc_and_add_folio(). 
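+ * The steps below mirror it: lock the folio, mark it swap-backed, + * charge it to the memcg, add it to the page cache at the saved + * index, and account the block.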
+ */ + __folio_set_locked(folio); + __folio_set_swapbacked(folio); + + ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping)); + if (ret) { + folio_unlock(folio); + folio_put(folio); + fput(file); + pr_err("shmem: failed to charge folio index %lu\n", i); + goto err_return_next; + } + + ret = shmem_add_to_page_cache(folio, mapping, indices[i], NULL, + mapping_gfp_mask(mapping)); + if (ret) { + folio_unlock(folio); + folio_put(folio); + fput(file); + pr_err("shmem: failed to add to page cache folio index %lu\n", i); + goto err_return_next; + } + + ret = shmem_inode_acct_blocks(inode, 1); + if (ret) { + folio_unlock(folio); + folio_put(folio); + fput(file); + pr_err("shmem: failed to account folio index %lu\n", i); + goto err_return_next; + } + + shmem_recalc_inode(inode, 1, 0); + folio_add_lru(folio); + folio_unlock(folio); + folio_put(folio); + } + + inode->i_size = *size; + + return file; + +err_return: + kho_return_mem(mems + i); +err_return_next: + for (i = i + 1; i < nr_mems; i++) + kho_return_mem(mems + i); +err: + ret = fdt_get_path(fdt, offset, pathbuf, sizeof(pathbuf)); + if (ret) + path = "unknown"; + else + path = pathbuf; + + pr_err("shmem: error when recovering KHO node '%s'\n", path); + return NULL; +} + +#endif /* CONFIG_FDBOX && CONFIG_KEXEC_HANDOVER */ + static const struct address_space_operations shmem_aops = { .writepage = shmem_writepage, .dirty_folio = noop_dirty_folio, From patchwork Fri Mar 7 00:57:39 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Pratyush Yadav X-Patchwork-Id: 14005617
From: Pratyush Yadav To: CC: Pratyush Yadav , Jonathan Corbet , "Eric Biederman" , Arnd Bergmann , "Greg Kroah-Hartman" , Alexander Viro , Christian Brauner , Jan Kara , Hugh Dickins , Alexander Graf , Benjamin Herrenschmidt , "David Woodhouse" , James Gowans , "Mike Rapoport" , Paolo Bonzini , "Pasha Tatashin" , Anthony Yznaga , Dave Hansen , David Hildenbrand , Jason Gunthorpe , Matthew Wilcox , "Wei Yang" , Andrew Morton , , , , Subject: [RFC PATCH 5/5] mm/memfd: allow preserving FD over FDBOX + KHO Date: Fri, 7 Mar 2025 00:57:39 +0000 Message-ID: <20250307005830.65293-6-ptyadav@amazon.de> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250307005830.65293-1-ptyadav@amazon.de> References: <20250307005830.65293-1-ptyadav@amazon.de> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 For applications with a large amount of memory that takes time to rebuild, rebooting to consume kernel upgrades can be very expensive. FDBox allows preserving file descriptors over kexec using KHO. Combining that with memfd gives those applications reboot-persistent memory that they can use to quickly save and reconstruct their state. While memfd is backed by either hugetlbfs or shmem, only the shmem backend is supported for now. Allow saving and restoring shmem FDs over FDBOX + KHO. The memfd FDT node itself does not contain much information: it only creates a subnode and hands it over to shmem to serialize the underlying file.
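For illustration, one boxed memfd is laid out roughly like this in the KHO FDT (a sketch assembled from this patch and the shmem patch; node names are illustrative, and the shmem subnode is actually created with an empty name via fdt_begin_node(fdt, "")):

    memfd-node {                  /* written by memfd_fdbox_kho_write() */
            compatible = "fdbox,memfd-v1";
            subnode {             /* real name is "", shown named for clarity */
                    compatible = "fdbox,shmem-v1";
                    pos;          /* u64, file->f_pos */
                    size;         /* u64, inode->i_size */
                    mem;          /* array of struct kho_mem: {addr, size} per page */
                    indices;      /* page cache index for each mem entry */
            };
    };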
The restore side follows the same pattern. Since there are now two paths for creating a shmem-backed memfd file, refactor the file setup into its own function, memfd_setup_file(). It sets up the file flags, mode, and seals, and installs the fdbox ops if enabled. Signed-off-by: Pratyush Yadav --- mm/memfd.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++++----- 1 file changed, 116 insertions(+), 12 deletions(-) diff --git a/mm/memfd.c b/mm/memfd.c index 37f7be57c2f50..1c32e66197f6d 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -7,6 +7,8 @@ * This file is released under the GPL. */ +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt + #include #include #include @@ -19,8 +21,12 @@ #include #include #include +#include +#include #include +static const struct fdbox_file_ops memfd_fdbox_fops; + /* * We need a tag: a new tag would expand every xa_node by 8 bytes, * so reuse a tag which we firmly believe is never set or cleared on tmpfs @@ -418,21 +424,10 @@ static char *alloc_name(const char __user *uname) return ERR_PTR(error); } -static struct file *alloc_file(const char *name, unsigned int flags) +static void memfd_setup_file(struct file *file, unsigned int flags) { unsigned int *file_seals; - struct file *file; - if (flags & MFD_HUGETLB) { - file = hugetlb_file_setup(name, 0, VM_NORESERVE, - HUGETLB_ANONHUGE_INODE, - (flags >> MFD_HUGE_SHIFT) & - MFD_HUGE_MASK); - } else { - file = shmem_file_setup(name, 0, VM_NORESERVE); - } - if (IS_ERR(file)) - return file; file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; file->f_flags |= O_LARGEFILE; @@ -452,6 +447,27 @@ static struct file *alloc_file(const char *name, unsigned int flags) *file_seals &= ~F_SEAL_SEAL; } +#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER) + file->f_fdbox_op = &memfd_fdbox_fops; +#endif +} + +static struct file *alloc_file(const char *name, unsigned int flags) +{ + struct file *file; + + if (flags & MFD_HUGETLB) { + file = hugetlb_file_setup(name, 0, VM_NORESERVE, + HUGETLB_ANONHUGE_INODE, + (flags >> MFD_HUGE_SHIFT) & + MFD_HUGE_MASK); + } else { + file = shmem_file_setup(name, 0, VM_NORESERVE); + } + if (IS_ERR(file)) + return file; + + memfd_setup_file(file, flags); return file; } @@ -493,3 +509,91 @@ SYSCALL_DEFINE2(memfd_create, kfree(name); return error; } + +#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER) +static const char memfd_fdbox_compatible[] = "fdbox,memfd-v1"; + +static struct file *memfd_fdbox_kho_recover(const void *fdt, int offset) +{ + struct file *file; + int ret, subnode; + + ret = fdt_node_check_compatible(fdt, offset, memfd_fdbox_compatible); + if (ret) { + pr_err("kho: invalid compatible\n"); + return NULL; + } + + /* Make sure there is exactly one subnode. */ + subnode = fdt_first_subnode(fdt, offset); + if (subnode < 0) { + pr_err("kho: no subnode for underlying storage found!\n"); + return NULL; + } + if (fdt_next_subnode(fdt, subnode) >= 0) { + pr_err("kho: too many subnodes. Expected only 1.\n"); + return NULL; + } + + if (is_node_shmem(fdt, subnode)) { + file = shmem_fdbox_kho_recover(fdt, subnode); + if (!file) + return NULL; + + memfd_setup_file(file, 0); + return file; + } + + return NULL; +} + +static int memfd_fdbox_kho_write(struct fdbox_fd *box_fd, void *fdt) +{ + int ret = 0; + + ret |= fdt_property(fdt, "compatible", memfd_fdbox_compatible, + sizeof(memfd_fdbox_compatible)); + + /* TODO: Track seals on the file as well.
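+ * Without that, a restored memfd comes back with default seals, + * since memfd_fdbox_kho_recover() calls memfd_setup_file(file, 0).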
*/ + + ret |= fdt_begin_node(fdt, ""); + if (ret) { + pr_err("kho: failed to set up memfd node\n"); + return -EINVAL; + } + + if (shmem_file(box_fd->file)) + ret = shmem_fdbox_kho_write(box_fd, fdt); + else + /* TODO: HugeTLB support. */ + ret = -EOPNOTSUPP; + + if (ret) + return ret; + + ret = fdt_end_node(fdt); + if (ret) { + pr_err("kho: failed to end memfd node!\n"); + return ret; + } + + return 0; +} + +static const struct fdbox_file_ops memfd_fdbox_fops = { + .kho_write = memfd_fdbox_kho_write, +}; + +static int __init memfd_fdbox_init(void) +{ + int error; + + error = fdbox_register_handler(memfd_fdbox_compatible, + memfd_fdbox_kho_recover); + if (error) + pr_err("Could not register fdbox handler: %d\n", error); + + return 0; +} +late_initcall(memfd_fdbox_init); +#endif /* CONFIG_FDBOX && CONFIG_KEXEC_HANDOVER */
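As a rough end-to-end illustration, a user of this series might drive it as sketched below. The /dev/fdbox path, the FDBOX_* request codes, and struct fdbox_put are placeholders assumed for illustration only; the real UAPI comes from the FDBox patch and is not spelled out in this one.

/*
 * Hypothetical userspace flow; device path, ioctl names, and argument
 * layout are assumptions, not the actual FDBox UAPI.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define FDBOX_PUT_FD	0x4601	/* placeholder request codes */
#define FDBOX_GET_FD	0x4602

struct fdbox_put {		/* assumed argument layout */
	int fd;
	char name[64];
};

int main(void)
{
	int memfd = memfd_create("vm-state", 0);
	int box = open("/dev/fdbox", O_RDWR);	/* assumed device node */
	struct fdbox_put arg = { .fd = memfd };

	strcpy(arg.name, "vm-state");
	ftruncate(memfd, 1UL << 30);	/* fill with state worth preserving */

	/* Hand the labeled FD to the kernel before kexec... */
	if (ioctl(box, FDBOX_PUT_FD, &arg))
		perror("FDBOX_PUT_FD");

	/* ...and after booting the new kernel, fetch it back by name. */
	if (ioctl(box, FDBOX_GET_FD, &arg))
		perror("FDBOX_GET_FD");

	return 0;
}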