From patchwork Mon Feb 5 12:01:46 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545321 Received: from smtp-fw-9106.amazon.com (smtp-fw-9106.amazon.com [207.171.188.206]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E87BF1AACF; Mon, 5 Feb 2024 12:02:27 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=207.171.188.206 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134549; cv=none; b=tN9N11+/bs67/kIY+g53EPbcEEeflSkH1zs3kkYKw50DgqCZk+alRRPFTW+KAOGdhlMgRqs+u9xj9iM0/46tdoApKMqW2dN4/sDwtgNajAeXc7fLFkvkVKUpKOc25Rj4bSU2fpCHJrEE1Q/ORMxkQfMNFV2pzcG2ZgxKpXfn0Mw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134549; c=relaxed/simple; bh=RV+Ytw8ThVExKlkqCCz/3DDoByyrvDLhxUnYt76lnO8=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=HTg/jKuyBiCsP4PFdL2Phu3DBq0wFvOGyBG5mqOSiHR07vZxr3l6B9+KWf3M8Wf2GaJrdmZCtnu38Lb2dgWoLk4zWyuWj9iEy8jdm9mWEDGjBGnmLtwsd7c7Jwp/DVcTgpNaY9rEheBFHYx2Gaj2C5M/+HVfXKmeplAtySGOWaE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=b3mmLUM8; arc=none smtp.client-ip=207.171.188.206 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="b3mmLUM8" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134549; x=1738670549; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=ulQJ3D4Me8ScxlcztMMDIgEvnR/mGPMynfspToLkF2M=; b=b3mmLUM8uxUdbQnSkV2tBea6B5synisslurnQMLhT1Y48dqsPRoeoRTI 6zWc/GGxW9avbDPVCdKxH58P9Dr0k5iHKpkCzx+wqjbVJk03MVe6yB7e7 EYdCy2Rd/TN/Ur0HgkAZqbavOeRmsOHJuvj7iR+0Yjh9S/ByJIIE4Kmac g=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="702145833" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-9106.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:02:22 +0000 Received: from EX19MTAEUB002.ant.amazon.com [10.0.43.254:59802] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.28.192:2525] with esmtp (Farcaster) id c85ecf83-e3a4-4c42-963b-1a5c7099c0b7; Mon, 5 Feb 2024 12:02:20 +0000 (UTC) X-Farcaster-Flow-ID: c85ecf83-e3a4-4c42-963b-1a5c7099c0b7 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUB002.ant.amazon.com (10.252.51.59) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:02:20 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:02:14 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 01/18] pkernfs: Introduce filesystem skeleton Date: Mon, 5 Feb 2024 12:01:46 +0000 Message-ID: <20240205120203.60312-2-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D045UWA002.ant.amazon.com (10.13.139.12) To EX19D014EUC004.ant.amazon.com (10.252.51.182) Add an in-memory filesystem: pkernfs. Memory is donated to pkernfs by carving it out of the normal System RAM range with the memmap= cmdline parameter and then giving that same physical range to pkernfs with the pkernfs= cmdline parameter. A new filesystem is added; so far it doesn't do much except persist a super block at the start of the donated memory and allows itself to be mounted. --- fs/Kconfig | 1 + fs/Makefile | 3 ++ fs/pkernfs/Kconfig | 9 ++++ fs/pkernfs/Makefile | 6 +++ fs/pkernfs/pkernfs.c | 99 ++++++++++++++++++++++++++++++++++++++++++++ fs/pkernfs/pkernfs.h | 6 +++ 6 files changed, 124 insertions(+) create mode 100644 fs/pkernfs/Kconfig create mode 100644 fs/pkernfs/Makefile create mode 100644 fs/pkernfs/pkernfs.c create mode 100644 fs/pkernfs/pkernfs.h diff --git a/fs/Kconfig b/fs/Kconfig index aa7e03cc1941..33a9770ae657 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -331,6 +331,7 @@ source "fs/sysv/Kconfig" source "fs/ufs/Kconfig" source "fs/erofs/Kconfig" source "fs/vboxsf/Kconfig" +source "fs/pkernfs/Kconfig" endif # MISC_FILESYSTEMS diff --git a/fs/Makefile b/fs/Makefile index f9541f40be4e..1af35b494b5d 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -19,6 +19,9 @@ obj-y := open.o read_write.o file_table.o super.o \ obj-$(CONFIG_BUFFER_HEAD) += buffer.o mpage.o obj-$(CONFIG_PROC_FS) += proc_namespace.o + +obj-y += pkernfs/ + obj-$(CONFIG_LEGACY_DIRECT_IO) += direct-io.o obj-y += notify/ obj-$(CONFIG_EPOLL) += eventpoll.o diff --git a/fs/pkernfs/Kconfig b/fs/pkernfs/Kconfig new file mode 100644 index 000000000000..59621a1d9aef --- /dev/null +++ b/fs/pkernfs/Kconfig @@ -0,0 +1,9 @@ +# SPDX-License-Identifier: GPL-2.0-only + +config PKERNFS_FS + bool "Persistent Kernel filesystem (pkernfs)" + help + An in-memory filesystem on top of reserved memory specified via + pkernfs= cmdline argument. Used for storing kernel state and + userspace memory which is preserved across kexec to support + live update. diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile new file mode 100644 index 000000000000..17258cb77f58 --- /dev/null +++ b/fs/pkernfs/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0-only +# +# Makefile for persistent kernel filesystem +# + +obj-$(CONFIG_PKERNFS_FS) += pkernfs.o diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c new file mode 100644 index 000000000000..4c476ddc35b6 --- /dev/null +++ b/fs/pkernfs/pkernfs.c @@ -0,0 +1,99 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "pkernfs.h" +#include +#include +#include +#include +#include + +static phys_addr_t pkernfs_base, pkernfs_size; +static void *pkernfs_mem; +static const struct super_operations pkernfs_super_ops = { }; + +static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc) +{ + struct inode *inode; + struct dentry *dentry; + struct pkernfs_sb *psb; + + pkernfs_mem = memremap(pkernfs_base, pkernfs_size, MEMREMAP_WB); + psb = (struct pkernfs_sb *) pkernfs_mem; + + if (psb->magic_number == PKERNFS_MAGIC_NUMBER) { + pr_info("pkernfs: Restoring from super block\n"); + } else { + pr_info("pkernfs: Clean super block; initialising\n"); + psb->magic_number = PKERNFS_MAGIC_NUMBER; + } + + sb->s_op = &pkernfs_super_ops; + + inode = new_inode(sb); + if (!inode) + return -ENOMEM; + + inode->i_ino = 1; + inode->i_mode = S_IFDIR; + inode->i_op = &simple_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + inode->i_atime = inode->i_mtime = current_time(inode); + inode_set_ctime_current(inode); + /* directory inodes start off with i_nlink == 2 (for "." entry) */ + inc_nlink(inode); + + dentry = d_make_root(inode); + if (!dentry) + return -ENOMEM; + sb->s_root = dentry; + + return 0; +} + +static int pkernfs_get_tree(struct fs_context *fc) +{ + return get_tree_nodev(fc, pkernfs_fill_super); +} + +static const struct fs_context_operations pkernfs_context_ops = { + .get_tree = pkernfs_get_tree, +}; + +static int pkernfs_init_fs_context(struct fs_context *const fc) +{ + fc->ops = &pkernfs_context_ops; + return 0; +} + +static struct file_system_type pkernfs_fs_type = { + .owner = THIS_MODULE, + .name = "pkernfs", + .init_fs_context = pkernfs_init_fs_context, + .kill_sb = kill_litter_super, + .fs_flags = FS_USERNS_MOUNT, +}; + +static int __init pkernfs_init(void) +{ + int ret; + + ret = register_filesystem(&pkernfs_fs_type); + return ret; +} + +/** + * Format: pkernfs=: + * Just like: memmap=nn[KMG]!ss[KMG] + */ +static int __init parse_pkernfs_extents(char *p) +{ + pkernfs_size = memparse(p, &p); + p++; /* Skip over ! char */ + pkernfs_base = memparse(p, &p); + return 0; +} + +early_param("pkernfs", parse_pkernfs_extents); + +MODULE_ALIAS_FS("pkernfs"); +module_init(pkernfs_init); diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h new file mode 100644 index 000000000000..bd1e2a6fd336 --- /dev/null +++ b/fs/pkernfs/pkernfs.h @@ -0,0 +1,6 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#define PKERNFS_MAGIC_NUMBER 0x706b65726e6673 +struct pkernfs_sb { + unsigned long magic_number; +}; From patchwork Mon Feb 5 12:01:47 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545322 Received: from smtp-fw-80007.amazon.com (smtp-fw-80007.amazon.com [99.78.197.218]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CB5331B7FC; Mon, 5 Feb 2024 12:02:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.218 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134551; cv=none; b=YtqlS9XWObXzGjvXO4VGj97Cfts5pzf17z4It4SHLmTQhFhcpb6E05grmxwGbtJq41X5gF6mOpr4918jN9QwGm++H1n/pKUvw2sxkrXf3Z+B2yjB9XsFta703QnLpq15zizBMqzsyYdowRFVYwDNSGe3oDB6Purbes2H2WnTld4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134551; c=relaxed/simple; bh=SnpnjsIe5mX0+E5JVEhCwr63Ah3ckizateAvEwVUJoo=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=QH326F/wTJacONs6mn3+FHQ4TbeWa+g3aaQRHrQMyg3MfpiMp6ryYETf/oQJ9ZZVO7HOJYnevAsYPOACpqpsB77wTWSFOnc0N0nnqGLIcaDU+gzNwvUSyNSoTLt+XjDAc5wo9La1rWKBu2ddDv3DZYjAoh7uIBE0KVdowWGQIpg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=TA0K/s1I; arc=none smtp.client-ip=99.78.197.218 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="TA0K/s1I" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134550; x=1738670550; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=DNmTqGFg0B4wwqijKuh2wndJnn1CFCRShUGGQJLOmFw=; b=TA0K/s1IJHf/prFJ+13DsO3hRi6rIgNVOyxZkj9Mc4tZPsOBC4X0GPps 4beNfJUsnGEfWYwxzHtz+IGhFINgJh5da5qq9tB4yyUzCYKcIS3rOnkyO FTdoHF0fuS35Wad33aVxPAF3orkzA+n9bitGEca8aj8giW0lsHjNSMJSJ o=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="271936637" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-80007.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:02:28 +0000 Received: from EX19MTAEUC001.ant.amazon.com [10.0.43.254:3818] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.33.186:2525] with esmtp (Farcaster) id e767a7b8-4373-41b8-af32-a81e33b618aa; Mon, 5 Feb 2024 12:02:27 +0000 (UTC) X-Farcaster-Flow-ID: e767a7b8-4373-41b8-af32-a81e33b618aa Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUC001.ant.amazon.com (10.252.51.193) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:02:26 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:02:20 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 02/18] pkernfs: Add persistent inodes hooked into directies Date: Mon, 5 Feb 2024 12:01:47 +0000 Message-ID: <20240205120203.60312-3-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D045UWA002.ant.amazon.com (10.13.139.12) To EX19D014EUC004.ant.amazon.com (10.252.51.182) Add the ability to create inodes for files and directories inside directories. Inodes are persistent in the in-memory filesystem; the second 2 MiB is used as an "inode store." The inode store is one big array of struct pkernfs_inodes and they use a linked list to point to the next sibling inode or in the case of a directory the child inode which is the first inode in that directory. Free inodese are similarly maintained in a linked list with the first free inode being pointed to by the super block. Directory file_operations are added to support iterating through the content of a directory. Simiarly inode operations are added to support creating a file inside a directory. This allocate the next free inode and makes it the head of tthe "child inode" linked list for the directory. Unlink is implemented to remove an inode from the linked list. This is a bit finicky as it is done differently depending on whether the inode is the first child of a directory or somewhere later in the linked list. --- fs/pkernfs/Makefile | 2 +- fs/pkernfs/dir.c | 43 +++++++++++++ fs/pkernfs/inode.c | 148 +++++++++++++++++++++++++++++++++++++++++++ fs/pkernfs/pkernfs.c | 13 ++-- fs/pkernfs/pkernfs.h | 34 ++++++++++ 5 files changed, 234 insertions(+), 6 deletions(-) create mode 100644 fs/pkernfs/dir.c create mode 100644 fs/pkernfs/inode.c diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile index 17258cb77f58..0a66e98bda07 100644 --- a/fs/pkernfs/Makefile +++ b/fs/pkernfs/Makefile @@ -3,4 +3,4 @@ # Makefile for persistent kernel filesystem # -obj-$(CONFIG_PKERNFS_FS) += pkernfs.o +obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o dir.o diff --git a/fs/pkernfs/dir.c b/fs/pkernfs/dir.c new file mode 100644 index 000000000000..b10ce745f19d --- /dev/null +++ b/fs/pkernfs/dir.c @@ -0,0 +1,43 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "pkernfs.h" + +static int pkernfs_dir_iterate(struct file *dir, struct dir_context *ctx) +{ + struct pkernfs_inode *pkernfs_inode; + struct super_block *sb = dir->f_inode->i_sb; + + /* Indication from previous invoke that there's no more to iterate. */ + if (ctx->pos == -1) + return 0; + + if (!dir_emit_dots(dir, ctx)) + return 0; + + /* + * Just emitted this dir; go to dir contents. Use pos to smuggle + * the next inode number to emit across iterations. + * -1 indicates no valid inode. Can't use 0 because first loop has pos=0 + */ + if (ctx->pos == 2) { + ctx->pos = pkernfs_get_persisted_inode(sb, dir->f_inode->i_ino)->child_ino; + /* Empty dir case. */ + if (ctx->pos == 0) + ctx->pos = -1; + } + + while (ctx->pos > 1) { + pkernfs_inode = pkernfs_get_persisted_inode(sb, ctx->pos); + dir_emit(ctx, pkernfs_inode->filename, PKERNFS_FILENAME_LEN, + ctx->pos, DT_UNKNOWN); + ctx->pos = pkernfs_inode->sibling_ino; + if (!ctx->pos) + ctx->pos = -1; + } + return 0; +} + +const struct file_operations pkernfs_dir_fops = { + .owner = THIS_MODULE, + .iterate_shared = pkernfs_dir_iterate, +}; diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c new file mode 100644 index 000000000000..f6584c8b8804 --- /dev/null +++ b/fs/pkernfs/inode.c @@ -0,0 +1,148 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "pkernfs.h" +#include + +const struct inode_operations pkernfs_dir_inode_operations; + +struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino) +{ + /* + * Inode index starts at 1, so -1 to get memory index. + */ + return ((struct pkernfs_inode *) (pkernfs_mem + PMD_SIZE)) + ino - 1; +} + +struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino) +{ + struct inode *inode = iget_locked(sb, ino); + + /* If this inode is cached it is already populated; just return */ + if (!(inode->i_state & I_NEW)) + return inode; + inode->i_op = &pkernfs_dir_inode_operations; + inode->i_sb = sb; + inode->i_mode = S_IFREG; + unlock_new_inode(inode); + return inode; +} + +static unsigned long pkernfs_allocate_inode(struct super_block *sb) +{ + + unsigned long next_free_ino; + struct pkernfs_sb *psb = (struct pkernfs_sb *) pkernfs_mem; + + next_free_ino = psb->next_free_ino; + if (!next_free_ino) + return -ENOMEM; + psb->next_free_ino = + pkernfs_get_persisted_inode(sb, next_free_ino)->sibling_ino; + return next_free_ino; +} + +/* + * Zeroes the inode and makes it the head of the free list. + */ +static void pkernfs_free_inode(struct super_block *sb, unsigned long ino) +{ + struct pkernfs_sb *psb = (struct pkernfs_sb *) pkernfs_mem; + struct pkernfs_inode *inode = pkernfs_get_persisted_inode(sb, ino); + + memset(inode, 0, sizeof(struct pkernfs_inode)); + inode->sibling_ino = psb->next_free_ino; + psb->next_free_ino = ino; +} + +void pkernfs_initialise_inode_store(struct super_block *sb) +{ + /* Inode store is a PMD sized (ie: 2 MiB) page */ + memset(pkernfs_get_persisted_inode(sb, 1), 0, PMD_SIZE); + /* Point each inode for the next one; linked-list initialisation. */ + for (unsigned long ino = 2; ino * sizeof(struct pkernfs_inode) < PMD_SIZE; ino++) + pkernfs_get_persisted_inode(sb, ino - 1)->sibling_ino = ino; +} + +static int pkernfs_create(struct mnt_idmap *id, struct inode *dir, + struct dentry *dentry, umode_t mode, bool excl) +{ + unsigned long free_inode; + struct pkernfs_inode *pkernfs_inode; + struct inode *vfs_inode; + + free_inode = pkernfs_allocate_inode(dir->i_sb); + if (free_inode <= 0) + return -ENOMEM; + + pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, free_inode); + pkernfs_inode->sibling_ino = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino; + pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = free_inode; + strscpy(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN); + pkernfs_inode->flags = PKERNFS_INODE_FLAG_FILE; + + vfs_inode = pkernfs_inode_get(dir->i_sb, free_inode); + d_instantiate(dentry, vfs_inode); + return 0; +} + +static struct dentry *pkernfs_lookup(struct inode *dir, + struct dentry *dentry, + unsigned int flags) +{ + struct pkernfs_inode *pkernfs_inode; + unsigned long ino; + + pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino); + ino = pkernfs_inode->child_ino; + while (ino) { + pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, ino); + if (!strncmp(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN)) { + d_add(dentry, pkernfs_inode_get(dir->i_sb, ino)); + break; + } + ino = pkernfs_inode->sibling_ino; + } + return NULL; +} + +static int pkernfs_unlink(struct inode *dir, struct dentry *dentry) +{ + unsigned long ino; + struct pkernfs_inode *inode; + + ino = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino; + + /* Special case for first file in dir */ + if (ino == dentry->d_inode->i_ino) { + pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = + pkernfs_get_persisted_inode(dir->i_sb, dentry->d_inode->i_ino)->sibling_ino; + pkernfs_free_inode(dir->i_sb, ino); + return 0; + } + + /* + * Although we know exactly the inode to free, because we maintain only + * a singly linked list we need to scan for it to find the previous + * element so it's "next" pointer can be updated. + */ + while (ino) { + inode = pkernfs_get_persisted_inode(dir->i_sb, ino); + /* We've found the one pointing to the one we want to delete */ + if (inode->sibling_ino == dentry->d_inode->i_ino) { + inode->sibling_ino = + pkernfs_get_persisted_inode(dir->i_sb, + dentry->d_inode->i_ino)->sibling_ino; + pkernfs_free_inode(dir->i_sb, dentry->d_inode->i_ino); + break; + } + ino = pkernfs_get_persisted_inode(dir->i_sb, ino)->sibling_ino; + } + + return 0; +} + +const struct inode_operations pkernfs_dir_inode_operations = { + .create = pkernfs_create, + .lookup = pkernfs_lookup, + .unlink = pkernfs_unlink, +}; diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c index 4c476ddc35b6..518c610e3877 100644 --- a/fs/pkernfs/pkernfs.c +++ b/fs/pkernfs/pkernfs.c @@ -8,7 +8,7 @@ #include static phys_addr_t pkernfs_base, pkernfs_size; -static void *pkernfs_mem; +void *pkernfs_mem; static const struct super_operations pkernfs_super_ops = { }; static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc) @@ -24,23 +24,26 @@ static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc) pr_info("pkernfs: Restoring from super block\n"); } else { pr_info("pkernfs: Clean super block; initialising\n"); + pkernfs_initialise_inode_store(sb); psb->magic_number = PKERNFS_MAGIC_NUMBER; + pkernfs_get_persisted_inode(sb, 1)->flags = PKERNFS_INODE_FLAG_DIR; + strscpy(pkernfs_get_persisted_inode(sb, 1)->filename, ".", PKERNFS_FILENAME_LEN); + psb->next_free_ino = 2; } sb->s_op = &pkernfs_super_ops; - inode = new_inode(sb); + inode = pkernfs_inode_get(sb, 1); if (!inode) return -ENOMEM; - inode->i_ino = 1; inode->i_mode = S_IFDIR; - inode->i_op = &simple_dir_inode_operations; - inode->i_fop = &simple_dir_operations; + inode->i_fop = &pkernfs_dir_fops; inode->i_atime = inode->i_mtime = current_time(inode); inode_set_ctime_current(inode); /* directory inodes start off with i_nlink == 2 (for "." entry) */ inc_nlink(inode); + inode_init_owner(&nop_mnt_idmap, inode, NULL, inode->i_mode); dentry = d_make_root(inode); if (!dentry) diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h index bd1e2a6fd336..192e089b3151 100644 --- a/fs/pkernfs/pkernfs.h +++ b/fs/pkernfs/pkernfs.h @@ -1,6 +1,40 @@ /* SPDX-License-Identifier: GPL-2.0-only */ +#include + #define PKERNFS_MAGIC_NUMBER 0x706b65726e6673 +#define PKERNFS_FILENAME_LEN 255 + +extern void *pkernfs_mem; + struct pkernfs_sb { unsigned long magic_number; + /* Inode number */ + unsigned long next_free_ino; }; + +// If neither of these are set the inode is not in use. +#define PKERNFS_INODE_FLAG_FILE (1 << 0) +#define PKERNFS_INODE_FLAG_DIR (1 << 1) +struct pkernfs_inode { + int flags; + /* + * Points to next inode in the same directory, or + * 0 if last file in directory. + */ + unsigned long sibling_ino; + /* + * If this inode is a directory, this points to the + * first inode *in* that directory. + */ + unsigned long child_ino; + char filename[PKERNFS_FILENAME_LEN]; + int mappings_block; + int num_mappings; +}; + +void pkernfs_initialise_inode_store(struct super_block *sb); +struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino); +struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino); + +extern const struct file_operations pkernfs_dir_fops; From patchwork Mon Feb 5 12:01:48 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545323 Received: from smtp-fw-80008.amazon.com (smtp-fw-80008.amazon.com [99.78.197.219]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4BC771BC46; Mon, 5 Feb 2024 12:03:00 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.219 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134581; cv=none; b=UGS50KRa6ZX0QVyhpyucufZJVSejsRLWvm8M7pHNl5y239WnfFWvQgAvp5GdyPMDunVHvgxQnake9gCob0RjpBhxOfWKPDZa8D41bdTpUzuMJIk4rKHgYNFksZc2ZIYdL+Qqh+46wvG5JTwyXZgTm2Hh6IhY6BnD9yCmwcnTUjk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134581; c=relaxed/simple; bh=lErGtpA70KiDfjFQXCKs90lxVYs+wLNGHzuwt7P6XCM=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Mv7crg3dWGqyJZfR4Sn2oSCFMUU71kVlC/lzIqgVUzHr3wV8emNUFCJV9fliN5Avw3XU/LZvYuqTLB3l1/HET9NYAFqohLgduhYRrukDDHQE2VMJnXcKm0PDqQFEh5ozIflOKHNHYjz/7r1TN/ZFSPq1/oVzrnpoKFGc3y8tI2M= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=i6+E2Y7u; arc=none smtp.client-ip=99.78.197.219 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="i6+E2Y7u" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134580; x=1738670580; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=09HM9kCtJQ5fhByhNyO/r0il26Ag7KNxaSp1FSa3SKw=; b=i6+E2Y7uQY4J6eLqwZJnGtNvDDFRxqGbbuND7EMksfEJDoiaiB+GMEwn qwAXylnrSmo0XnSYzJKFCKClN1aqm1eeraOqBC7AK1Ua0gALOCa7vduN5 dj8oZu+tawgjUa8qDCjZiL4w4O6jEyOVJOqKo3Eq/WZbVYkqElKtr7YjC w=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="63755246" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-80008.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:02:58 +0000 Received: from EX19MTAEUA001.ant.amazon.com [10.0.10.100:37383] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.33.186:2525] with esmtp (Farcaster) id 0e22be1f-cb32-4c25-b8b9-6a540783cffe; Mon, 5 Feb 2024 12:02:57 +0000 (UTC) X-Farcaster-Flow-ID: 0e22be1f-cb32-4c25-b8b9-6a540783cffe Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUA001.ant.amazon.com (10.252.50.192) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:02:57 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:02:50 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 03/18] pkernfs: Define an allocator for persistent pages Date: Mon, 5 Feb 2024 12:01:48 +0000 Message-ID: <20240205120203.60312-4-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D044UWB004.ant.amazon.com (10.13.139.134) To EX19D014EUC004.ant.amazon.com (10.252.51.182) This introduces the concept of a bitmap allocator for pages from the pkernfs filesystem. The allocation bitmap is stored in the second half of the first page. This imposes an artificial limit of the maximum size of the filesystem; this needs to be made extensible. The allocations can be zeroed, that's it so far. The next commit will add the ability to allocate and use it. --- fs/pkernfs/Makefile | 2 +- fs/pkernfs/allocator.c | 27 +++++++++++++++++++++++++++ fs/pkernfs/pkernfs.c | 1 + fs/pkernfs/pkernfs.h | 1 + 4 files changed, 30 insertions(+), 1 deletion(-) create mode 100644 fs/pkernfs/allocator.c diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile index 0a66e98bda07..d8b92a74fbc6 100644 --- a/fs/pkernfs/Makefile +++ b/fs/pkernfs/Makefile @@ -3,4 +3,4 @@ # Makefile for persistent kernel filesystem # -obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o dir.o +obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o diff --git a/fs/pkernfs/allocator.c b/fs/pkernfs/allocator.c new file mode 100644 index 000000000000..1d4aac9c4545 --- /dev/null +++ b/fs/pkernfs/allocator.c @@ -0,0 +1,27 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "pkernfs.h" + +/** + * For allocating blocks from the pkernfs filesystem. + * The first two blocks are special: + * - the first block is persitent filesystme metadata and + * a bitmap of allocated blocks + * - the second block is an array of persisted inodes; the + * inode store. + */ + +void *pkernfs_allocations_bitmap(struct super_block *sb) +{ + /* Allocations is 2nd half of first block */ + return pkernfs_mem + (1 << 20); +} + +void pkernfs_zero_allocations(struct super_block *sb) +{ + memset(pkernfs_allocations_bitmap(sb), 0, (1 << 20)); + /* First page is persisted super block and allocator bitmap */ + set_bit(0, pkernfs_allocations_bitmap(sb)); + /* Second page is inode store */ + set_bit(1, pkernfs_allocations_bitmap(sb)); +} diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c index 518c610e3877..199c2c648bca 100644 --- a/fs/pkernfs/pkernfs.c +++ b/fs/pkernfs/pkernfs.c @@ -25,6 +25,7 @@ static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc) } else { pr_info("pkernfs: Clean super block; initialising\n"); pkernfs_initialise_inode_store(sb); + pkernfs_zero_allocations(sb); psb->magic_number = PKERNFS_MAGIC_NUMBER; pkernfs_get_persisted_inode(sb, 1)->flags = PKERNFS_INODE_FLAG_DIR; strscpy(pkernfs_get_persisted_inode(sb, 1)->filename, ".", PKERNFS_FILENAME_LEN); diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h index 192e089b3151..4655780f31f2 100644 --- a/fs/pkernfs/pkernfs.h +++ b/fs/pkernfs/pkernfs.h @@ -34,6 +34,7 @@ struct pkernfs_inode { }; void pkernfs_initialise_inode_store(struct super_block *sb); +void pkernfs_zero_allocations(struct super_block *sb); struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino); struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino); From patchwork Mon Feb 5 12:01:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545324 Received: from smtp-fw-80008.amazon.com (smtp-fw-80008.amazon.com [99.78.197.219]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E7D321BDCB; Mon, 5 Feb 2024 12:03:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.219 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134586; cv=none; b=n16IcvSLXjK47vsrXpzxvjyFXBwMAW2fYxSyOosY8LE80GH5ANpzQP2QEyqbwPbsazWlMwa564X7nwQQKf3Z4oQL8P/CFg3SgsqkzLWwxH3WgJGorgSv5BTMbObKIQPUtAQru4jfheBD2QYP862oMt3+cmRWtkgwCbeAv0cTttw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134586; c=relaxed/simple; bh=GdB6alF/k2cwvfHA4lSrVP44W/b6Zl/qxhtKU2oOfIg=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=b2+b4kYm44q/uncgm6QOG0skh58IckRaRhPVPK3S/pCljhBjny91igl/z1mmis3M42uoKAf/g5zLICu5HDLA+875LXUW2y7kqywArvbP1MnM/ipwieRB1kRRahT8QY4EUYqiDJMDpkbIW0NMzIxnshdYssG/E0daqDe2EXY66Mw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=tRoTa+6N; arc=none smtp.client-ip=99.78.197.219 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="tRoTa+6N" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134585; x=1738670585; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Am0Sels7NAPtGwBi87Q194PvOn79+zUBtD9YBYxO8Vc=; b=tRoTa+6NKF1TTmvU+S524n3sBkweAW2lRFo7R2Nd1951ePXPtp5iKLkI M0jhxHM66+/G9zFDQb8jbGOHRnw4BC7qaB/dAmVt2ochRIriJ/P5l8dq+ h3vFgYBLenYFvBsUtg3gK+cEoMb5iUIcJO3Mto/Yu3XFaTNU3eegf9qFL c=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="63755258" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-80008.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:03:04 +0000 Received: from EX19MTAEUA002.ant.amazon.com [10.0.43.254:35084] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.28.192:2525] with esmtp (Farcaster) id a0419078-eb19-4b6b-acd7-2633051ba0ca; Mon, 5 Feb 2024 12:03:03 +0000 (UTC) X-Farcaster-Flow-ID: a0419078-eb19-4b6b-acd7-2633051ba0ca Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUA002.ant.amazon.com (10.252.50.124) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:03 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:02:57 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 04/18] pkernfs: support file truncation Date: Mon, 5 Feb 2024 12:01:49 +0000 Message-ID: <20240205120203.60312-5-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D044UWB004.ant.amazon.com (10.13.139.134) To EX19D014EUC004.ant.amazon.com (10.252.51.182) In the previous commit a block allocator was added. Now use that block allocator to allocate blocks for files when ftruncate is run on them. To do that a inode_operations is added on the file inodes with a getattr callback handling the ATTR_SIZE attribute. When this is invoked pages are allocated, the indexes of which are put into a mappings block. The mappings block is an array with the index being the file offset block and the value at that index being the pkernfs block backign that file offset. --- fs/pkernfs/Makefile | 2 +- fs/pkernfs/allocator.c | 24 +++++++++++++++++++ fs/pkernfs/file.c | 53 ++++++++++++++++++++++++++++++++++++++++++ fs/pkernfs/inode.c | 27 ++++++++++++++++++--- fs/pkernfs/pkernfs.h | 7 ++++++ 5 files changed, 109 insertions(+), 4 deletions(-) create mode 100644 fs/pkernfs/file.c diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile index d8b92a74fbc6..e41f06cc490f 100644 --- a/fs/pkernfs/Makefile +++ b/fs/pkernfs/Makefile @@ -3,4 +3,4 @@ # Makefile for persistent kernel filesystem # -obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o +obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o file.o diff --git a/fs/pkernfs/allocator.c b/fs/pkernfs/allocator.c index 1d4aac9c4545..3905ce92b4a9 100644 --- a/fs/pkernfs/allocator.c +++ b/fs/pkernfs/allocator.c @@ -25,3 +25,27 @@ void pkernfs_zero_allocations(struct super_block *sb) /* Second page is inode store */ set_bit(1, pkernfs_allocations_bitmap(sb)); } + +/* + * Allocs one 2 MiB block, and returns the block index. + * Index is 2 MiB chunk index. + */ +unsigned long pkernfs_alloc_block(struct super_block *sb) +{ + unsigned long free_bit; + + /* Allocations is 2nd half of first page */ + void *allocations_mem = pkernfs_allocations_bitmap(sb); + free_bit = bitmap_find_next_zero_area(allocations_mem, + PMD_SIZE / 2, /* Size */ + 0, /* Start */ + 1, /* Number of zeroed bits to look for */ + 0); /* Alignment mask - none required. */ + bitmap_set(allocations_mem, free_bit, 1); + return free_bit; +} + +void *pkernfs_addr_for_block(struct super_block *sb, int block_idx) +{ + return pkernfs_mem + (block_idx * PMD_SIZE); +} diff --git a/fs/pkernfs/file.c b/fs/pkernfs/file.c new file mode 100644 index 000000000000..27a637423178 --- /dev/null +++ b/fs/pkernfs/file.c @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "pkernfs.h" + +static int truncate(struct inode *inode, loff_t newsize) +{ + unsigned long free_block; + struct pkernfs_inode *pkernfs_inode; + unsigned long *mappings; + + pkernfs_inode = pkernfs_get_persisted_inode(inode->i_sb, inode->i_ino); + mappings = (unsigned long *)pkernfs_addr_for_block(inode->i_sb, + pkernfs_inode->mappings_block); + i_size_write(inode, newsize); + for (int block_idx = 0; block_idx * PMD_SIZE < newsize; ++block_idx) { + free_block = pkernfs_alloc_block(inode->i_sb); + if (free_block <= 0) + /* TODO: roll back allocations. */ + return -ENOMEM; + *(mappings + block_idx) = free_block; + ++pkernfs_inode->num_mappings; + } + return 0; +} + +static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct iattr *iattr) +{ + struct inode *inode = dentry->d_inode; + int error; + + error = setattr_prepare(idmap, dentry, iattr); + if (error) + return error; + + if (iattr->ia_valid & ATTR_SIZE) { + error = truncate(inode, iattr->ia_size); + if (error) + return error; + } + setattr_copy(idmap, inode, iattr); + mark_inode_dirty(inode); + return 0; +} + +const struct inode_operations pkernfs_file_inode_operations = { + .setattr = inode_setattr, + .getattr = simple_getattr, +}; + +const struct file_operations pkernfs_file_fops = { + .owner = THIS_MODULE, + .iterate_shared = NULL, +}; diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c index f6584c8b8804..7fe4e7b220cc 100644 --- a/fs/pkernfs/inode.c +++ b/fs/pkernfs/inode.c @@ -15,14 +15,28 @@ struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int in struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino) { + struct pkernfs_inode *pkernfs_inode; struct inode *inode = iget_locked(sb, ino); /* If this inode is cached it is already populated; just return */ if (!(inode->i_state & I_NEW)) return inode; - inode->i_op = &pkernfs_dir_inode_operations; + pkernfs_inode = pkernfs_get_persisted_inode(sb, ino); inode->i_sb = sb; - inode->i_mode = S_IFREG; + if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_DIR) { + inode->i_op = &pkernfs_dir_inode_operations; + inode->i_mode = S_IFDIR; + } else { + inode->i_op = &pkernfs_file_inode_operations; + inode->i_mode = S_IFREG; + inode->i_fop = &pkernfs_file_fops; + } + + inode->i_atime = inode->i_mtime = current_time(inode); + inode_set_ctime_current(inode); + set_nlink(inode, 1); + + /* Switch based on file type */ unlock_new_inode(inode); return inode; } @@ -79,6 +93,8 @@ static int pkernfs_create(struct mnt_idmap *id, struct inode *dir, pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = free_inode; strscpy(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN); pkernfs_inode->flags = PKERNFS_INODE_FLAG_FILE; + pkernfs_inode->mappings_block = pkernfs_alloc_block(dir->i_sb); + memset(pkernfs_addr_for_block(dir->i_sb, pkernfs_inode->mappings_block), 0, (2 << 20)); vfs_inode = pkernfs_inode_get(dir->i_sb, free_inode); d_instantiate(dentry, vfs_inode); @@ -90,6 +106,7 @@ static struct dentry *pkernfs_lookup(struct inode *dir, unsigned int flags) { struct pkernfs_inode *pkernfs_inode; + struct inode *vfs_inode; unsigned long ino; pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino); @@ -97,7 +114,10 @@ static struct dentry *pkernfs_lookup(struct inode *dir, while (ino) { pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, ino); if (!strncmp(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN)) { - d_add(dentry, pkernfs_inode_get(dir->i_sb, ino)); + vfs_inode = pkernfs_inode_get(dir->i_sb, ino); + mark_inode_dirty(dir); + dir->i_atime = current_time(dir); + d_add(dentry, vfs_inode); break; } ino = pkernfs_inode->sibling_ino; @@ -146,3 +166,4 @@ const struct inode_operations pkernfs_dir_inode_operations = { .lookup = pkernfs_lookup, .unlink = pkernfs_unlink, }; + diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h index 4655780f31f2..8b4fee8c5b2e 100644 --- a/fs/pkernfs/pkernfs.h +++ b/fs/pkernfs/pkernfs.h @@ -34,8 +34,15 @@ struct pkernfs_inode { }; void pkernfs_initialise_inode_store(struct super_block *sb); + void pkernfs_zero_allocations(struct super_block *sb); +unsigned long pkernfs_alloc_block(struct super_block *sb); struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino); +void *pkernfs_addr_for_block(struct super_block *sb, int block_idx); + struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino); + extern const struct file_operations pkernfs_dir_fops; +extern const struct file_operations pkernfs_file_fops; +extern const struct inode_operations pkernfs_file_inode_operations; From patchwork Mon Feb 5 12:01:50 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545325 Received: from smtp-fw-52005.amazon.com (smtp-fw-52005.amazon.com [52.119.213.156]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4225F1BF47; Mon, 5 Feb 2024 12:03:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.156 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134595; cv=none; b=ayQzVLTr9ZfuqcEx6ngTRe1vg8OYjIKDD+qM4xCtWVEPB5GGS4qlqvAuablf/8/keoNDCLoYcZtPit6GLvQttKAE7NU560vnhB6npThdG7YjORgPU1uTloiQZto2YsDBsgWdfu3+2zVTYjU5D6yLRzkKaQytItuLkSovjAyfm4I= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134595; c=relaxed/simple; bh=X0DVeAhkLd+qmZjH5rmDRSG3fWO+VNwd2pVbOIDFh9k=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=gvdoX61FDn4IdJ5Xkw3OKznvZiI6C34TzEMV9nvxsdv8p2dvnlGSih61vm9pZ4NAOFd33N+zuo/Z10QTIwfWhT6VUc7wg89H2XpcuhTnVil3xtTLsMg4GK4+o+AU37+wUXfwdpHr5cc4owyIZY0SHpNG7apMF5ayeNV+bDohor4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=Bo62Hq4w; arc=none smtp.client-ip=52.119.213.156 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="Bo62Hq4w" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134594; x=1738670594; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=0/oASfY5H1JyciZlNLWYHif+Zbb5czst+5Ds70M97DI=; b=Bo62Hq4wZkiJjgZ4E8Art4Q5jS9hXwFz/gPSBAwIZ/MeOo9pcaoKWuCH OxyO1VwyVw+B4tTI0M7vt3288m2mDKFdDtq56Wc3Eh3GaorD92eDgQ2vV BU8RTrl0enJ4PYciaw55awgS/kEHlLTXXPMfWqugIF6gTWf+ytZap3iuJ g=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="632119672" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-52005.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:03:11 +0000 Received: from EX19MTAEUB001.ant.amazon.com [10.0.17.79:5859] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.31.207:2525] with esmtp (Farcaster) id 80b5446f-c47d-4e7f-ae7e-e8668aadf81b; Mon, 5 Feb 2024 12:03:10 +0000 (UTC) X-Farcaster-Flow-ID: 80b5446f-c47d-4e7f-ae7e-e8668aadf81b Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUB001.ant.amazon.com (10.252.51.26) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:10 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:03 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 05/18] pkernfs: add file mmap callback Date: Mon, 5 Feb 2024 12:01:50 +0000 Message-ID: <20240205120203.60312-6-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D044UWB004.ant.amazon.com (10.13.139.134) To EX19D014EUC004.ant.amazon.com (10.252.51.182) Make the file data useable to userspace by adding mmap. That's all that QEMU needs for guest RAM, so that's all be bother implementing for now. When mmaping the file the VMA is marked as PFNMAP to indicate that there are no struct pages for the memory in this VMA. Remap_pfn_range() is used to actually populate the page tables. All PTEs are pre-faulted into the pgtables at mmap time so that the pgtables are useable when this virtual address range is given to VFIO's MAP_DMA. --- fs/pkernfs/file.c | 42 +++++++++++++++++++++++++++++++++++++++++- fs/pkernfs/pkernfs.c | 2 +- fs/pkernfs/pkernfs.h | 2 ++ 3 files changed, 44 insertions(+), 2 deletions(-) diff --git a/fs/pkernfs/file.c b/fs/pkernfs/file.c index 27a637423178..844b6cc63840 100644 --- a/fs/pkernfs/file.c +++ b/fs/pkernfs/file.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0-only #include "pkernfs.h" +#include static int truncate(struct inode *inode, loff_t newsize) { @@ -42,6 +43,45 @@ static int inode_setattr(struct mnt_idmap *idmap, struct dentry *dentry, struct return 0; } +/* + * To be able to use PFNMAP VMAs for VFIO DMA mapping we need the page tables + * populated with mappings. Pre-fault everything. + */ +static int mmap(struct file *filp, struct vm_area_struct *vma) +{ + int rc; + unsigned long *mappings_block; + struct pkernfs_inode *pkernfs_inode; + + pkernfs_inode = pkernfs_get_persisted_inode(filp->f_inode->i_sb, filp->f_inode->i_ino); + + mappings_block = (unsigned long *)pkernfs_addr_for_block(filp->f_inode->i_sb, + pkernfs_inode->mappings_block); + + /* Remap-pfn-range will mark the range VM_IO */ + for (unsigned long vma_addr_offset = vma->vm_start; + vma_addr_offset < vma->vm_end; + vma_addr_offset += PMD_SIZE) { + int block, mapped_block; + + block = (vma_addr_offset - vma->vm_start) / PMD_SIZE; + mapped_block = *(mappings_block + block); + /* + * It's wrong to use rempa_pfn_range; this will install PTE-level entries. + * The whole point of 2 MiB allocs is to improve TLB perf! + * We should use something like mm/huge_memory.c#insert_pfn_pmd + * but that is currently static. + * TODO: figure out the best way to install PMDs. + */ + rc = remap_pfn_range(vma, + vma_addr_offset, + (pkernfs_base >> PAGE_SHIFT) + (mapped_block * 512), + PMD_SIZE, + vma->vm_page_prot); + } + return 0; +} + const struct inode_operations pkernfs_file_inode_operations = { .setattr = inode_setattr, .getattr = simple_getattr, @@ -49,5 +89,5 @@ const struct inode_operations pkernfs_file_inode_operations = { const struct file_operations pkernfs_file_fops = { .owner = THIS_MODULE, - .iterate_shared = NULL, + .mmap = mmap, }; diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c index 199c2c648bca..f010c2d76c76 100644 --- a/fs/pkernfs/pkernfs.c +++ b/fs/pkernfs/pkernfs.c @@ -7,7 +7,7 @@ #include #include -static phys_addr_t pkernfs_base, pkernfs_size; +phys_addr_t pkernfs_base, pkernfs_size; void *pkernfs_mem; static const struct super_operations pkernfs_super_ops = { }; diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h index 8b4fee8c5b2e..1a7aa783a9be 100644 --- a/fs/pkernfs/pkernfs.h +++ b/fs/pkernfs/pkernfs.h @@ -6,6 +6,8 @@ #define PKERNFS_FILENAME_LEN 255 extern void *pkernfs_mem; +/* Units of bytes */ +extern phys_addr_t pkernfs_base, pkernfs_size; struct pkernfs_sb { unsigned long magic_number; From patchwork Mon Feb 5 12:01:51 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545326 Received: from smtp-fw-9106.amazon.com (smtp-fw-9106.amazon.com [207.171.188.206]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 468BD1B80F; Mon, 5 Feb 2024 12:03:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=207.171.188.206 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134626; cv=none; b=Z6n1QEHQXpQ9dBaQHl8UfHgkOPaQMANm3/dGOEpILfMmKHSlRU2fj73/PGHxwAKyHDXiu8OLiY1JxVX8sHFP1iI2qotLeR3yFwCj3vQTN25YuTVs4TuEXaVwpQuyqg6ksOUtlJgCNNf8dtthJcInKoviObyGtg9Nd0+4weX9VJo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134626; c=relaxed/simple; bh=vP7xqQVckvYXPZog0UglD7fg0Qi7KmKUf6ygoE4dg1o=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=n9TQJb1gJJpw68htqAJ4g5fFJb9negbe3kIsW24w6g3D8ozBKRuUZJ5VTEQ60SyntvpKE+ZhNy1CTZIvsKKc/fd69RQTtDHPU6XVl2R3mNsR5FiciPMRXvctBgD4M7wUB1M6ZDGayyzyEDg9+MoGZcv601iyaztHOJzDZ9vdiX0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=kcEW2jEB; arc=none smtp.client-ip=207.171.188.206 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="kcEW2jEB" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134626; x=1738670626; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=GuAtgRmAwOIaihgufqnss82d+hBkVwofjU/pF2m60/o=; b=kcEW2jEB1SevwyiGV9tiPNgSpSSB4Eb54eOLEyLccIv4DAiivPzXAGdS 9wOFjIVdVzVLx6YG1T11NREeoN35/V7msL4/WDiRJ5SeQMU/NAEWMVTsQ SeU/vkmJ4JlP2YuDU6wH0Azevl/s1u72Sf980FYpMlpeIR/3y0hulM5zX 0=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="702146151" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-9106.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:03:45 +0000 Received: from EX19MTAEUB002.ant.amazon.com [10.0.43.254:55484] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.32.190:2525] with esmtp (Farcaster) id 54deb5d5-b17f-4ae2-b6be-2dc346f855b1; Mon, 5 Feb 2024 12:03:43 +0000 (UTC) X-Farcaster-Flow-ID: 54deb5d5-b17f-4ae2-b6be-2dc346f855b1 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUB002.ant.amazon.com (10.252.51.59) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:40 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:33 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 06/18] init: Add liveupdate cmdline param Date: Mon, 5 Feb 2024 12:01:51 +0000 Message-ID: <20240205120203.60312-7-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D033UWC001.ant.amazon.com (10.13.139.218) To EX19D014EUC004.ant.amazon.com (10.252.51.182) This will allow other subsystems to know when we're going a LU and hence when they should be restoring rather than reinitialising state. --- include/linux/init.h | 1 + init/main.c | 10 ++++++++++ 2 files changed, 11 insertions(+) diff --git a/include/linux/init.h b/include/linux/init.h index 266c3e1640d4..d7c68c7bfaf0 100644 --- a/include/linux/init.h +++ b/include/linux/init.h @@ -146,6 +146,7 @@ extern int do_one_initcall(initcall_t fn); extern char __initdata boot_command_line[]; extern char *saved_command_line; extern unsigned int saved_command_line_len; +extern bool liveupdate; extern unsigned int reset_devices; /* used by init/main.c */ diff --git a/init/main.c b/init/main.c index e24b0780fdff..7807a56c3473 100644 --- a/init/main.c +++ b/init/main.c @@ -165,6 +165,16 @@ static char *ramdisk_execute_command = "/init"; bool static_key_initialized __read_mostly; EXPORT_SYMBOL_GPL(static_key_initialized); +bool liveupdate __read_mostly; +EXPORT_SYMBOL(liveupdate); + +static int __init set_liveupdate(char *param) +{ + liveupdate = true; + return 0; +} +early_param("liveupdate", set_liveupdate); + /* * If set, this is an indication to the drivers that reset the underlying * device before going ahead with the initialization otherwise driver might From patchwork Mon Feb 5 12:01:52 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545328 Received: from smtp-fw-9102.amazon.com (smtp-fw-9102.amazon.com [207.171.184.29]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2AE291C68D; Mon, 5 Feb 2024 12:03:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=207.171.184.29 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134639; cv=none; b=NVgyJDGkp3qdvbci56kmhgp3kQQTNmE1SlYeUhPQ6An0nw1OiJUfajyI+LuuWRs4seyHPOT1hPaK4cku+FoYyfjvfx0g1ekI9xpty28UsZvrdo/npkDUnaHE1h7jSTIhUYKOFdJKpO0N9ue9BLeLIy7RwjpsQKhS2IIH8HwhEZw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134639; c=relaxed/simple; bh=KJj1B5pipHCNG5+0xARVhv9lBaZP4+Uag/y5HQlG81E=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Ku3DHDTq6tsE9Me3BLFiCXMiicfTsn/bbbVh372Zv7aeiPgMa9T2ukqZSUk0HSIPaWF1wyzEchBWZn4+m1V7HRs8kqQjSxAlp9CX2ZCpRLpLyOGqbCBUn982+rTVmsAOrSQP4/rk9DgUYecEcoO+dqaVMQR9YQgSzYR0YLMtm9c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=oVzsA6nA; arc=none smtp.client-ip=207.171.184.29 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="oVzsA6nA" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134638; x=1738670638; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=TzDhoCjRqYjwB6Ud3BxBgq+3tlmLodP3ZHF9TRx+87E=; b=oVzsA6nAcfsX2N9xvk+a6q/VX6J3iUMxPcWj8ZReekVCI1QcTwQpPgY1 7pufPRraCq/vpoLSpm9GWvVjHX/gwtpQk3b260sTs+RCcqKuFfnY5SWJX KPF0eYZXHHULCJq1p4kiXiEsmaqrWgvxagCOCHboX6GAzuQ7jC2dxTgOG Q=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="394883262" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-9102.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:03:52 +0000 Received: from EX19MTAEUC002.ant.amazon.com [10.0.17.79:26775] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.8.155:2525] with esmtp (Farcaster) id 37d56f4d-adef-4cf1-bb52-948bac8bacd3; Mon, 5 Feb 2024 12:03:50 +0000 (UTC) X-Farcaster-Flow-ID: 37d56f4d-adef-4cf1-bb52-948bac8bacd3 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUC002.ant.amazon.com (10.252.51.181) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:46 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:40 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 07/18] pkernfs: Add file type for IOMMU root pgtables Date: Mon, 5 Feb 2024 12:01:52 +0000 Message-ID: <20240205120203.60312-8-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D033UWC001.ant.amazon.com (10.13.139.218) To EX19D014EUC004.ant.amazon.com (10.252.51.182) So far pkernfs is able to hold regular files for userspace to mmap and in which store persisted data. Now begin the IOMMU integration for persistent IOMMU pgtables. A new type of inode is created for an IOMMU data directory. A new type of inode is also created for a file which holds the IOMMU root pgtables. The inode types are specified by flags on the inodes. Different inode ops are also registed on the IOMMU pgtables file to ensure that userspace can't access it. These IOMMU directory and data inodes are created lazily: pkernfs_alloc_iommu_root_pgtables() scans for these and returns them if they already exist (ie: after kexec) or creates them if they don't exist (ie: cold boot). The data in the IOMMU root pgtables file needs to be accessible early in system boot: before filesystems are initialised and before anything is mounted. To support this the pkernfs initialisation code is split out into an pkernfs_init() function which is responsible for making the pkernfs memory available. Here the filesystem abstraction starts to creak: the pkernfs functions responsible for the IOMMU pgtables files manipulated persisted inodes directly. It may be preferable to somehow get pkernfs mounted early in system boot before it's needed by the IOMMU so that filesystem paths can be used, but it is unclear if that's possible. The need for super blocks in the pkernfs functions has been limited so far, super blocks are barely used because the pkernfs extents are stored as global variables in pkernfs.c. Now NULLs are actually supplied to functions which take a super block. This is also not pretty and this code should probably rather be plumbing some sort of wrapper around the persisted super block which would allow supporting multiple mount moints. Additionally, the memory backing the IOMMU root pgtable file is mapped into the direct map by registering it as a device. This is needed because the IOMMU does phys_to_virt in a few places when traversing the pgtables so the direct map virtual address should be populated. The alternative would be to replace all of the phy_to_virt calls in the IOMMU driver with wrappers which understand if the phys_addr is part of a pkernfs file. The next commit will use this pkernfs file for root pgtables. --- fs/pkernfs/Makefile | 2 +- fs/pkernfs/inode.c | 17 +++++-- fs/pkernfs/iommu.c | 98 +++++++++++++++++++++++++++++++++++++++++ fs/pkernfs/pkernfs.c | 38 ++++++++++------ fs/pkernfs/pkernfs.h | 7 +++ include/linux/pkernfs.h | 36 +++++++++++++++ 6 files changed, 181 insertions(+), 17 deletions(-) create mode 100644 fs/pkernfs/iommu.c create mode 100644 include/linux/pkernfs.h diff --git a/fs/pkernfs/Makefile b/fs/pkernfs/Makefile index e41f06cc490f..7f0f7a4cd3a1 100644 --- a/fs/pkernfs/Makefile +++ b/fs/pkernfs/Makefile @@ -3,4 +3,4 @@ # Makefile for persistent kernel filesystem # -obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o file.o +obj-$(CONFIG_PKERNFS_FS) += pkernfs.o inode.o allocator.o dir.o file.o iommu.o diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c index 7fe4e7b220cc..1d712e0a82a1 100644 --- a/fs/pkernfs/inode.c +++ b/fs/pkernfs/inode.c @@ -25,11 +25,18 @@ struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino) inode->i_sb = sb; if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_DIR) { inode->i_op = &pkernfs_dir_inode_operations; + inode->i_fop = &pkernfs_dir_fops; inode->i_mode = S_IFDIR; - } else { + } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_FILE) { inode->i_op = &pkernfs_file_inode_operations; - inode->i_mode = S_IFREG; inode->i_fop = &pkernfs_file_fops; + inode->i_mode = S_IFREG; + } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_IOMMU_DIR) { + inode->i_op = &pkernfs_iommu_dir_inode_operations; + inode->i_fop = &pkernfs_dir_fops; + inode->i_mode = S_IFDIR; + } else if (pkernfs_inode->flags | PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES) { + inode->i_mode = S_IFREG; } inode->i_atime = inode->i_mtime = current_time(inode); @@ -41,7 +48,7 @@ struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino) return inode; } -static unsigned long pkernfs_allocate_inode(struct super_block *sb) +unsigned long pkernfs_allocate_inode(struct super_block *sb) { unsigned long next_free_ino; @@ -167,3 +174,7 @@ const struct inode_operations pkernfs_dir_inode_operations = { .unlink = pkernfs_unlink, }; +const struct inode_operations pkernfs_iommu_dir_inode_operations = { + .lookup = pkernfs_lookup, +}; + diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c new file mode 100644 index 000000000000..5bce8146d7bb --- /dev/null +++ b/fs/pkernfs/iommu.c @@ -0,0 +1,98 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "pkernfs.h" +#include + + +void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region) +{ + unsigned long *mappings_block_vaddr; + unsigned long inode_idx; + struct pkernfs_inode *iommu_pgtables, *iommu_dir = NULL; + int rc; + + pkernfs_init(); + + /* Try find a 'iommu' directory */ + inode_idx = pkernfs_get_persisted_inode(NULL, 1)->child_ino; + while (inode_idx) { + if (!strncmp(pkernfs_get_persisted_inode(NULL, inode_idx)->filename, + "iommu", PKERNFS_FILENAME_LEN)) { + iommu_dir = pkernfs_get_persisted_inode(NULL, inode_idx); + break; + } + inode_idx = pkernfs_get_persisted_inode(NULL, inode_idx)->sibling_ino; + } + + if (!iommu_dir) { + unsigned long root_pgtables_ino = 0; + unsigned long iommu_dir_ino = pkernfs_allocate_inode(NULL); + + iommu_dir = pkernfs_get_persisted_inode(NULL, iommu_dir_ino); + strscpy(iommu_dir->filename, "iommu", PKERNFS_FILENAME_LEN); + iommu_dir->flags = PKERNFS_INODE_FLAG_IOMMU_DIR; + + /* Make this the head of the list. */ + iommu_dir->sibling_ino = pkernfs_get_persisted_inode(NULL, 1)->child_ino; + pkernfs_get_persisted_inode(NULL, 1)->child_ino = iommu_dir_ino; + + /* Add a child file for pgtables. */ + root_pgtables_ino = pkernfs_allocate_inode(NULL); + iommu_pgtables = pkernfs_get_persisted_inode(NULL, root_pgtables_ino); + strscpy(iommu_pgtables->filename, "root-pgtables", PKERNFS_FILENAME_LEN); + iommu_pgtables->sibling_ino = iommu_dir->child_ino; + iommu_dir->child_ino = root_pgtables_ino; + iommu_pgtables->flags = PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES; + iommu_pgtables->mappings_block = pkernfs_alloc_block(NULL); + /* TODO: make alloc zero. */ + memset(pkernfs_addr_for_block(NULL, iommu_pgtables->mappings_block), 0, (2 << 20)); + } else { + inode_idx = iommu_dir->child_ino; + while (inode_idx) { + if (!strncmp(pkernfs_get_persisted_inode(NULL, inode_idx)->filename, + "root-pgtables", PKERNFS_FILENAME_LEN)) { + iommu_pgtables = pkernfs_get_persisted_inode(NULL, inode_idx); + break; + } + inode_idx = pkernfs_get_persisted_inode(NULL, inode_idx)->sibling_ino; + } + } + + /* + * For a pkernfs region block, the "mappings_block" field is still + * just a block index, but that block doesn't actually contain mappings + * it contains the pkernfs_region data + */ + + mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL, + iommu_pgtables->mappings_block); + set_bit(0, mappings_block_vaddr); + pkernfs_region->vaddr = mappings_block_vaddr; + pkernfs_region->paddr = pkernfs_base + (iommu_pgtables->mappings_block * PMD_SIZE); + pkernfs_region->bytes = PMD_SIZE; + + dev_set_name(&pkernfs_region->dev, "iommu_root_pgtables"); + rc = device_register(&pkernfs_region->dev); + if (rc) + pr_err("device_register failed: %i\n", rc); + + pkernfs_region->pgmap.range.start = pkernfs_base + + (iommu_pgtables->mappings_block * PMD_SIZE); + pkernfs_region->pgmap.range.end = + pkernfs_region->pgmap.range.start + PMD_SIZE - 1; + pkernfs_region->pgmap.nr_range = 1; + pkernfs_region->pgmap.type = MEMORY_DEVICE_GENERIC; + pkernfs_region->vaddr = + devm_memremap_pages(&pkernfs_region->dev, &pkernfs_region->pgmap); + pkernfs_region->paddr = pkernfs_base + + (iommu_pgtables->mappings_block * PMD_SIZE); +} + +void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr) +{ + if (WARN_ON(paddr >= region->paddr + region->bytes)) + return NULL; + if (WARN_ON(paddr < region->paddr)) + return NULL; + return region->vaddr + (paddr - region->paddr); +} diff --git a/fs/pkernfs/pkernfs.c b/fs/pkernfs/pkernfs.c index f010c2d76c76..2e8c4b0a5807 100644 --- a/fs/pkernfs/pkernfs.c +++ b/fs/pkernfs/pkernfs.c @@ -11,12 +11,14 @@ phys_addr_t pkernfs_base, pkernfs_size; void *pkernfs_mem; static const struct super_operations pkernfs_super_ops = { }; -static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc) +void pkernfs_init(void) { - struct inode *inode; - struct dentry *dentry; + static int inited; struct pkernfs_sb *psb; + if (inited++) + return; + pkernfs_mem = memremap(pkernfs_base, pkernfs_size, MEMREMAP_WB); psb = (struct pkernfs_sb *) pkernfs_mem; @@ -24,13 +26,21 @@ static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc) pr_info("pkernfs: Restoring from super block\n"); } else { pr_info("pkernfs: Clean super block; initialising\n"); - pkernfs_initialise_inode_store(sb); - pkernfs_zero_allocations(sb); + pkernfs_initialise_inode_store(NULL); + pkernfs_zero_allocations(NULL); psb->magic_number = PKERNFS_MAGIC_NUMBER; - pkernfs_get_persisted_inode(sb, 1)->flags = PKERNFS_INODE_FLAG_DIR; - strscpy(pkernfs_get_persisted_inode(sb, 1)->filename, ".", PKERNFS_FILENAME_LEN); + pkernfs_get_persisted_inode(NULL, 1)->flags = PKERNFS_INODE_FLAG_DIR; + strscpy(pkernfs_get_persisted_inode(NULL, 1)->filename, ".", PKERNFS_FILENAME_LEN); psb->next_free_ino = 2; } +} + +static int pkernfs_fill_super(struct super_block *sb, struct fs_context *fc) +{ + struct inode *inode; + struct dentry *dentry; + + pkernfs_init(); sb->s_op = &pkernfs_super_ops; @@ -77,12 +87,9 @@ static struct file_system_type pkernfs_fs_type = { .fs_flags = FS_USERNS_MOUNT, }; -static int __init pkernfs_init(void) +static int __init pkernfs_fs_init(void) { - int ret; - - ret = register_filesystem(&pkernfs_fs_type); - return ret; + return register_filesystem(&pkernfs_fs_type); } /** @@ -97,7 +104,12 @@ static int __init parse_pkernfs_extents(char *p) return 0; } +bool pkernfs_enabled(void) +{ + return !!pkernfs_base; +} + early_param("pkernfs", parse_pkernfs_extents); MODULE_ALIAS_FS("pkernfs"); -module_init(pkernfs_init); +module_init(pkernfs_fs_init); diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h index 1a7aa783a9be..e1b7ae3fe7f1 100644 --- a/fs/pkernfs/pkernfs.h +++ b/fs/pkernfs/pkernfs.h @@ -1,6 +1,7 @@ /* SPDX-License-Identifier: GPL-2.0-only */ #include +#include #define PKERNFS_MAGIC_NUMBER 0x706b65726e6673 #define PKERNFS_FILENAME_LEN 255 @@ -18,6 +19,8 @@ struct pkernfs_sb { // If neither of these are set the inode is not in use. #define PKERNFS_INODE_FLAG_FILE (1 << 0) #define PKERNFS_INODE_FLAG_DIR (1 << 1) +#define PKERNFS_INODE_FLAG_IOMMU_DIR (1 << 2) +#define PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES (1 << 3) struct pkernfs_inode { int flags; /* @@ -31,20 +34,24 @@ struct pkernfs_inode { */ unsigned long child_ino; char filename[PKERNFS_FILENAME_LEN]; + /* Block index for where the mappings live. */ int mappings_block; int num_mappings; }; void pkernfs_initialise_inode_store(struct super_block *sb); +void pkernfs_init(void); void pkernfs_zero_allocations(struct super_block *sb); unsigned long pkernfs_alloc_block(struct super_block *sb); struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino); void *pkernfs_addr_for_block(struct super_block *sb, int block_idx); +unsigned long pkernfs_allocate_inode(struct super_block *sb); struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino); extern const struct file_operations pkernfs_dir_fops; extern const struct file_operations pkernfs_file_fops; extern const struct inode_operations pkernfs_file_inode_operations; +extern const struct inode_operations pkernfs_iommu_dir_inode_operations; diff --git a/include/linux/pkernfs.h b/include/linux/pkernfs.h new file mode 100644 index 000000000000..0110e4784109 --- /dev/null +++ b/include/linux/pkernfs.h @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: MIT */ + +#ifndef _LINUX_PKERNFS_H +#define _LINUX_PKERNFS_H + +#include +#include + +#ifdef CONFIG_PKERNFS_FS +extern bool pkernfs_enabled(void); +#else +static inline bool pkernfs_enabled(void) +{ + return false; +} +#endif + +/* + * This is a light wrapper around the data behind a pkernfs + * file. Really it should be a file but the filesystem comes + * up too late: IOMMU needs root pgtables before fs is up. + */ +struct pkernfs_region { + void *vaddr; + unsigned long paddr; + unsigned long bytes; + struct dev_pagemap pgmap; + struct device dev; +}; + +void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region); +void pkernfs_alloc_page_from_region(struct pkernfs_region *pkernfs_region, + void **vaddr, unsigned long *paddr); +void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr); + +#endif /* _LINUX_PKERNFS_H */ From patchwork Mon Feb 5 12:01:53 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545327 Received: from smtp-fw-80009.amazon.com (smtp-fw-80009.amazon.com [99.78.197.220]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AC28D1C686; Mon, 5 Feb 2024 12:03:56 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.220 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134638; cv=none; b=qYAVKrOnyIG7xicsCDxV3ckmQfZ1w/Uor2B2GKDMINbzXDCNgZIhk1KdmJI1SecoWgF3p0gHg3Q8yCLID+RSmD9DkwbEruq75np1yEcmA0anutU0DU1jgQXaQk0aQ0wCJ+450tiREjp3jMwgoEme77PH3EnDAD4DE69PlIfBFX4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134638; c=relaxed/simple; bh=HphmXgXmCKa5mqdnPL9wBxqcBYTxkbzFB0ny6/qkCB4=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=HZdlIRaBysRkb9SkBJxIXoeqsdwmtPqTHHEREieG2RT6z/jlGM+AknRvsEXMkUDW/hOPGsk2pU64TbJExN4wfmSSpI33qnLtfvZlqNuCe4Xm0On/FIR6SmC14l3FQ6roh1RIp5W8E419GPYkaaUgb9CUOLPS5Y5ngpFsyO4pR4E= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=JSkDIGyr; arc=none smtp.client-ip=99.78.197.220 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="JSkDIGyr" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134637; x=1738670637; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=iY0J0IZ5qTqLG1CI5oIteOfffrC6hlfw7IE0MCx8M3k=; b=JSkDIGyrOfNkdmuDUcA1wlHoiTEd4BH8dtO97qK0NJ2Pte20i/a0OOCm C0j4dmJ+xT+dqfxYjkzyAaHx0JvaB1O29vk4fcn3dFp+/HSh1BHdFnrJJ 7nrvM4D8CHbVq1DsiDeLNIbnc82oZ+jmY+bgefcK6RMaqaeUsJlwrjaNP Y=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="63724432" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-80009.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:03:55 +0000 Received: from EX19MTAEUC001.ant.amazon.com [10.0.10.100:51867] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.28.192:2525] with esmtp (Farcaster) id 0b48539d-2334-4d72-bccb-1f938dcbdb04; Mon, 5 Feb 2024 12:03:53 +0000 (UTC) X-Farcaster-Flow-ID: 0b48539d-2334-4d72-bccb-1f938dcbdb04 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUC001.ant.amazon.com (10.252.51.155) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:53 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:03:46 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 08/18] iommu: Add allocator for pgtables from persistent region Date: Mon, 5 Feb 2024 12:01:53 +0000 Message-ID: <20240205120203.60312-9-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D033UWC001.ant.amazon.com (10.13.139.218) To EX19D014EUC004.ant.amazon.com (10.252.51.182) The specific IOMMU drivers will need to ability to allocate pages from a pkernfs IOMMU pgtable file for their pgtables. Also, the IOMMU drivers will need to ability to consistent get the same page for the root PGD page - add a specific function to get this PGD "root" page. This is different to allocating regular pgtable pages because the exact same page needs to be *restored* after kexec into the pgd pointer on the IOMMU domain struct. To support this sort of allocation the pkernfs region is treated as an array of 512 4 KiB pages, the first of which is an allocation bitmap. --- drivers/iommu/Makefile | 1 + drivers/iommu/pgtable_alloc.c | 36 +++++++++++++++++++++++++++++++++++ drivers/iommu/pgtable_alloc.h | 9 +++++++++ 3 files changed, 46 insertions(+) create mode 100644 drivers/iommu/pgtable_alloc.c create mode 100644 drivers/iommu/pgtable_alloc.h diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 769e43d780ce..cadebabe9581 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -1,5 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 obj-y += amd/ intel/ arm/ iommufd/ +obj-y += pgtable_alloc.o obj-$(CONFIG_IOMMU_API) += iommu.o obj-$(CONFIG_IOMMU_API) += iommu-traces.o obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o diff --git a/drivers/iommu/pgtable_alloc.c b/drivers/iommu/pgtable_alloc.c new file mode 100644 index 000000000000..f0c2e12f8a8b --- /dev/null +++ b/drivers/iommu/pgtable_alloc.c @@ -0,0 +1,36 @@ +// SPDX-License-Identifier: GPL-2.0-only + +#include "pgtable_alloc.h" +#include + +/* + * The first 4 KiB is the bitmap - set the first bit in the bitmap. + * Scan bitmap to find next free bits - it's next free page. + */ + +void iommu_alloc_page_from_region(struct pkernfs_region *region, void **vaddr, unsigned long *paddr) +{ + int page_idx; + + page_idx = bitmap_find_free_region(region->vaddr, 512, 0); + *vaddr = region->vaddr + (page_idx << PAGE_SHIFT); + if (paddr) + *paddr = region->paddr + (page_idx << PAGE_SHIFT); +} + + +void *pgtable_get_root_page(struct pkernfs_region *region, bool liveupdate) +{ + /* + * The page immediately after the bitmap is the root page. + * It would be wrong for the page to be allocated if we're + * NOT doing a liveupdate, or for a liveupdate to happen + * with no allocated page. Detect this mismatch. + */ + if (test_bit(1, region->vaddr) ^ liveupdate) { + pr_err("%sdoing a liveupdate but root pg bit incorrect", + liveupdate ? "" : "NOT "); + } + set_bit(1, region->vaddr); + return region->vaddr + PAGE_SIZE; +} diff --git a/drivers/iommu/pgtable_alloc.h b/drivers/iommu/pgtable_alloc.h new file mode 100644 index 000000000000..c1666a7be3d3 --- /dev/null +++ b/drivers/iommu/pgtable_alloc.h @@ -0,0 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ + +#include +#include + +void iommu_alloc_page_from_region(struct pkernfs_region *region, + void **vaddr, unsigned long *paddr); + +void *pgtable_get_root_page(struct pkernfs_region *region, bool liveupdate); From patchwork Mon Feb 5 12:01:54 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545365 Received: from smtp-fw-9105.amazon.com (smtp-fw-9105.amazon.com [207.171.188.204]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 75A591AAB9; Mon, 5 Feb 2024 12:04:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=207.171.188.204 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134676; cv=none; b=fqNjPzPIn2HowPGUce4T9p+JVE5nBCQQChxYNPiqhyc+KterqzkQKmU1ZFgUSi+wTYL8Q2vy54pMm8zeF4bZLhzL89H6tvVdIOZEQYUwVygHyXApbUcMfDKWTQcRufGgIK6YAhzLsAZlVAcpYcEBG4Mcv0J2MGksry0zYBKiMkQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134676; c=relaxed/simple; bh=5Uxpgc0dI22+UGWqfon6DoCVovwe1kf0HryqXSj4tnM=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=ah2mlriwc4GPE2ctEnazWiq/jPss9WrHL6rnoYO1I44DFzSyNmP8V1WYUZFBTqUmN1aH8X4Fi+XMK+wEZFPSAu19It774Sdlcn/WoT+m/s9ZDvHZHD0IrCnecOKvYKeGowIBhr5VXhA8/v/E4LiZ8w+2qIOTu6GZauK/Ueylypw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=hgi7L/iN; arc=none smtp.client-ip=207.171.188.204 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="hgi7L/iN" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134674; x=1738670674; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=TxdyyN106bvAzVLll6ZOpOdKPjlViDjzUv1y30GOivQ=; b=hgi7L/iNQzybvsy8wLvuACoX1rTQU+9BfNz+mOnOnpCmhgHmszw0AzbN ecx4I6yCuQAQaPR1PdW9mUikqdyGrxDITMckVJYkohrrJGpepLpavihnW 9hFJknCqV2h1HfqVPju/OBrKJiKd1diIYGnwliuFrQotg0gn/AKkqR0aI w=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="702759804" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-9105.sea19.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:04:26 +0000 Received: from EX19MTAEUC001.ant.amazon.com [10.0.10.100:58296] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.14.129:2525] with esmtp (Farcaster) id e132c2ad-2d34-4ccb-890d-621a7e0c08cd; Mon, 5 Feb 2024 12:04:25 +0000 (UTC) X-Farcaster-Flow-ID: e132c2ad-2d34-4ccb-890d-621a7e0c08cd Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUC001.ant.amazon.com (10.252.51.155) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:04:23 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:04:17 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 09/18] intel-iommu: Use pkernfs for root/context pgtable pages Date: Mon, 5 Feb 2024 12:01:54 +0000 Message-ID: <20240205120203.60312-10-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D041UWA001.ant.amazon.com (10.13.139.124) To EX19D014EUC004.ant.amazon.com (10.252.51.182) The previous commits were preparation for using pkernfs memory for IOMMU pgtables: a file in the filesystem is available and an allocator to allocate 4-KiB pages from that file is available. Now use those to actually use pkernfs memory for root and context pgtable pages. If pkernfs is enabled then a "region" (physical and virtual memory chunk) is fetch from pkernfs and used to drive the allocator. Should this rather just be a pointer to a pkernfs inode? That abstraction seems leaky but without having the ability to store struct files at this point it's probably the more accurate. The freeing still needs to be hooked into the allocator... --- drivers/iommu/intel/iommu.c | 24 ++++++++++++++++++++---- drivers/iommu/intel/iommu.h | 2 ++ 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 744e4e6b8d72..2dd3f055dbce 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -28,6 +29,7 @@ #include "../dma-iommu.h" #include "../irq_remapping.h" #include "../iommu-sva.h" +#include "../pgtable_alloc.h" #include "pasid.h" #include "cap_audit.h" #include "perfmon.h" @@ -617,7 +619,12 @@ struct context_entry *iommu_context_addr(struct intel_iommu *iommu, u8 bus, if (!alloc) return NULL; - context = alloc_pgtable_page(iommu->node, GFP_ATOMIC); + if (pkernfs_enabled()) + iommu_alloc_page_from_region( + &iommu->pkernfs_region, + (void **) &context, NULL); + else + context = alloc_pgtable_page(iommu->node, GFP_ATOMIC); if (!context) return NULL; @@ -1190,7 +1197,15 @@ static int iommu_alloc_root_entry(struct intel_iommu *iommu) { struct root_entry *root; - root = alloc_pgtable_page(iommu->node, GFP_ATOMIC); + if (pkernfs_enabled()) { + pkernfs_alloc_iommu_root_pgtables(&iommu->pkernfs_region); + root = pgtable_get_root_page( + &iommu->pkernfs_region, + liveupdate); + } else { + root = alloc_pgtable_page(iommu->node, GFP_ATOMIC); + } + if (!root) { pr_err("Allocating root entry for %s failed\n", iommu->name); @@ -2790,7 +2805,7 @@ static int __init init_dmars(void) init_translation_status(iommu); - if (translation_pre_enabled(iommu) && !is_kdump_kernel()) { + if (translation_pre_enabled(iommu) && !is_kdump_kernel() && !liveupdate) { iommu_disable_translation(iommu); clear_translation_pre_enabled(iommu); pr_warn("Translation was enabled for %s but we are not in kdump mode\n", @@ -2806,7 +2821,8 @@ static int __init init_dmars(void) if (ret) goto free_iommu; - if (translation_pre_enabled(iommu)) { + /* For the live update case restore pgtables, don't copy */ + if (translation_pre_enabled(iommu) && !liveupdate) { pr_info("Translation already enabled - trying to copy translation structures\n"); ret = copy_translation_tables(iommu); diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index e6a3e7065616..a2338e398ba3 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -22,6 +22,7 @@ #include #include #include +#include #include #include @@ -672,6 +673,7 @@ struct intel_iommu { unsigned long *copied_tables; /* bitmap of copied tables */ spinlock_t lock; /* protect context, domain ids */ struct root_entry *root_entry; /* virtual address */ + struct pkernfs_region pkernfs_region; struct iommu_flush flush; #endif From patchwork Mon Feb 5 12:01:55 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545366 Received: from smtp-fw-52004.amazon.com (smtp-fw-52004.amazon.com [52.119.213.154]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 7A0841B5AA; Mon, 5 Feb 2024 12:04:35 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.154 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134677; cv=none; b=YSzcTq8d/8Om7LYg9kQFFgPXbkweq5YueINlGL8+g6A/ig5+YUz+cvBBrSNq/eDlnEqaszCL5s/0E3yUs/q/Ut83H0cz3iIAA5QN0PwjkBUpacOcB2ru+KRliuugsXu68Oo4IXzmqENYMCeNzO8jDLOGKsLPF9FmlEgvr0+RAxM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134677; c=relaxed/simple; bh=pZgiiTAlLapO69RdGzjL9bPIQ+8hoYW8Z/jLkX/PZWw=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Uv7FPGFSDLgtlsiqMTPG2ubsYin/qTuXBX69pYpwqgy17tEY9TbklgH60CzFtyM3gXjfHBp2ftSoadRIfgL6rW2o8mnkg/q9sHQ8s/jbePy9w6uF2vN7gHeXjtHu8xW7PRn2gjVLhthXWne04rSb6Q6epp4udKbC5Yz7/M9Z95I= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=ZpHiYYS7; arc=none smtp.client-ip=52.119.213.154 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="ZpHiYYS7" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134676; x=1738670676; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=R61AUSOacFWGGPMbVgCPWnIrJ1KmkLbm3r8EEWKzIHg=; b=ZpHiYYS7bX7Ee49iWVwgn6tXTcpzrQg/OJnRaSUU+v0EuUDGFcIda7LO mY2xzJKofyFUzM1DcCoHChqTwj4RQkUcyhJHjPRSfxKtVcqRRmNS7Y+ay ACpFwNad/ZcuF1gsLqjjysVMrpa1wm5F71PkDijz43Vm0nZt15Jgnizj+ g=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="182633292" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.43.8.2]) by smtp-border-fw-52004.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:04:31 +0000 Received: from EX19MTAEUA001.ant.amazon.com [10.0.43.254:29990] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.14.129:2525] with esmtp (Farcaster) id 479437ca-efef-4c96-b04c-24208e90e4af; Mon, 5 Feb 2024 12:04:30 +0000 (UTC) X-Farcaster-Flow-ID: 479437ca-efef-4c96-b04c-24208e90e4af Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUA001.ant.amazon.com (10.252.50.192) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:04:29 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:04:23 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 10/18] iommu/intel: zap context table entries on kexec Date: Mon, 5 Feb 2024 12:01:55 +0000 Message-ID: <20240205120203.60312-11-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D041UWA001.ant.amazon.com (10.13.139.124) To EX19D014EUC004.ant.amazon.com (10.252.51.182) In the next commit the IOMMU shutdown function will be modified to not actually shut down the IOMMU when doing a kexec. To prevent leaving DMA mappings for non-persistent devices around during kexec we add a function to the kexec flow which iterates though all IOMMU domains and zaps the context entries for the devices belonging to those domain. A list of domains for the IOMMU is added and maintained. --- drivers/iommu/intel/dmar.c | 1 + drivers/iommu/intel/iommu.c | 34 ++++++++++++++++++++++++++++++---- drivers/iommu/intel/iommu.h | 2 ++ 3 files changed, 33 insertions(+), 4 deletions(-) diff --git a/drivers/iommu/intel/dmar.c b/drivers/iommu/intel/dmar.c index 23cb80d62a9a..00f69f40a4ac 100644 --- a/drivers/iommu/intel/dmar.c +++ b/drivers/iommu/intel/dmar.c @@ -1097,6 +1097,7 @@ static int alloc_iommu(struct dmar_drhd_unit *drhd) iommu->segment = drhd->segment; iommu->node = NUMA_NO_NODE; + INIT_LIST_HEAD(&iommu->domains); ver = readl(iommu->reg + DMAR_VER_REG); pr_info("%s: reg_base_addr %llx ver %d:%d cap %llx ecap %llx\n", diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 2dd3f055dbce..315c6b7f901c 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1831,6 +1831,7 @@ static int domain_attach_iommu(struct dmar_domain *domain, goto err_clear; } domain_update_iommu_cap(domain); + list_add(&domain->domains, &iommu->domains); spin_unlock(&iommu->lock); return 0; @@ -3608,6 +3609,33 @@ static void intel_disable_iommus(void) iommu_disable_translation(iommu); } +void zap_context_table_entries(struct intel_iommu *iommu) +{ + struct context_entry *context; + struct dmar_domain *domain; + struct device_domain_info *device; + int bus, devfn; + u16 did_old; + + list_for_each_entry(domain, &iommu->domains, domains) { + list_for_each_entry(device, &domain->devices, link) { + context = iommu_context_addr(iommu, device->bus, device->devfn, 0); + if (!context || !context_present(context)) + continue; + context_domain_id(context); + context_clear_entry(context); + __iommu_flush_cache(iommu, context, sizeof(*context)); + iommu->flush.flush_context(iommu, + did_old, + (((u16)bus) << 8) | devfn, + DMA_CCMD_MASK_NOBIT, + DMA_CCMD_DEVICE_INVL); + iommu->flush.flush_iotlb(iommu, did_old, 0, 0, + DMA_TLB_DSI_FLUSH); + } + } +} + void intel_iommu_shutdown(void) { struct dmar_drhd_unit *drhd; @@ -3620,10 +3648,8 @@ void intel_iommu_shutdown(void) /* Disable PMRs explicitly here. */ for_each_iommu(iommu, drhd) - iommu_disable_protect_mem_regions(iommu); - - /* Make sure the IOMMUs are switched off */ - intel_disable_iommus(); + zap_context_table_entries(iommu); + return up_write(&dmar_global_lock); } diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index a2338e398ba3..4a2f163a86f3 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -600,6 +600,7 @@ struct dmar_domain { spinlock_t lock; /* Protect device tracking lists */ struct list_head devices; /* all devices' list */ struct list_head dev_pasids; /* all attached pasids */ + struct list_head domains; /* all struct dmar_domains on this IOMMU */ struct dma_pte *pgd; /* virtual address */ int gaw; /* max guest address width */ @@ -700,6 +701,7 @@ struct intel_iommu { void *perf_statistic; struct iommu_pmu *pmu; + struct list_head domains; /* all struct dmar_domains on this IOMMU */ }; /* PCI domain-device relationship */ From patchwork Mon Feb 5 12:01:56 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545367 Received: from smtp-fw-80009.amazon.com (smtp-fw-80009.amazon.com [99.78.197.220]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 47BFD1CD35; Mon, 5 Feb 2024 12:04:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.220 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134680; cv=none; b=YIByIh6zRP3VGUbzHnVFM8Am4kH8qnxo1ooVupDGwnn1jMtx6owrFatAyu2HgdLUZ1lQ+XDwtoIK9CPGNFGcp4owGqJm4JzrWyFm1Lsnru4k4XZzVAXvamk1QZbD5KyB4uO9ybm20nnHcPMJGd90mDSrPaL47/8hYhI/l62nvPc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134680; c=relaxed/simple; bh=fscNGkpC5vVpP/8B8akCxtTGKONgvg1AB8vehQ7smFU=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=X4Y+0NAqKIGDdiT48cXfR0058rHy9OxalbUwtpk2JcyO8NDI5pD6/nlSqOiVEAUOWnJ7J9z/xQ2twUgYvThevuQMi/hFDI99yugBujF6ufJvNoOBGIiIjp57Onh+GD8n6zla891IzS9hgJWWp0yshv5df7pE3+9LaTvmxBCxZmM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=jBgeXVbC; arc=none smtp.client-ip=99.78.197.220 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="jBgeXVbC" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134680; x=1738670680; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=J/26rYh9SEOln6y0M0icroO5+g+Mdt4aS3KBehgGFN8=; b=jBgeXVbC7MQ7ZeIq302ImADqVc71l2YNCmbm/gxsDbKY2jZ4RtRubVSp BMrt66UyyL0Rk72Q8hIhiQAcndzT2dObMXoEZ8dAKfdA7IERIGzWm/p2L P8lNTmaiQmpbvrKZomnyha00s5pOWg0qd2QccLfBLKpD+5MDvaNB6k6aE c=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="63724561" Received: from pdx4-co-svc-p1-lb2-vlan2.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.210]) by smtp-border-fw-80009.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:04:38 +0000 Received: from EX19MTAEUA002.ant.amazon.com [10.0.43.254:15018] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.28.192:2525] with esmtp (Farcaster) id 82f5a304-b865-4a4d-9902-a5be57c26a04; Mon, 5 Feb 2024 12:04:36 +0000 (UTC) X-Farcaster-Flow-ID: 82f5a304-b865-4a4d-9902-a5be57c26a04 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUA002.ant.amazon.com (10.252.50.124) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:04:36 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:04:30 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 11/18] dma-iommu: Always enable deferred attaches for liveupdate Date: Mon, 5 Feb 2024 12:01:56 +0000 Message-ID: <20240205120203.60312-12-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D041UWA001.ant.amazon.com (10.13.139.124) To EX19D014EUC004.ant.amazon.com (10.252.51.182) Seeing as translations are pre-enabled, all devices will be set for deferred attach. The deferred attached actually has to be done when doing DMA mapping for devices to work. There may be a better way to do this be, for example, consulting the context entry table and only deferring attach if there is a persisted context table entry for this device. --- drivers/iommu/dma-iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index e5d087bd6da1..76f916848f48 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -1750,7 +1750,7 @@ void iommu_dma_compose_msi_msg(struct msi_desc *desc, struct msi_msg *msg) static int iommu_dma_init(void) { - if (is_kdump_kernel()) + if (is_kdump_kernel() || liveupdate) static_branch_enable(&iommu_deferred_attach_enabled); return iova_cache_get(); From patchwork Mon Feb 5 12:01:57 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545368 Received: from smtp-fw-52002.amazon.com (smtp-fw-52002.amazon.com [52.119.213.150]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 48F0F1D532; Mon, 5 Feb 2024 12:05:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.150 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134715; cv=none; b=FCgsm2jH2tPppUIa65dYnNVJmFUN/oB7RHyAqCzr9gv5sty3uAgUGRxh4D/IW1WysTbTOkBLEJ/4+pM0dQPF0hXa+DThYYq9xvu8nK5Z2e0/8I5JdpgGSrXYRusL70vltKISELMrh0tcye6sfP2cePuRZvojBAZ85fKreGk4uT8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134715; c=relaxed/simple; bh=3vrGczpyrruaMjXatBGYsYV2shf2AH13YtkDkq8j5sY=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=GCTrGrkcp2urfikKzgw4TlO9k9dSjsxy0lEdmtcK9NASUA/keVhN728V+cCt5bUD3FjGyrnMciWcw+Rh1g00UiUX3G9R3Xmb0bM6SqseKgLPJ+BaVOADPXLu2WSyzimR41oVOpGLIqGV1uCh4ifbFKcpvpKhxHR164+Nte4xS/0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=rBiC7ikK; arc=none smtp.client-ip=52.119.213.150 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="rBiC7ikK" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134713; x=1738670713; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=FfAwJVvQHyFiTbONYCZnQNlkav6R7SrA1Dbs6XI8844=; b=rBiC7ikKOvQFn4asAe08Bc5Aazz7SkCi9MG/65Unu2c8sm+oWJpXnFwV g+F5I3SDvv2ophIVeXpl8l4FKUoYjdNTM2Ec4h3KnqbFNJbGLX/tjQzz7 xwH9VULxVmg7l9CbjzZh0S+ckkRukr1velW8RuvRnSIQp/pUGh3BkNWLW Q=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="610940405" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-52002.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:05:08 +0000 Received: from EX19MTAEUB002.ant.amazon.com [10.0.43.254:20056] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.8.155:2525] with esmtp (Farcaster) id 0814e71f-b1b8-4c6d-9ee2-790efdbab159; Mon, 5 Feb 2024 12:05:06 +0000 (UTC) X-Farcaster-Flow-ID: 0814e71f-b1b8-4c6d-9ee2-790efdbab159 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUB002.ant.amazon.com (10.252.51.59) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:06 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:00 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 12/18] pkernfs: Add IOMMU domain pgtables file Date: Mon, 5 Feb 2024 12:01:57 +0000 Message-ID: <20240205120203.60312-13-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D046UWA004.ant.amazon.com (10.13.139.76) To EX19D014EUC004.ant.amazon.com (10.252.51.182) Similar to the IOMMU root pgtables file which was added in a previous commit, now support a file type for IOMMU domain pgtables in the IOMMU directory. These domain pgtable files only need to be useable after the system has booted up, for example by QEMU creating one of these files and using it to back the IOMMU pgtables for a persistent VM. As such the filesystem abstraction can be better maintained here as the kernel code doesn't need to reach "behind" the filesystem abstraction like it does for the root pgtables. A new inode type is created for domain pgtable files, and the IOMMU directory gets inode_operation callbacks to support creating and deleting these files in it. Note: there is a use-after-free risk here too: if the domain pgtable file is truncated while it's in-use for IOMMU pgtables then freed memory could still be mapped into the IOMMU. To mitigate this there should be a machanism to "freeze" the files once they've been given to the IOMMU. --- fs/pkernfs/inode.c | 9 +++++-- fs/pkernfs/iommu.c | 55 +++++++++++++++++++++++++++++++++++++++-- fs/pkernfs/pkernfs.h | 4 +++ include/linux/pkernfs.h | 1 + 4 files changed, 65 insertions(+), 4 deletions(-) diff --git a/fs/pkernfs/inode.c b/fs/pkernfs/inode.c index 1d712e0a82a1..35842cd61002 100644 --- a/fs/pkernfs/inode.c +++ b/fs/pkernfs/inode.c @@ -35,7 +35,11 @@ struct inode *pkernfs_inode_get(struct super_block *sb, unsigned long ino) inode->i_op = &pkernfs_iommu_dir_inode_operations; inode->i_fop = &pkernfs_dir_fops; inode->i_mode = S_IFDIR; - } else if (pkernfs_inode->flags | PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES) { + } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES) { + inode->i_fop = &pkernfs_file_fops; + inode->i_mode = S_IFREG; + } else if (pkernfs_inode->flags & PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES) { + inode->i_fop = &pkernfs_file_fops; inode->i_mode = S_IFREG; } @@ -175,6 +179,7 @@ const struct inode_operations pkernfs_dir_inode_operations = { }; const struct inode_operations pkernfs_iommu_dir_inode_operations = { + .create = pkernfs_create_iommu_pgtables, .lookup = pkernfs_lookup, + .unlink = pkernfs_unlink, }; - diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c index 5bce8146d7bb..f14e76013e85 100644 --- a/fs/pkernfs/iommu.c +++ b/fs/pkernfs/iommu.c @@ -4,6 +4,27 @@ #include +void pkernfs_alloc_iommu_domain_pgtables(struct file *ppts, struct pkernfs_region *pkernfs_region) +{ + struct pkernfs_inode *pkernfs_inode; + unsigned long *mappings_block_vaddr; + unsigned long inode_idx; + + /* + * For a pkernfs region block, the "mappings_block" field is still + * just a block index, but that block doesn't actually contain mappings + * it contains the pkernfs_region data + */ + + inode_idx = ppts->f_inode->i_ino; + pkernfs_inode = pkernfs_get_persisted_inode(NULL, inode_idx); + + mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL, + pkernfs_inode->mappings_block); + set_bit(0, mappings_block_vaddr); + pkernfs_region->vaddr = mappings_block_vaddr; + pkernfs_region->paddr = pkernfs_base + (pkernfs_inode->mappings_block * (2 << 20)); +} void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region) { unsigned long *mappings_block_vaddr; @@ -63,9 +84,8 @@ void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region) * just a block index, but that block doesn't actually contain mappings * it contains the pkernfs_region data */ - mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL, - iommu_pgtables->mappings_block); + iommu_pgtables->mappings_block); set_bit(0, mappings_block_vaddr); pkernfs_region->vaddr = mappings_block_vaddr; pkernfs_region->paddr = pkernfs_base + (iommu_pgtables->mappings_block * PMD_SIZE); @@ -88,6 +108,29 @@ void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region) (iommu_pgtables->mappings_block * PMD_SIZE); } +int pkernfs_create_iommu_pgtables(struct mnt_idmap *id, struct inode *dir, + struct dentry *dentry, umode_t mode, bool excl) +{ + unsigned long free_inode; + struct pkernfs_inode *pkernfs_inode; + struct inode *vfs_inode; + + free_inode = pkernfs_allocate_inode(dir->i_sb); + if (free_inode <= 0) + return -ENOMEM; + + pkernfs_inode = pkernfs_get_persisted_inode(dir->i_sb, free_inode); + pkernfs_inode->sibling_ino = pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino; + pkernfs_get_persisted_inode(dir->i_sb, dir->i_ino)->child_ino = free_inode; + strscpy(pkernfs_inode->filename, dentry->d_name.name, PKERNFS_FILENAME_LEN); + pkernfs_inode->flags = PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES; + pkernfs_inode->mappings_block = pkernfs_alloc_block(dir->i_sb); + memset(pkernfs_addr_for_block(dir->i_sb, pkernfs_inode->mappings_block), 0, (2 << 20)); + vfs_inode = pkernfs_inode_get(dir->i_sb, free_inode); + d_add(dentry, vfs_inode); + return 0; +} + void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr) { if (WARN_ON(paddr >= region->paddr + region->bytes)) @@ -96,3 +139,11 @@ void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long return NULL; return region->vaddr + (paddr - region->paddr); } + +bool pkernfs_is_iommu_domain_pgtables(struct file *f) +{ + return f && + pkernfs_get_persisted_inode(f->f_inode->i_sb, f->f_inode->i_ino)->flags & + PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES; +} + diff --git a/fs/pkernfs/pkernfs.h b/fs/pkernfs/pkernfs.h index e1b7ae3fe7f1..9bea827f8b40 100644 --- a/fs/pkernfs/pkernfs.h +++ b/fs/pkernfs/pkernfs.h @@ -21,6 +21,7 @@ struct pkernfs_sb { #define PKERNFS_INODE_FLAG_DIR (1 << 1) #define PKERNFS_INODE_FLAG_IOMMU_DIR (1 << 2) #define PKERNFS_INODE_FLAG_IOMMU_ROOT_PGTABLES (1 << 3) +#define PKERNFS_INODE_FLAG_IOMMU_DOMAIN_PGTABLES (1 << 4) struct pkernfs_inode { int flags; /* @@ -50,8 +51,11 @@ void *pkernfs_addr_for_block(struct super_block *sb, int block_idx); unsigned long pkernfs_allocate_inode(struct super_block *sb); struct pkernfs_inode *pkernfs_get_persisted_inode(struct super_block *sb, int ino); +int pkernfs_create_iommu_pgtables(struct mnt_idmap *id, struct inode *dir, + struct dentry *dentry, umode_t mode, bool excl); extern const struct file_operations pkernfs_dir_fops; extern const struct file_operations pkernfs_file_fops; extern const struct inode_operations pkernfs_file_inode_operations; extern const struct inode_operations pkernfs_iommu_dir_inode_operations; +extern const struct inode_operations pkernfs_iommu_domain_pgtables_inode_operations; diff --git a/include/linux/pkernfs.h b/include/linux/pkernfs.h index 0110e4784109..4ca923ee0d82 100644 --- a/include/linux/pkernfs.h +++ b/include/linux/pkernfs.h @@ -33,4 +33,5 @@ void pkernfs_alloc_page_from_region(struct pkernfs_region *pkernfs_region, void **vaddr, unsigned long *paddr); void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr); +bool pkernfs_is_iommu_domain_pgtables(struct file *f); #endif /* _LINUX_PKERNFS_H */ From patchwork Mon Feb 5 12:01:58 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545369 Received: from smtp-fw-80006.amazon.com (smtp-fw-80006.amazon.com [99.78.197.217]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2D1FE1BDED; Mon, 5 Feb 2024 12:05:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.217 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134718; cv=none; b=IMpi1JE6/plHPXfoTrG1lv8DK1G4m0qLP9hDABUv7auautwokDOz6hAm8RJGHAtlhXdabhGXo23sgH9p48+F3sWMHQN8P3ZjPdX/COGKX9gNQhTbTr8zXWr+Opubk1PTejf5c8GvgGB+zJ+8qKiyExYuZ2GCiZe2mB5ApVqQokc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134718; c=relaxed/simple; bh=z8BoM9mBaSRQmeip//Yd1XmcxTr/klkAQB0/G6dx2tY=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=S7jOIy6WnHIF5sFYrhDhXFafXN+ZK731Ezovmo9A3ryHtKNjHulaAvaVMDBdQhVGiJXSpD077Sfljdbljeih4hv5ny3iMB2apApqvTLhYw/AqfdP0oFEr5wZjZXG/uBhs43LdiqC4TsHu369gyqEaFIm5kYCJY1Rde3XhM/zkL4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=R3AYRMnS; arc=none smtp.client-ip=99.78.197.217 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="R3AYRMnS" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134717; x=1738670717; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=mVhmItHN9qmSrpLWsRnSDOh51VJYGjIwmSMhmIPWHpo=; b=R3AYRMnSiKwRmMKtH/lqN0t0jVA1hEeWujguPv2alxxY26X3S7XzQefN CbMyAY1Rw40hw8Q32U8gZ8gatsvonTW4i0VA21ckDc9rQ1AWwQz/O6p+M FzpA83tH+NQsPmMmtOZu3T0MuoB6TXvnFWVtb+BYZJp3/CpOeYlh70u6f A=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="271102836" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-80006.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:05:14 +0000 Received: from EX19MTAEUC002.ant.amazon.com [10.0.17.79:18332] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.28.192:2525] with esmtp (Farcaster) id 430958cf-6b4e-40ab-be7a-e53a50074ea7; Mon, 5 Feb 2024 12:05:13 +0000 (UTC) X-Farcaster-Flow-ID: 430958cf-6b4e-40ab-be7a-e53a50074ea7 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUC002.ant.amazon.com (10.252.51.181) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:13 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:06 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 13/18] vfio: add ioctl to define persistent pgtables on container Date: Mon, 5 Feb 2024 12:01:58 +0000 Message-ID: <20240205120203.60312-14-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D046UWA004.ant.amazon.com (10.13.139.76) To EX19D014EUC004.ant.amazon.com (10.252.51.182) The previous commits added a file type in pkernfs for IOMMU persistent page tables. Now support actually setting persistent page tables on an IOMMU domain. This is done via a VFIO ioctl on a VFIO container. Userspace needs to create and open a IOMMU persistent page tables file and then supply that fd to the new VFIO_CONTAINER_SET_PERSISTENT_PGTABLES ioctl. That ioctl sets the supplied struct file on the struct vfio_container. Later when the IOMMU domain is allocated by VFIO, VFIO will check to see if the persistent pagetables have been defined and if they have will use the iommu_domain_alloc_persistent API which was introduced in the previous commit to pass the struct file down to the IOMMU which will actually use it for page tables. After kexec userspace needs to open the same IOMMU page table file and set it again via the same ioctl so that the IOMMU continues to use the same memory region for its page tables for that domain. --- drivers/vfio/container.c | 27 +++++++++++++++++++++++++++ drivers/vfio/vfio.h | 2 ++ drivers/vfio/vfio_iommu_type1.c | 27 +++++++++++++++++++++++++-- include/uapi/linux/vfio.h | 9 +++++++++ 4 files changed, 63 insertions(+), 2 deletions(-) diff --git a/drivers/vfio/container.c b/drivers/vfio/container.c index d53d08f16973..b60fcbf7bad0 100644 --- a/drivers/vfio/container.c +++ b/drivers/vfio/container.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include "vfio.h" @@ -21,6 +22,7 @@ struct vfio_container { struct rw_semaphore group_lock; struct vfio_iommu_driver *iommu_driver; void *iommu_data; + struct file *persistent_pgtables; bool noiommu; }; @@ -306,6 +308,8 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container, continue; } + driver->ops->set_persistent_pgtables(data, container->persistent_pgtables); + ret = __vfio_container_attach_groups(container, driver, data); if (ret) { driver->ops->release(data); @@ -324,6 +328,26 @@ static long vfio_ioctl_set_iommu(struct vfio_container *container, return ret; } +static int vfio_ioctl_set_persistent_pgtables(struct vfio_container *container, + unsigned long arg) +{ + struct vfio_set_persistent_pgtables set_ppts; + struct file *ppts; + + if (copy_from_user(&set_ppts, (void __user *)arg, sizeof(set_ppts))) + return -EFAULT; + + ppts = fget(set_ppts.persistent_pgtables_fd); + if (!ppts) + return -EBADF; + if (!pkernfs_is_iommu_domain_pgtables(ppts)) { + fput(ppts); + return -EBADF; + } + container->persistent_pgtables = ppts; + return 0; +} + static long vfio_fops_unl_ioctl(struct file *filep, unsigned int cmd, unsigned long arg) { @@ -345,6 +369,9 @@ static long vfio_fops_unl_ioctl(struct file *filep, case VFIO_SET_IOMMU: ret = vfio_ioctl_set_iommu(container, arg); break; + case VFIO_CONTAINER_SET_PERSISTENT_PGTABLES: + ret = vfio_ioctl_set_persistent_pgtables(container, arg); + break; default: driver = container->iommu_driver; data = container->iommu_data; diff --git a/drivers/vfio/vfio.h b/drivers/vfio/vfio.h index 307e3f29b527..6fa301bf6474 100644 --- a/drivers/vfio/vfio.h +++ b/drivers/vfio/vfio.h @@ -226,6 +226,8 @@ struct vfio_iommu_driver_ops { void *data, size_t count, bool write); struct iommu_domain *(*group_iommu_domain)(void *iommu_data, struct iommu_group *group); + int (*set_persistent_pgtables)(void *iommu_data, + struct file *ppts); }; struct vfio_iommu_driver { diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index eacd6ec04de5..b36edfc5c9ef 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -75,6 +75,7 @@ struct vfio_iommu { bool nesting; bool dirty_page_tracking; struct list_head emulated_iommu_groups; + struct file *persistent_pgtables; }; struct vfio_domain { @@ -2143,9 +2144,14 @@ static void vfio_iommu_iova_insert_copy(struct vfio_iommu *iommu, static int vfio_iommu_domain_alloc(struct device *dev, void *data) { + /* data is an in pointer to PPTs, and an out to the new domain. */ + struct file *ppts = *(struct file **) data; struct iommu_domain **domain = data; - *domain = iommu_domain_alloc(dev->bus); + if (ppts) + *domain = iommu_domain_alloc_persistent(dev->bus, ppts); + else + *domain = iommu_domain_alloc(dev->bus); return 1; /* Don't iterate */ } @@ -2156,6 +2162,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, struct vfio_iommu_group *group; struct vfio_domain *domain, *d; bool resv_msi; + /* In/out ptr to iommu_domain_alloc. */ + void *domain_alloc_data; phys_addr_t resv_msi_base = 0; struct iommu_domain_geometry *geo; LIST_HEAD(iova_copy); @@ -2203,8 +2211,12 @@ static int vfio_iommu_type1_attach_group(void *iommu_data, * want to iterate beyond the first device (if any). */ ret = -EIO; - iommu_group_for_each_dev(iommu_group, &domain->domain, + /* Smuggle the PPTs in the data field; it will be clobbered with the new domain */ + domain_alloc_data = iommu->persistent_pgtables; + iommu_group_for_each_dev(iommu_group, &domain_alloc_data, vfio_iommu_domain_alloc); + domain->domain = domain_alloc_data; + if (!domain->domain) goto out_free_domain; @@ -3165,6 +3177,16 @@ vfio_iommu_type1_group_iommu_domain(void *iommu_data, return domain; } +int vfio_iommu_type1_set_persistent_pgtables(void *iommu_data, + struct file *ppts) +{ + + struct vfio_iommu *iommu = iommu_data; + + iommu->persistent_pgtables = ppts; + return 0; +} + static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = { .name = "vfio-iommu-type1", .owner = THIS_MODULE, @@ -3179,6 +3201,7 @@ static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_type1 = { .unregister_device = vfio_iommu_type1_unregister_device, .dma_rw = vfio_iommu_type1_dma_rw, .group_iommu_domain = vfio_iommu_type1_group_iommu_domain, + .set_persistent_pgtables = vfio_iommu_type1_set_persistent_pgtables, }; static int __init vfio_iommu_type1_init(void) diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index afc1369216d9..fa9676bb4b26 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1797,6 +1797,15 @@ struct vfio_iommu_spapr_tce_remove { }; #define VFIO_IOMMU_SPAPR_TCE_REMOVE _IO(VFIO_TYPE, VFIO_BASE + 20) +struct vfio_set_persistent_pgtables { + /* + * File descriptor for a pkernfs IOMMU pgtables + * file to be used for persistence. + */ + __u32 persistent_pgtables_fd; +}; +#define VFIO_CONTAINER_SET_PERSISTENT_PGTABLES _IO(VFIO_TYPE, VFIO_BASE + 21) + /* ***************************************************************** */ #endif /* _UAPIVFIO_H */ From patchwork Mon Feb 5 12:01:59 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545370 Received: from smtp-fw-80008.amazon.com (smtp-fw-80008.amazon.com [99.78.197.219]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C5A0F1D555; Mon, 5 Feb 2024 12:05:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.219 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134724; cv=none; b=Xj38B/7cbmL4nhpLOAyn3dTSbJi37O08IeNFU1KushXWVTyOJTf/GSsqOlfEmOOHJDS5I5UE5Avs9md3tRBLGcNngpOOld96M03ggHEyIQmwZn/2uAUMQBHe96fc2WnUAtxdkToGFOecuq1tMtt7H+e0KwNx3sfhzo03x4rmsw8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134724; c=relaxed/simple; bh=HqWQLqTIWJhQA/wLFQbrxl5Yh4O/OkFiHi8etjbGQVs=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=fjab8lnUPyg5s84tJmnImIwq4G3TZKxLgHF8ijaIEY7i8MDy449SdtBy3NplA/StpDbjHof76tnDfUIdP7kL6eykBl79iI9Z8hbza2royxhApLvdo0NL18PV0mSuEFR8FNJ/FDVKTdV1sNSt1Q7E2Szd5mKVBftmJJEm0gySzc0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=ldc/791q; arc=none smtp.client-ip=99.78.197.219 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="ldc/791q" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134722; x=1738670722; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=VONiUp37V5G8RKTN0EmhvHz4ErjJ4ifKDUQ6e8VlzjQ=; b=ldc/791qw6jJic1Uai0kv46k6X1Wfizk0tOGGdMAT7eR2d3YERm3mklH jHUZ6hCp3vTgh8dtRjl5+2zBoGEDDYu/mFZLH+Ewz+8PELZTA/YOMqNTh L/NuBZM/klXbZjUht1oy+3oiCT1mUrFu2AN4Uu9yLIhAo8xh3H6pCt9XV k=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="63755776" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-80008.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:05:21 +0000 Received: from EX19MTAEUC001.ant.amazon.com [10.0.10.100:40572] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.14.233:2525] with esmtp (Farcaster) id 27f4f8c6-d18d-45f7-aee0-af42d924df8a; Mon, 5 Feb 2024 12:05:19 +0000 (UTC) X-Farcaster-Flow-ID: 27f4f8c6-d18d-45f7-aee0-af42d924df8a Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUC001.ant.amazon.com (10.252.51.155) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:19 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:13 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 14/18] intel-iommu: Allocate domain pgtable pages from pkernfs Date: Mon, 5 Feb 2024 12:01:59 +0000 Message-ID: <20240205120203.60312-15-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D046UWA004.ant.amazon.com (10.13.139.76) To EX19D014EUC004.ant.amazon.com (10.252.51.182) In the previous commit VFIO was updated to be able to define persistent pgtables on a container. Now the IOMMU driver is updated to accept the file for persistent pgtables when the domain is allocated and use that file as the source of pages for the pgtables. The iommu_ops.domain_alloc callback is extended to page a struct file for the pkernfs domain pgtables file. Most call sites are updated to supply NULL here, indicating no persistent pgtables. VFIO's caller is updated to plumb the pkernfs file through. When this file is supplied the md_domain_init() function convers the file into a pkernfs region and uses that region for pgtables. Similarly to the root pgtables there are use after free issues with this that need sorting out, and the free() functions also need to be updated to free from the pkernfs region. It may be better to store the struct file on the dmar_domain and map file offset to addr every time rather than using a pkernfs region for this. --- drivers/iommu/intel/iommu.c | 35 +++++++++++++++++++++++++++-------- drivers/iommu/intel/iommu.h | 1 + drivers/iommu/iommu.c | 22 ++++++++++++++-------- drivers/iommu/pgtable_alloc.c | 7 +++++++ drivers/iommu/pgtable_alloc.h | 1 + fs/pkernfs/iommu.c | 2 +- include/linux/iommu.h | 6 +++++- include/linux/pkernfs.h | 1 + 8 files changed, 57 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 315c6b7f901c..809ca9e93992 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -946,7 +946,13 @@ static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain, if (!dma_pte_present(pte)) { uint64_t pteval; - tmp_page = alloc_pgtable_page(domain->nid, gfp); + if (domain->pgtables_allocator.vaddr) + iommu_alloc_page_from_region( + &domain->pgtables_allocator, + &tmp_page, + NULL); + else + tmp_page = alloc_pgtable_page(domain->nid, gfp); if (!tmp_page) return NULL; @@ -2399,7 +2405,7 @@ static int iommu_domain_identity_map(struct dmar_domain *domain, DMA_PTE_READ|DMA_PTE_WRITE, GFP_KERNEL); } -static int md_domain_init(struct dmar_domain *domain, int guest_width); +static int md_domain_init(struct dmar_domain *domain, int guest_width, struct file *ppts); static int __init si_domain_init(int hw) { @@ -2411,7 +2417,7 @@ static int __init si_domain_init(int hw) if (!si_domain) return -EFAULT; - if (md_domain_init(si_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH)) { + if (md_domain_init(si_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH, NULL)) { domain_exit(si_domain); si_domain = NULL; return -EFAULT; @@ -4029,7 +4035,7 @@ static void device_block_translation(struct device *dev) info->domain = NULL; } -static int md_domain_init(struct dmar_domain *domain, int guest_width) +static int md_domain_init(struct dmar_domain *domain, int guest_width, struct file *ppts) { int adjust_width; @@ -4042,8 +4048,21 @@ static int md_domain_init(struct dmar_domain *domain, int guest_width) domain->iommu_superpage = 0; domain->max_addr = 0; - /* always allocate the top pgd */ - domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC); + if (ppts) { + unsigned long pgd_phy; + + pkernfs_get_region_for_ppts( + ppts, + &domain->pgtables_allocator); + iommu_get_pgd_page( + &domain->pgtables_allocator, + (void **) &domain->pgd, + &pgd_phy); + + } else { + /* always allocate the top pgd */ + domain->pgd = alloc_pgtable_page(domain->nid, GFP_ATOMIC); + } if (!domain->pgd) return -ENOMEM; domain_flush_cache(domain, domain->pgd, PAGE_SIZE); @@ -4064,7 +4083,7 @@ static struct iommu_domain blocking_domain = { } }; -static struct iommu_domain *intel_iommu_domain_alloc(unsigned type) +static struct iommu_domain *intel_iommu_domain_alloc(unsigned int type, struct file *ppts) { struct dmar_domain *dmar_domain; struct iommu_domain *domain; @@ -4079,7 +4098,7 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type) pr_err("Can't allocate dmar_domain\n"); return NULL; } - if (md_domain_init(dmar_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH)) { + if (md_domain_init(dmar_domain, DEFAULT_DOMAIN_ADDRESS_WIDTH, ppts)) { pr_err("Domain initialization failed\n"); domain_exit(dmar_domain); return NULL; diff --git a/drivers/iommu/intel/iommu.h b/drivers/iommu/intel/iommu.h index 4a2f163a86f3..f772fdcf3828 100644 --- a/drivers/iommu/intel/iommu.h +++ b/drivers/iommu/intel/iommu.h @@ -602,6 +602,7 @@ struct dmar_domain { struct list_head dev_pasids; /* all attached pasids */ struct list_head domains; /* all struct dmar_domains on this IOMMU */ + struct pkernfs_region pgtables_allocator; struct dma_pte *pgd; /* virtual address */ int gaw; /* max guest address width */ diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 3a67e636287a..f26e83d5b159 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -97,7 +97,7 @@ static int iommu_bus_notifier(struct notifier_block *nb, unsigned long action, void *data); static void iommu_release_device(struct device *dev); static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus, - unsigned type); + unsigned int type, struct file *ppts); static int __iommu_attach_device(struct iommu_domain *domain, struct device *dev); static int __iommu_attach_group(struct iommu_domain *domain, @@ -1734,7 +1734,7 @@ __iommu_group_alloc_default_domain(const struct bus_type *bus, { if (group->default_domain && group->default_domain->type == req_type) return group->default_domain; - return __iommu_domain_alloc(bus, req_type); + return __iommu_domain_alloc(bus, req_type, NULL); } /* @@ -1971,7 +1971,7 @@ void iommu_set_fault_handler(struct iommu_domain *domain, EXPORT_SYMBOL_GPL(iommu_set_fault_handler); static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus, - unsigned type) + unsigned int type, struct file *ppts) { struct iommu_domain *domain; unsigned int alloc_type = type & IOMMU_DOMAIN_ALLOC_FLAGS; @@ -1979,7 +1979,7 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus, if (bus == NULL || bus->iommu_ops == NULL) return NULL; - domain = bus->iommu_ops->domain_alloc(alloc_type); + domain = bus->iommu_ops->domain_alloc(alloc_type, ppts); if (!domain) return NULL; @@ -2001,9 +2001,15 @@ static struct iommu_domain *__iommu_domain_alloc(const struct bus_type *bus, return domain; } +struct iommu_domain *iommu_domain_alloc_persistent(const struct bus_type *bus, struct file *ppts) +{ + return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED, ppts); +} +EXPORT_SYMBOL_GPL(iommu_domain_alloc_persistent); + struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus) { - return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED); + return __iommu_domain_alloc(bus, IOMMU_DOMAIN_UNMANAGED, NULL); } EXPORT_SYMBOL_GPL(iommu_domain_alloc); @@ -3198,14 +3204,14 @@ static int __iommu_group_alloc_blocking_domain(struct iommu_group *group) return 0; group->blocking_domain = - __iommu_domain_alloc(dev->dev->bus, IOMMU_DOMAIN_BLOCKED); + __iommu_domain_alloc(dev->dev->bus, IOMMU_DOMAIN_BLOCKED, NULL); if (!group->blocking_domain) { /* * For drivers that do not yet understand IOMMU_DOMAIN_BLOCKED * create an empty domain instead. */ group->blocking_domain = __iommu_domain_alloc( - dev->dev->bus, IOMMU_DOMAIN_UNMANAGED); + dev->dev->bus, IOMMU_DOMAIN_UNMANAGED, NULL); if (!group->blocking_domain) return -EINVAL; } @@ -3500,7 +3506,7 @@ struct iommu_domain *iommu_sva_domain_alloc(struct device *dev, const struct iommu_ops *ops = dev_iommu_ops(dev); struct iommu_domain *domain; - domain = ops->domain_alloc(IOMMU_DOMAIN_SVA); + domain = ops->domain_alloc(IOMMU_DOMAIN_SVA, false); if (!domain) return NULL; diff --git a/drivers/iommu/pgtable_alloc.c b/drivers/iommu/pgtable_alloc.c index f0c2e12f8a8b..276db15932cc 100644 --- a/drivers/iommu/pgtable_alloc.c +++ b/drivers/iommu/pgtable_alloc.c @@ -7,6 +7,13 @@ * The first 4 KiB is the bitmap - set the first bit in the bitmap. * Scan bitmap to find next free bits - it's next free page. */ +void iommu_get_pgd_page(struct pkernfs_region *region, void **vaddr, unsigned long *paddr) +{ + set_bit(1, region->vaddr); + *vaddr = region->vaddr + (1 << PAGE_SHIFT); + if (paddr) + *paddr = region->paddr + (1 << PAGE_SHIFT); +} void iommu_alloc_page_from_region(struct pkernfs_region *region, void **vaddr, unsigned long *paddr) { diff --git a/drivers/iommu/pgtable_alloc.h b/drivers/iommu/pgtable_alloc.h index c1666a7be3d3..50c3abba922b 100644 --- a/drivers/iommu/pgtable_alloc.h +++ b/drivers/iommu/pgtable_alloc.h @@ -3,6 +3,7 @@ #include #include +void iommu_get_pgd_page(struct pkernfs_region *region, void **vaddr, unsigned long *paddr); void iommu_alloc_page_from_region(struct pkernfs_region *region, void **vaddr, unsigned long *paddr); diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c index f14e76013e85..5d0b256e7dd8 100644 --- a/fs/pkernfs/iommu.c +++ b/fs/pkernfs/iommu.c @@ -4,7 +4,7 @@ #include -void pkernfs_alloc_iommu_domain_pgtables(struct file *ppts, struct pkernfs_region *pkernfs_region) +void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkernfs_region) { struct pkernfs_inode *pkernfs_inode; unsigned long *mappings_block_vaddr; diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 0225cf7445de..01bb89246ef7 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -101,6 +101,7 @@ struct iommu_domain { enum iommu_page_response_code (*iopf_handler)(struct iommu_fault *fault, void *data); void *fault_data; + struct file *persistent_pgtables; union { struct { iommu_fault_handler_t handler; @@ -266,7 +267,8 @@ struct iommu_ops { void *(*hw_info)(struct device *dev, u32 *length, u32 *type); /* Domain allocation and freeing by the iommu driver */ - struct iommu_domain *(*domain_alloc)(unsigned iommu_domain_type); + /* If ppts is not null it is a persistent domain; null is non-persistent */ + struct iommu_domain *(*domain_alloc)(unsigned int tiommu_domain_type, struct file *ppts); struct iommu_device *(*probe_device)(struct device *dev); void (*release_device)(struct device *dev); @@ -466,6 +468,8 @@ extern bool iommu_present(const struct bus_type *bus); extern bool device_iommu_capable(struct device *dev, enum iommu_cap cap); extern bool iommu_group_has_isolated_msi(struct iommu_group *group); extern struct iommu_domain *iommu_domain_alloc(const struct bus_type *bus); +extern struct iommu_domain *iommu_domain_alloc_persistent(const struct bus_type *bus, + struct file *ppts); extern void iommu_domain_free(struct iommu_domain *domain); extern int iommu_attach_device(struct iommu_domain *domain, struct device *dev); diff --git a/include/linux/pkernfs.h b/include/linux/pkernfs.h index 4ca923ee0d82..8aa69ef5a2d8 100644 --- a/include/linux/pkernfs.h +++ b/include/linux/pkernfs.h @@ -31,6 +31,7 @@ struct pkernfs_region { void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region); void pkernfs_alloc_page_from_region(struct pkernfs_region *pkernfs_region, void **vaddr, unsigned long *paddr); +void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkernfs_region); void *pkernfs_region_paddr_to_vaddr(struct pkernfs_region *region, unsigned long paddr); bool pkernfs_is_iommu_domain_pgtables(struct file *f); From patchwork Mon Feb 5 12:02:00 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545373 Received: from smtp-fw-52003.amazon.com (smtp-fw-52003.amazon.com [52.119.213.152]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9E369200C3; Mon, 5 Feb 2024 12:06:05 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.152 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134767; cv=none; b=jr+433HzYYAtjj+BxAl+SVaMSqVuXS9x6saff/uj1A8eGyY6Fw8DvRuYxJjigUInFdoDQ+gl/TDMnaelpwYCwBgtxpOEn2+1euExWWX+SFTe28mnq+9scTGnErgETxTNyczCMin4BBySZRqpaBK3ISD7bVKGO2c94u7XIk3fGa0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134767; c=relaxed/simple; bh=wiv5U2uS8hUNGvTNLpJMav4y5mpoMA0A7e/DcQgUB54=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=cItoH1kD6BBUG1ze24SXY7eIs18Ol8VIwjTLBPQD6tCuyUwNL4uJlvsCupedWL7J/93A0kEMU042wJbnGC4qb/D9yU4ehwM4z/RxCrlefUT7f0PT487XQRsHYMjLDQIjI3qhvCv8/7YPNbtXaHLn+jwdZZ3BWzaFi4NZsIRlElc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=tPu64pUv; arc=none smtp.client-ip=52.119.213.152 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="tPu64pUv" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134766; x=1738670766; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=nHBURAsv2kVj18uGoBS/ynF5eAKuZbHIuIE7fti0DT8=; b=tPu64pUvAr6GnsEJ2eZJyq16WF88nHXybjAkbNmSFVN6DizbghynKDZN VMYgd63kckia0/Kh+s2KMXDnCkauoyNrhzbwajn+MWNP0ZFgLFFKsJ3fv tODXa+LUMh4p3y8a2ACA2lLkRz1SBH6fWIl1iSwR08WAmOWQ2Qkpzd9a9 4=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="635764854" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-52003.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:06:02 +0000 Received: from EX19MTAEUA002.ant.amazon.com [10.0.43.254:50504] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.28.192:2525] with esmtp (Farcaster) id a892883c-e384-427e-8387-b1c5e5820895; Mon, 5 Feb 2024 12:05:50 +0000 (UTC) X-Farcaster-Flow-ID: a892883c-e384-427e-8387-b1c5e5820895 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUA002.ant.amazon.com (10.252.50.124) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:49 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:43 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 15/18] pkernfs: register device memory for IOMMU domain pgtables Date: Mon, 5 Feb 2024 12:02:00 +0000 Message-ID: <20240205120203.60312-16-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D042UWB002.ant.amazon.com (10.13.139.175) To EX19D014EUC004.ant.amazon.com (10.252.51.182) Similarly to the root/context pgtables, the IOMMU driver also does phys_to_virt when walking the domain pgtables. To make this work properly the physical memory needs to be mapped in at the correct place in the direct map. Register a memory device to support this. The alternative would be to wrap all of the phys_to_virt functions in something which is pkernfs aware. --- fs/pkernfs/iommu.c | 16 +++++++++++++++- 1 file changed, 15 insertions(+), 1 deletion(-) diff --git a/fs/pkernfs/iommu.c b/fs/pkernfs/iommu.c index 5d0b256e7dd8..073b9dd48237 100644 --- a/fs/pkernfs/iommu.c +++ b/fs/pkernfs/iommu.c @@ -9,6 +9,7 @@ void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkern struct pkernfs_inode *pkernfs_inode; unsigned long *mappings_block_vaddr; unsigned long inode_idx; + int rc; /* * For a pkernfs region block, the "mappings_block" field is still @@ -22,7 +23,20 @@ void pkernfs_get_region_for_ppts(struct file *ppts, struct pkernfs_region *pkern mappings_block_vaddr = (unsigned long *)pkernfs_addr_for_block(NULL, pkernfs_inode->mappings_block); set_bit(0, mappings_block_vaddr); - pkernfs_region->vaddr = mappings_block_vaddr; + + dev_set_name(&pkernfs_region->dev, "vfio-ppt-%s", pkernfs_inode->filename); + rc = device_register(&pkernfs_region->dev); + if (rc) + pr_err("device_register failed: %i\n", rc); + + pkernfs_region->pgmap.range.start = pkernfs_base + + (pkernfs_inode->mappings_block * PMD_SIZE); + pkernfs_region->pgmap.range.end = + pkernfs_region->pgmap.range.start + PMD_SIZE - 1; + pkernfs_region->pgmap.nr_range = 1; + pkernfs_region->pgmap.type = MEMORY_DEVICE_GENERIC; + pkernfs_region->vaddr = + devm_memremap_pages(&pkernfs_region->dev, &pkernfs_region->pgmap); pkernfs_region->paddr = pkernfs_base + (pkernfs_inode->mappings_block * (2 << 20)); } void pkernfs_alloc_iommu_root_pgtables(struct pkernfs_region *pkernfs_region) From patchwork Mon Feb 5 12:02:01 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545371 Received: from smtp-fw-80008.amazon.com (smtp-fw-80008.amazon.com [99.78.197.219]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AD1691B943; Mon, 5 Feb 2024 12:05:57 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=99.78.197.219 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134759; cv=none; b=s81H1sAf3ew6o8YKADNQuYTm96YJ0bVI1brNyCqgigp/GtwmnYhTCAIMywGgWVW9ziSdxQhr6jVh56zuYEqyJOm6/5XT+wR70eo+JQ1JLuYTQAGsy1UrtMeNoZ+9Idy/zf0ZAGAqg40ckFLCWnVCmUSPXihlWEVYEKvK3TWF4/k= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134759; c=relaxed/simple; bh=GKb17qTpfFtGwafuxlfpZaDYOiFuRf0lqbATSZhVa5M=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=VrKKmf3sgMGcE9hox37DOgifvWdwgb8fS5p5Ix71RaXvdu923oa5WXAu+CGE7OBa8LdCyX7UkTESOFLqmllOLVty2yTY0oxDIIrt60VDLiv/ZjDnIBozpSYDUamnnpcerbRZqlzT3W71ywGX5O70Eydx7KZ5nPiFQ6Y0xmA6mLU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=i/klmtUX; arc=none smtp.client-ip=99.78.197.219 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="i/klmtUX" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134757; x=1738670757; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=2AUEdbvoftHMg2AkT8RS6u9G56Zx/z0Bmygk8dweA74=; b=i/klmtUXhQUrqDvb4qCuzopKWT/ra9S74iPzNL3XvyoyCKcwmf1zgN4T miLXyPqCx6UXRj23ocW5tiM35DErXD+aCBEoRp0jMNNqmiSaBol5bQGif ZC2YCfK6Dd3qeJdvGBv3PJgo97QbI+tWtNJLMnAJnqpGex+192NyLyvIn c=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="63755948" Received: from pdx4-co-svc-p1-lb2-vlan3.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.25.36.214]) by smtp-border-fw-80008.pdx80.corp.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:05:57 +0000 Received: from EX19MTAEUB001.ant.amazon.com [10.0.17.79:13316] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.45.85:2525] with esmtp (Farcaster) id 596cc4dd-1066-4e32-b40f-00d6ab21896d; Mon, 5 Feb 2024 12:05:56 +0000 (UTC) X-Farcaster-Flow-ID: 596cc4dd-1066-4e32-b40f-00d6ab21896d Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUB001.ant.amazon.com (10.252.51.26) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:56 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:49 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 16/18] vfio: support not mapping IOMMU pgtables on live-update Date: Mon, 5 Feb 2024 12:02:01 +0000 Message-ID: <20240205120203.60312-17-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D042UWB002.ant.amazon.com (10.13.139.175) To EX19D014EUC004.ant.amazon.com (10.252.51.182) When restoring VMs after live update kexec, the IOVAs for the guest VM are already present in the persisted page tables. It is unnecessary to clobber the existing pgtable entries and it may introduce races if pgtable modifications happen concurrently with DMA. Provide a new VFIO MAP_DMA flag which userspace can supply to inform VFIO that the IOVAs are already mapped. In this case VFIO will skip over the call to the IOMMU driver to do the mapping. VFIO still needs the MAP_DMA ioctl to set up its internal data structures about the mapping. It would probably be better to move the persistence one layer up and persist the VFIO container in pkernfs. That way the whole container could be picked up and re-used without needing to do any MAP_DMA ioctls after kexec. --- drivers/vfio/vfio_iommu_type1.c | 24 +++++++++++++----------- include/uapi/linux/vfio.h | 1 + 2 files changed, 14 insertions(+), 11 deletions(-) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index b36edfc5c9ef..dc2682fbda2e 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -1456,7 +1456,7 @@ static int vfio_iommu_map(struct vfio_iommu *iommu, dma_addr_t iova, } static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, - size_t map_size) + size_t map_size, unsigned int flags) { dma_addr_t iova = dma->iova; unsigned long vaddr = dma->vaddr; @@ -1479,14 +1479,16 @@ static int vfio_pin_map_dma(struct vfio_iommu *iommu, struct vfio_dma *dma, break; } - /* Map it! */ - ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, - dma->prot); - if (ret) { - vfio_unpin_pages_remote(dma, iova + dma->size, pfn, - npage, true); - vfio_batch_unpin(&batch, dma); - break; + if (!(flags & VFIO_DMA_MAP_FLAG_LIVE_UPDATE)) { + /* Map it! */ + ret = vfio_iommu_map(iommu, iova + dma->size, pfn, npage, + dma->prot); + if (ret) { + vfio_unpin_pages_remote(dma, iova + dma->size, pfn, + npage, true); + vfio_batch_unpin(&batch, dma); + break; + } } size -= npage << PAGE_SHIFT; @@ -1662,7 +1664,7 @@ static int vfio_dma_do_map(struct vfio_iommu *iommu, if (list_empty(&iommu->domain_list)) dma->size = size; else - ret = vfio_pin_map_dma(iommu, dma, size); + ret = vfio_pin_map_dma(iommu, dma, size, map->flags); if (!ret && iommu->dirty_page_tracking) { ret = vfio_dma_bitmap_alloc(dma, pgsize); @@ -2836,7 +2838,7 @@ static int vfio_iommu_type1_map_dma(struct vfio_iommu *iommu, struct vfio_iommu_type1_dma_map map; unsigned long minsz; uint32_t mask = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE | - VFIO_DMA_MAP_FLAG_VADDR; + VFIO_DMA_MAP_FLAG_VADDR | VFIO_DMA_MAP_FLAG_LIVE_UPDATE; minsz = offsetofend(struct vfio_iommu_type1_dma_map, size); diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index fa9676bb4b26..d04d28e52110 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -1536,6 +1536,7 @@ struct vfio_iommu_type1_dma_map { #define VFIO_DMA_MAP_FLAG_READ (1 << 0) /* readable from device */ #define VFIO_DMA_MAP_FLAG_WRITE (1 << 1) /* writable from device */ #define VFIO_DMA_MAP_FLAG_VADDR (1 << 2) +#define VFIO_DMA_MAP_FLAG_LIVE_UPDATE (1 << 3) /* IOVAs already mapped in IOMMU before LU */ __u64 vaddr; /* Process virtual address */ __u64 iova; /* IO virtual address */ __u64 size; /* Size of mapping (bytes) */ From patchwork Mon Feb 5 12:02:02 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545372 Received: from smtp-fw-52004.amazon.com (smtp-fw-52004.amazon.com [52.119.213.154]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6C695200AB; Mon, 5 Feb 2024 12:06:04 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=52.119.213.154 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134766; cv=none; b=TrtEn0EFDa6q9kni+9Q50ayTsVHLS4ncgdEjk873gKHUiKA5t1j+QCA5y9763X6E+GTi/MzyIJO4Iba2So1D4UHWLN5xbqHz49hnOwgB/AVxTSTJpsEPKPSjLpBvOBh2EUSxS4Ls0TPPWtzo8XCt3Y4qLrBXmxGdNkJUV8w9HQI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134766; c=relaxed/simple; bh=ixoM0Sd7pCRhUQdiB6CxvVkou2FFR6qDgDmLsYAURmU=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=qpHDBZGQ72WunF5mvfKj6q5e3Va1EZt/sb8gVejzt7VPRFmurPWM4HwuHlp75hJb3BVi9IRnxNbX/UhgTH6Hzx/9skDavMQWe38fqbvdPLaoGiy/Kd3g2722azHReXPR0boHqwR6vCmRZvEadUBkIa+LJrKwksmwUuB23ASrTfg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=EywkZ7fI; arc=none smtp.client-ip=52.119.213.154 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="EywkZ7fI" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134765; x=1738670765; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=PZz9j1l+xmnCAt/gSMnG+c7uMbg+JDxTUK6mJnmY1gY=; b=EywkZ7fINIMUY/2tN9n7MFAb0T18hEP31HO8IxhveOGhJH3sa05dWqFh coiXD8Zckk5WX3vBORsjQwIrfWx4Nb5ub6x1lhj8NciceWTpP2y4JQDtQ qRFLvb/M/pLYQnRXw/BxamCSR6Essvn+Ne6oZRuzI6InFuzSlqd17JaDi o=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="182633597" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO smtpout.prod.us-east-1.prod.farcaster.email.amazon.dev) ([10.43.8.2]) by smtp-border-fw-52004.iad7.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:06:04 +0000 Received: from EX19MTAEUB002.ant.amazon.com [10.0.17.79:57869] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.28.144:2525] with esmtp (Farcaster) id 1dbb430b-a9a2-4718-957f-9c92058f05d9; Mon, 5 Feb 2024 12:06:02 +0000 (UTC) X-Farcaster-Flow-ID: 1dbb430b-a9a2-4718-957f-9c92058f05d9 Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUB002.ant.amazon.com (10.252.51.59) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:06:02 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:05:56 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 17/18] pci: Don't clear bus master is persistence enabled Date: Mon, 5 Feb 2024 12:02:02 +0000 Message-ID: <20240205120203.60312-18-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D042UWB002.ant.amazon.com (10.13.139.175) To EX19D014EUC004.ant.amazon.com (10.252.51.182) In order for persistent devices to continue to DMA during kexec the bus mastering capability needs to remain on. Do not disable bus mastering if pkernfs is enabled, indicating that persistent devices are enabled. Only persistent devices should have bus mastering left on during kexec but this serves as a rough approximation of the functionality needed for this pkernfs RFC. --- drivers/pci/pci-driver.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c index 51ec9e7e784f..131127967811 100644 --- a/drivers/pci/pci-driver.c +++ b/drivers/pci/pci-driver.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -519,7 +520,8 @@ static void pci_device_shutdown(struct device *dev) * If it is not a kexec reboot, firmware will hit the PCI * devices with big hammer and stop their DMA any way. */ - if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot)) + if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot) + && !pkernfs_enabled()) pci_clear_master(pci_dev); } From patchwork Mon Feb 5 12:02:03 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Gowans, James" X-Patchwork-Id: 13545374 Received: from smtp-fw-2101.amazon.com (smtp-fw-2101.amazon.com [72.21.196.25]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B873A21112; Mon, 5 Feb 2024 12:06:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=72.21.196.25 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134800; cv=none; b=YxJ6wdOQS0TZPvzRInTa1S5d99lTgL89aVwjyXgEE8L+sbQbVWR8cdMY2BdLvhUVOg6G6XZYkehkkd4vHlDHBpqXzir5wcaFGfmOzjd71JSOnLZ1YxjUC6qUCkVkWWbEmPM9Mgd3r5xcn/RgTW7MwLVRoVXFM125TYYXkp7Skw0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707134800; c=relaxed/simple; bh=yhYdhvkUj66URI+HxR9xU2mALUiQmweSZAQjAWuVHXw=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=LzgPUhIbuNVAexs0wKTrO0nqm1P+HU2wRm8r/0YBT9wphTVZRnWkGGDrDn49NgI3RmH+IjFj/Cy5iG/uOy9fyEWEYR1pfNR5hB61TcirbB86pRe/1qI50OGpObfLNteAx0wtke33E8dTPYFx6OlzzTFjG6bBjourBaKUL5ZRyiA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com; spf=pass smtp.mailfrom=amazon.com; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b=KS4I5EHo; arc=none smtp.client-ip=72.21.196.25 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=amazon.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=amazon.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=amazon.com header.i=@amazon.com header.b="KS4I5EHo" DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.com; i=@amazon.com; q=dns/txt; s=amazon201209; t=1707134799; x=1738670799; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=FiBUV1fiPqoy2U4uMejLoWWd/EiydMvzfGCqIJIuYJY=; b=KS4I5EHoPb/GLevhTJXhrqZGA19Gu0J1gr1dLxgNL42X9GGMElWCxcV6 TMkSKYaGibsuGrDtJ/BVIk12A5B2F6AeD63lMV41v1myZ+ZGI6khCfgDA KtYbjAX5a0RIQlGh00w6P5OeRBJ1H9xpdzO4LHNt/rm5sA/HwFOxllbTC U=; X-IronPort-AV: E=Sophos;i="6.05,245,1701129600"; d="scan'208";a="378967633" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO smtpout.prod.us-west-2.prod.farcaster.email.amazon.dev) ([10.43.8.6]) by smtp-border-fw-2101.iad2.amazon.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 05 Feb 2024 12:06:35 +0000 Received: from EX19MTAEUC002.ant.amazon.com [10.0.17.79:8094] by smtpin.naws.eu-west-1.prod.farcaster.email.amazon.dev [10.0.45.85:2525] with esmtp (Farcaster) id 26f74936-a3bc-4ca0-9d65-59fbb5471a6c; Mon, 5 Feb 2024 12:06:32 +0000 (UTC) X-Farcaster-Flow-ID: 26f74936-a3bc-4ca0-9d65-59fbb5471a6c Received: from EX19D014EUC004.ant.amazon.com (10.252.51.182) by EX19MTAEUC002.ant.amazon.com (10.252.51.181) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:06:32 +0000 Received: from dev-dsk-jgowans-1a-a3faec1f.eu-west-1.amazon.com (172.19.112.191) by EX19D014EUC004.ant.amazon.com (10.252.51.182) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1118.40; Mon, 5 Feb 2024 12:06:26 +0000 From: James Gowans To: CC: Eric Biederman , , "Joerg Roedel" , Will Deacon , , Alexander Viro , "Christian Brauner" , , Paolo Bonzini , Sean Christopherson , , Andrew Morton , , Alexander Graf , David Woodhouse , "Jan H . Schoenherr" , Usama Arif , Anthony Yznaga , Stanislav Kinsburskii , , , Subject: [RFC 18/18] vfio-pci: Assume device working after liveupdate Date: Mon, 5 Feb 2024 12:02:03 +0000 Message-ID: <20240205120203.60312-19-jgowans@amazon.com> X-Mailer: git-send-email 2.40.1 In-Reply-To: <20240205120203.60312-1-jgowans@amazon.com> References: <20240205120203.60312-1-jgowans@amazon.com> Precedence: bulk X-Mailing-List: kvm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: EX19D044UWB004.ant.amazon.com (10.13.139.134) To EX19D014EUC004.ant.amazon.com (10.252.51.182) When re-creating a VFIO device after liveupdate no desctructive actions should be taken on it to avoid interrupting any ongoing DMA. Specifically bus mastering should not be cleared and the device should not be reset. Assume that reset works properly and skip over bus mastering reset. Ideally this would only be done for persistent devices but in this rough RFC there currently is no mechanism at this point to easily tell if a device is persisted or not. --- drivers/vfio/pci/vfio_pci_core.c | 20 +++++++++++++------- 1 file changed, 13 insertions(+), 7 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c index 1929103ee59a..a7f56d43e0a4 100644 --- a/drivers/vfio/pci/vfio_pci_core.c +++ b/drivers/vfio/pci/vfio_pci_core.c @@ -480,19 +480,25 @@ int vfio_pci_core_enable(struct vfio_pci_core_device *vdev) return ret; } - /* Don't allow our initial saved state to include busmaster */ - pci_clear_master(pdev); + if (!liveupdate) { + /* Don't allow our initial saved state to include busmaster */ + pci_clear_master(pdev); + } ret = pci_enable_device(pdev); if (ret) goto out_power; - /* If reset fails because of the device lock, fail this path entirely */ - ret = pci_try_reset_function(pdev); - if (ret == -EAGAIN) - goto out_disable_device; + if (!liveupdate) { + /* If reset fails because of the device lock, fail this path entirely */ + ret = pci_try_reset_function(pdev); + if (ret == -EAGAIN) + goto out_disable_device; - vdev->reset_works = !ret; + vdev->reset_works = !ret; + } else { + vdev->reset_works = 1; + } pci_save_state(pdev); vdev->pci_saved_state = pci_store_saved_state(pdev); if (!vdev->pci_saved_state)