From patchwork Thu Mar 10 14:08:59 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776417
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org
Subject: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag
Date: Thu, 10 Mar 2022 22:08:59 +0800
Message-Id: <20220310140911.50924-2-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>

From: "Kirill A. Shutemov"

Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly.
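For illustration only (not part of the patch): a minimal userspace sketch that creates such a memfd. MFD_INACCESSIBLE is the value proposed by this patch and is not in released uapi headers, so it is defined locally; the expected mmap() failure follows from the shmem changes below.

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>	/* memfd_create(), MFD_CLOEXEC (glibc >= 2.27) */

	#ifndef MFD_INACCESSIBLE
	#define MFD_INACCESSIBLE 0x0008U	/* value proposed by this patch */
	#endif

	int main(void)
	{
		int fd = memfd_create("guest-private", MFD_CLOEXEC | MFD_INACCESSIBLE);
		if (fd < 0) {
			perror("memfd_create");
			return 1;
		}

		/* Size must be page-aligned for an inaccessible memfd. */
		if (ftruncate(fd, 2UL << 20) < 0)
			perror("ftruncate");

		/* Ordinary MMU access is blocked: mmap() is expected to fail. */
		if (mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0) == MAP_FAILED)
			perror("mmap (expected failure)");

		/* The fd would instead be handed to KVM via a private memslot
		 * (see the KVM_MEM_PRIVATE patches later in this series). */
		close(fd);
		return 0;
	}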
It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. Since page migration/swapping is not yet supported for such usages so these pages are currently marked as UNMOVABLE and UNEVICTABLE which makes them behave like long-term pinned pages. The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag. At this time only shmem implements this flag. Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng --- include/linux/shmem_fs.h | 7 +++++ include/uapi/linux/memfd.h | 1 + mm/memfd.c | 26 +++++++++++++++-- mm/shmem.c | 57 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 88 insertions(+), 3 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index e65b80ed09e7..2dde843f28ef 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -12,6 +12,9 @@ /* inode in-kernel data */ +/* shmem extended flags */ +#define SHM_F_INACCESSIBLE 0x0001 /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */ + struct shmem_inode_info { spinlock_t lock; unsigned int seals; /* shmem seals */ @@ -24,6 +27,7 @@ struct shmem_inode_info { struct shared_policy policy; /* NUMA memory alloc policy */ struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ + unsigned int xflags; /* shmem extended flags */ struct inode vfs_inode; }; @@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags); extern struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned long flags); +extern struct file *shmem_file_setup_xflags(const char *name, loff_t size, + unsigned long flags, + unsigned int xflags); extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name, loff_t size, unsigned long flags); extern int shmem_zero_setup(struct vm_area_struct *); diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..48750474b904 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,7 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +#define MFD_INACCESSIBLE 0x0008U /* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/mm/memfd.c b/mm/memfd.c index 9f80f162791a..74d45a26cf5d 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \ + MFD_INACCESSIBLE) SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) { + struct address_space *mapping; unsigned int *file_seals; + unsigned int xflags; struct file *file; int fd, error; char *name; + gfp_t gfp; long len; if (!(flags & MFD_HUGETLB)) { @@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; } + /* Disallow sealing when MFD_INACCESSIBLE is set. 
*/ + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING) + return -EINVAL; + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -301,8 +309,11 @@ SYSCALL_DEFINE2(memfd_create, HUGETLB_ANONHUGE_INODE, (flags >> MFD_HUGE_SHIFT) & MFD_HUGE_MASK); - } else - file = shmem_file_setup(name, 0, VM_NORESERVE); + } else { + xflags = flags & MFD_INACCESSIBLE ? SHM_F_INACCESSIBLE : 0; + file = shmem_file_setup_xflags(name, 0, VM_NORESERVE, xflags); + } + if (IS_ERR(file)) { error = PTR_ERR(file); goto err_fd; @@ -313,6 +324,15 @@ SYSCALL_DEFINE2(memfd_create, if (flags & MFD_ALLOW_SEALING) { file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; + } else if (flags & MFD_INACCESSIBLE) { + mapping = file_inode(file)->i_mapping; + gfp = mapping_gfp_mask(mapping); + gfp &= ~__GFP_MOVABLE; + mapping_set_gfp_mask(mapping, gfp); + mapping_set_unevictable(mapping); + + file_seals = memfd_file_seals_ptr(file); + *file_seals = F_SEAL_SEAL; } fd_install(fd, file); diff --git a/mm/shmem.c b/mm/shmem.c index a09b29ec2b45..9b31a7056009 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1084,6 +1084,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns, (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; + if (info->xflags & SHM_F_INACCESSIBLE) { + if(oldsize) + return -EPERM; + if (!PAGE_ALIGNED(newsize)) + return -EINVAL; + } + if (newsize != oldsize) { error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); @@ -1331,6 +1338,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; if (!total_swap_pages) goto redirty; + if (info->xflags & SHM_F_INACCESSIBLE) + goto redirty; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -2228,6 +2237,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; + if (info->xflags & SHM_F_INACCESSIBLE) + return -EPERM; + /* arm64 - allow memory tagging on RAM-based files */ vma->vm_flags |= VM_MTE_ALLOWED; @@ -2433,6 +2445,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping, if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; } + if (unlikely(info->xflags & SHM_F_INACCESSIBLE)) + return -EPERM; ret = shmem_getpage(inode, index, pagep, SGP_WRITE); @@ -2517,6 +2531,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) end_index = i_size >> PAGE_SHIFT; if (index > end_index) break; + + /* + * inode_lock protects setting up seals as well as write to + * i_size. Setting SHM_F_INACCESSIBLE only allowed with + * i_size == 0. + * + * Check SHM_F_INACCESSIBLE after i_size. It effectively + * serialize read vs. setting SHM_F_INACCESSIBLE without + * taking inode_lock in read path. 
+ */ + if (SHMEM_I(inode)->xflags & SHM_F_INACCESSIBLE) { + error = -EPERM; + break; + } + if (index == end_index) { nr = i_size & ~PAGE_MASK; if (nr <= offset) @@ -2648,6 +2677,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->xflags & SHM_F_INACCESSIBLE) && + (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) { + error = -EINVAL; + goto out; + } + shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; @@ -4082,6 +4117,28 @@ struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned lon return __shmem_file_setup(shm_mnt, name, size, flags, S_PRIVATE); } +/** + * shmem_file_setup_xflags - get an unlinked file living in tmpfs with + * additional xflags. + * @name: name for dentry (to be seen in /proc//maps + * @size: size to be set for the file + * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size + * @xflags: SHM_F_INACCESSIBLE prevents ordinary MMU access to the file content + */ + +struct file *shmem_file_setup_xflags(const char *name, loff_t size, + unsigned long flags, unsigned int xflags) +{ + struct shmem_inode_info *info; + struct file *res = __shmem_file_setup(shm_mnt, name, size, flags, 0); + + if(!IS_ERR(res)) { + info = SHMEM_I(file_inode(res)); + info->xflags = xflags & SHM_F_INACCESSIBLE; + } + return res; +} + /** * shmem_file_setup - get an unlinked file living in tmpfs * @name: name for dentry (to be seen in /proc//maps From patchwork Thu Mar 10 14:09:00 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776418 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B2B6CC4332F for ; Thu, 10 Mar 2022 14:09:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242718AbiCJOK4 (ORCPT ); Thu, 10 Mar 2022 09:10:56 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48310 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242722AbiCJOKz (ORCPT ); Thu, 10 Mar 2022 09:10:55 -0500 Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E4E467084C; Thu, 10 Mar 2022 06:09:52 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921392; x=1678457392; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=GbR4wRPHIwdHjPFQakoEmpS8z2SMsND5ZXVQtQDRCQY=; b=hA3tdBhcGqi8ytuzhUBoRTMFeCQy5ZvXGDVcvVklFx9Mzoc0k7SvrL4j Amy+XPJ16/730KNr35FxgRBC4QC3oxZFXOm+ghs4KrywbbziPrLHW6DMP zUSm2WQoqqfiAVMMczYRNM1eJxMQnFEwUSUL9X81XJb6kD9hUM+FT5o4S cdYVBy0MwfEkzu2EaH0ctcBTxBE55ZEj491jibFFeKvyjUZJlwyzsvrwZ sw0xTnelQ/GaTkCU5JwvlgkPec3taTovVJMffZR5dVP6VJc6Mm/nliTQT zcLFstjuxtCqJ1yWExzNYrB/JF5ubpMxyCcQ7VOQ9DE92UVCaoHkneRKP Q==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="235205942" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="235205942" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:09:52 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654831" Received: from chaop.bj.intel.com ([10.240.192.101]) by 
orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:09:44 -0800
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org
Subject: [PATCH v5 02/13] mm: Introduce memfile_notifier
Date: Thu, 10 Mar 2022 22:09:00 +0800
Message-Id: <20220310140911.50924-3-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>

This patch introduces the memfile_notifier facility so that existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to a third kernel component, which is notified when pages in the memory file are allocated or invalidated. It will be used by KVM to take a file descriptor as the guest memory backing store: KVM uses the memfile_notifier interface to interact with the memory file subsystems. In the future there might be other consumers (e.g. VFIO with encrypted device memory).

It consists of two sets of callbacks:

- memfile_notifier_ops: callbacks for the memory backing store to notify KVM when memory gets allocated/invalidated.
- memfile_pfn_ops: callbacks for KVM to call into the memory backing store to request memory pages for guest private memory.

Userspace is in charge of the guest memory lifecycle: it first allocates pages in the memory backing store and then passes the fd to KVM, letting KVM register each memory slot with the backing store via memfile_register_notifier. The memory backing store should maintain a memfile_notifier list and provide a routine for memfile_notifier to obtain the list head address and the memfile_pfn_ops callbacks for memfile_register_notifier. It should also call memfile_notifier_fallocate()/memfile_notifier_invalidate() when the bookmarked memory gets allocated/invalidated.

Co-developed-by: Kirill A. Shutemov
Signed-off-by: Kirill A.
Shutemov Signed-off-by: Chao Peng --- include/linux/memfile_notifier.h | 64 +++++++++++++++++ mm/Kconfig | 4 ++ mm/Makefile | 1 + mm/memfile_notifier.c | 114 +++++++++++++++++++++++++++++++ 4 files changed, 183 insertions(+) create mode 100644 include/linux/memfile_notifier.h create mode 100644 mm/memfile_notifier.c diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h new file mode 100644 index 000000000000..e8d400558adb --- /dev/null +++ b/include/linux/memfile_notifier.h @@ -0,0 +1,64 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMFILE_NOTIFIER_H +#define _LINUX_MEMFILE_NOTIFIER_H + +#include +#include +#include +#include + +struct memfile_notifier; + +struct memfile_notifier_ops { + void (*invalidate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); + void (*fallocate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); +}; + +struct memfile_pfn_ops { + long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order); + void (*put_unlock_pfn)(unsigned long pfn); +}; + +struct memfile_notifier { + struct list_head list; + struct memfile_notifier_ops *ops; +}; + +struct memfile_notifier_list { + struct list_head head; + spinlock_t lock; +}; + +struct memfile_backing_store { + struct list_head list; + struct memfile_pfn_ops pfn_ops; + struct memfile_notifier_list* (*get_notifier_list)(struct inode *inode); +}; + +#ifdef CONFIG_MEMFILE_NOTIFIER +/* APIs for backing stores */ +static inline void memfile_notifier_list_init(struct memfile_notifier_list *list) +{ + INIT_LIST_HEAD(&list->head); + spin_lock_init(&list->lock); +} + +extern void memfile_notifier_invalidate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end); +extern void memfile_notifier_fallocate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end); +extern void memfile_register_backing_store(struct memfile_backing_store *bs); +extern void memfile_unregister_backing_store(struct memfile_backing_store *bs); + +/*APIs for notifier consumers */ +extern int memfile_register_notifier(struct inode *inode, + struct memfile_notifier *notifier, + struct memfile_pfn_ops **pfn_ops); +extern void memfile_unregister_notifier(struct inode *inode, + struct memfile_notifier *notifier); + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + +#endif /* _LINUX_MEMFILE_NOTIFIER_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 3326ee3903f3..7c6b1ad3dade 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -892,6 +892,10 @@ config ANON_VMA_NAME area from being merged with adjacent virtual memory areas due to the difference in their name. +config MEMFILE_NOTIFIER + bool + select SRCU + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index 70d4309c9ce3..f628256dce0d 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -132,3 +132,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o +obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c new file mode 100644 index 000000000000..a405db56fde2 --- /dev/null +++ b/mm/memfile_notifier.c @@ -0,0 +1,114 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * linux/mm/memfile_notifier.c + * + * Copyright (C) 2022 Intel Corporation. 
+ * Chao Peng + */ + +#include +#include + +DEFINE_STATIC_SRCU(srcu); +static LIST_HEAD(backing_store_list); + +void memfile_notifier_invalidate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&srcu); + list_for_each_entry_srcu(notifier, &list->head, list, + srcu_read_lock_held(&srcu)) { + if (notifier->ops && notifier->ops->invalidate) + notifier->ops->invalidate(notifier, start, end); + } + srcu_read_unlock(&srcu, id); +} + +void memfile_notifier_fallocate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&srcu); + list_for_each_entry_srcu(notifier, &list->head, list, + srcu_read_lock_held(&srcu)) { + if (notifier->ops && notifier->ops->fallocate) + notifier->ops->fallocate(notifier, start, end); + } + srcu_read_unlock(&srcu, id); +} + +void memfile_register_backing_store(struct memfile_backing_store *bs) +{ + BUG_ON(!bs || !bs->get_notifier_list); + + list_add_tail(&bs->list, &backing_store_list); +} + +void memfile_unregister_backing_store(struct memfile_backing_store *bs) +{ + list_del(&bs->list); +} + +static int memfile_get_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops) +{ + struct memfile_backing_store *bs, *iter; + struct memfile_notifier_list *tmp; + + list_for_each_entry_safe(bs, iter, &backing_store_list, list) { + tmp = bs->get_notifier_list(inode); + if (tmp) { + *list = tmp; + if (ops) + *ops = &bs->pfn_ops; + return 0; + } + } + return -EOPNOTSUPP; +} + +int memfile_register_notifier(struct inode *inode, + struct memfile_notifier *notifier, + struct memfile_pfn_ops **pfn_ops) +{ + struct memfile_notifier_list *list; + int ret; + + if (!inode || !notifier | !pfn_ops) + return -EINVAL; + + ret = memfile_get_notifier_info(inode, &list, pfn_ops); + if (ret) + return ret; + + spin_lock(&list->lock); + list_add_rcu(¬ifier->list, &list->head); + spin_unlock(&list->lock); + + return 0; +} +EXPORT_SYMBOL_GPL(memfile_register_notifier); + +void memfile_unregister_notifier(struct inode *inode, + struct memfile_notifier *notifier) +{ + struct memfile_notifier_list *list; + + if (!inode || !notifier) + return; + + BUG_ON(memfile_get_notifier_info(inode, &list, NULL)); + + spin_lock(&list->lock); + list_del_rcu(¬ifier->list); + spin_unlock(&list->lock); + + synchronize_srcu(&srcu); +} +EXPORT_SYMBOL_GPL(memfile_unregister_notifier); From patchwork Thu Mar 10 14:09:01 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776419 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2C85CC433F5 for ; Thu, 10 Mar 2022 14:10:12 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242731AbiCJOLH (ORCPT ); Thu, 10 Mar 2022 09:11:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49200 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242724AbiCJOLF (ORCPT ); Thu, 10 Mar 2022 09:11:05 -0500 Received: from mga06.intel.com (mga06.intel.com [134.134.136.31]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BF44E7B572; Thu, 10 Mar 2022 06:10:00 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; 
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org
Subject: [PATCH v5 03/13] mm/shmem: Support memfile_notifier
Date: Thu, 10 Mar 2022 22:09:01 +0800
Message-Id: <20220310140911.50924-4-chao.p.peng@linux.intel.com>
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>

From: "Kirill A. Shutemov"

This patch maintains a memfile_notifier list in the shmem_inode_info structure and implements the memfile_pfn_ops callbacks defined by memfile_notifier. It then exposes them to memfile_notifier via shmem_get_memfile_notifier_info.

We use SGP_NOALLOC in shmem_get_lock_pfn() since the pages should be allocated by userspace for private memory. If no page has been allocated at the offset, an error is returned so that KVM knows the memory is not private memory.

Signed-off-by: Kirill A.
Shutemov Signed-off-by: Chao Peng --- include/linux/shmem_fs.h | 4 +++ mm/shmem.c | 76 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 80 insertions(+) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 2dde843f28ef..7bb16f2d2825 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,7 @@ #include #include #include +#include /* inode in-kernel data */ @@ -28,6 +29,9 @@ struct shmem_inode_info { struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ unsigned int xflags; /* shmem extended flags */ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct memfile_notifier_list memfile_notifiers; +#endif struct inode vfs_inode; }; diff --git a/mm/shmem.c b/mm/shmem.c index 9b31a7056009..7b43e274c9a2 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index) return page ? page_folio(page) : NULL; } +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end) +{ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct shmem_inode_info *info = SHMEM_I(inode); + + memfile_notifier_fallocate(&info->memfile_notifiers, start, end); +#endif +} + +static void notify_invalidate_page(struct inode *inode, struct folio *folio, + pgoff_t start, pgoff_t end) +{ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct shmem_inode_info *info = SHMEM_I(inode); + + start = max(start, folio->index); + end = min(end, folio->index + folio_nr_pages(folio)); + + memfile_notifier_invalidate(&info->memfile_notifiers, start, end); +#endif +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. @@ -946,6 +968,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, } index += folio_nr_pages(folio) - 1; + notify_invalidate_page(inode, folio, start, end); + if (!unfalloc || !folio_test_uptodate(folio)) truncate_inode_folio(mapping, folio); folio_unlock(folio); @@ -1019,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, index--; break; } + + notify_invalidate_page(inode, folio, start, end); + VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); truncate_inode_folio(mapping, folio); @@ -2279,6 +2306,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode info->flags = flags & VM_NORESERVE; INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); +#ifdef CONFIG_MEMFILE_NOTIFIER + memfile_notifier_list_init(&info->memfile_notifiers); +#endif simple_xattrs_init(&info->xattrs); cache_no_acl(inode); mapping_set_large_folios(inode->i_mapping); @@ -2802,6 +2832,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) i_size_write(inode, offset + len); inode->i_ctime = current_time(inode); + notify_fallocate(inode, start, end); undone: spin_lock(&inode->i_lock); inode->i_private = NULL; @@ -3909,6 +3940,47 @@ static struct file_system_type shmem_fs_type = { .fs_flags = FS_USERNS_MOUNT, }; +#ifdef CONFIG_MEMFILE_NOTIFIER +static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order) +{ + struct page *page; + int ret; + + ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC); + if (ret) + return ret; + + *order = thp_order(compound_head(page)); + + return page_to_pfn(page); +} + +static void shmem_put_unlock_pfn(unsigned long pfn) +{ + struct page *page = 
pfn_to_page(pfn); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + + set_page_dirty(page); + unlock_page(page); + put_page(page); +} + +static struct memfile_notifier_list* shmem_get_notifier_list(struct inode *inode) +{ + if (!shmem_mapping(inode->i_mapping)) + return NULL; + + return &SHMEM_I(inode)->memfile_notifiers; +} + +static struct memfile_backing_store shmem_backing_store = { + .pfn_ops.get_lock_pfn = shmem_get_lock_pfn, + .pfn_ops.put_unlock_pfn = shmem_put_unlock_pfn, + .get_notifier_list = shmem_get_notifier_list, +}; +#endif /* CONFIG_MEMFILE_NOTIFIER */ + int __init shmem_init(void) { int error; @@ -3934,6 +4006,10 @@ int __init shmem_init(void) else shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ #endif + +#ifdef CONFIG_MEMFILE_NOTIFIER + memfile_register_backing_store(&shmem_backing_store); +#endif return 0; out1: From patchwork Thu Mar 10 14:09:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776420 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1CE96C43217 for ; Thu, 10 Mar 2022 14:10:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242784AbiCJOLZ (ORCPT ); Thu, 10 Mar 2022 09:11:25 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49200 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242737AbiCJOLK (ORCPT ); Thu, 10 Mar 2022 09:11:10 -0500 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D66E68F637; Thu, 10 Mar 2022 06:10:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921408; x=1678457408; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=s3KeBsvI7L7pbUPmXrVsTjDsdOJOMmhGXQWiylgk8+Q=; b=Tyo/fwpvW4j2fqABwXK8nR0supuYn55ZCaw6lMNcrUJ//0q8eMGoDpiS l5+DsVjkv8kBNzuirDO2SeaxKfYXldV17NwY9hpU3sG6uzgCBUvG3cGWZ h9fxIX8qh6xqZ2ej/cCZ+AxKxMWoeqQmX9Uv6HuPT+FkVaVMdlf9APVsJ GGVUTzuKjWaOWA15vKy/+Xg+Eptb0N/Oi75WXYObe+zi+wfDgZ1fLDkkf to7zEWJzW/jQHggGunowEwrkai/ii15KqXTH7irLWcf9X2HJzlg1oGzt4 w/Hp4ZWCpzE4Zs964bqxLCoIkbsd496wjqUJwXd40Q+U9Xd6n6678relI w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="254084871" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="254084871" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:08 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654936" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:00 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK Date: Thu, 10 Mar 2022 22:09:02 +0800 Message-Id: <20220310140911.50924-5-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Since page migration / swapping is not supported yet, MFD_INACCESSIBLE memory behave like longterm pinned pages and thus should be accounted to mm->pinned_vm and be restricted by RLIMIT_MEMLOCK. Signed-off-by: Chao Peng --- mm/shmem.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 7b43e274c9a2..ae46fb96494b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end) static void notify_invalidate_page(struct inode *inode, struct folio *folio, pgoff_t start, pgoff_t end) { -#ifdef CONFIG_MEMFILE_NOTIFIER struct shmem_inode_info *info = SHMEM_I(inode); +#ifdef CONFIG_MEMFILE_NOTIFIER start = max(start, folio->index); end = min(end, folio->index + folio_nr_pages(folio)); memfile_notifier_invalidate(&info->memfile_notifiers, start, end); #endif + + if (info->xflags & SHM_F_INACCESSIBLE) + atomic64_sub(end - start, ¤t->mm->pinned_vm); } /* @@ -2680,6 +2683,20 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) return offset; } +static bool memlock_limited(unsigned long npages) +{ + unsigned long lock_limit; + unsigned long pinned; + + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + pinned = atomic64_add_return(npages, ¤t->mm->pinned_vm); + if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) { + atomic64_sub(npages, ¤t->mm->pinned_vm); + return true; + } + return false; +} + static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { @@ -2753,6 +2770,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->xflags & SHM_F_INACCESSIBLE) && + memlock_limited(end - start)) { + error = -ENOMEM; + goto out; + } + shmem_falloc.waitq = NULL; shmem_falloc.start = start; shmem_falloc.next = start; From patchwork Thu Mar 10 14:09:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776421 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A7B76C4332F for ; Thu, 10 Mar 2022 14:10:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242855AbiCJOLu (ORCPT ); Thu, 10 Mar 2022 09:11:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51240 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242754AbiCJOLa (ORCPT ); Thu, 10 Mar 2022 09:11:30 -0500 Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E330E14F98A; Thu, 10 Mar 2022 06:10:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921429; x=1678457429; 
h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=R9N/T0uA9Epd8O0DLQkw3YWNAAFsUHVjLA2F5YkEzAE=; b=mxMgDjuFIY4HLmTks+qcX2ZI77cQWZO2uUIekh3dDwzO4YZ6q9Lmw5sv Ip81L6ugwoNKMLW4HEEfrEfEXpHT5+s3OfyocpN0tL9K9GyMu3tkN3dWe hpFK6PT9KHHH8hdZpzIxnqZpxIYuDlBagnzU5n4hm8Bv3W4yvIY8r3Idc sJ6uaZO7FTac3980jCB6gFHmZhLIdttWlQdJJUFgNj5Kjp7Sx6ijVeW9A T/bm1XDme6cZg3hjEZzyEuvkonbhOtLed34R2mMx8Xog05aSvoao1Dljg Jrc6kcQ92buiYxJvhx7HrC8i/71WYHN+qjuNwFHJ89TUUudVB5y5uSG7k A==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="235206016" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="235206016" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:16 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654963" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:08 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory Date: Thu, 10 Mar 2022 22:09:03 +0800 Message-Id: <20220310140911.50924-6-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Extend the memslot definition to provide fd-based private memory support by adding two new fields (private_fd/private_offset). The memslot then can maintain memory for both shared pages and private pages in a single memslot. Shared pages are provided by existing userspace_addr(hva) field and private pages are provided through the new private_fd/private_offset fields. Since there is no 'hva' concept anymore for private memory so we cannot rely on get_user_pages() to get a pfn, instead we use the newly added memfile_notifier to complete the same job. This new extension is indicated by a new flag KVM_MEM_PRIVATE. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++------- include/linux/kvm_host.h | 7 +++++++ include/uapi/linux/kvm.h | 8 ++++++++ 3 files changed, 45 insertions(+), 7 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 3acbf4d263a5..f76ac598606c 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1307,7 +1307,7 @@ yet and must be cleared on entry. :Capability: KVM_CAP_USER_MEMORY :Architectures: all :Type: vm ioctl -:Parameters: struct kvm_userspace_memory_region (in) +:Parameters: struct kvm_userspace_memory_region(_ext) (in) :Returns: 0 on success, -1 on error :: @@ -1320,9 +1320,17 @@ yet and must be cleared on entry. 
__u64 userspace_addr; /* start of the userspace allocated memory */ }; + struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 padding[5]; +}; + /* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) + #define KVM_MEM_PRIVATE (1UL << 2) This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of "slot" specify the slot id and this value @@ -1353,12 +1361,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host. -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, -to make a new slot read-only. In this case, writes to this memory will be -posted to userspace as KVM_EXIT_MMIO exits. +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region +fields. It also includes additional fields for some specific features. See +below description of flags field for more information. It's recommended to use +kvm_userspace_memory_region_ext in new userspace code. + +The flags field supports below flags: + +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. + +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to + make a new slot read-only. In this case, writes to this memory will be posted + to userspace as KVM_EXIT_MMIO exits. + +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by + a file descirptor(fd) and the content of the private memory is invisible to + userspace. In this case, userspace should use private_fd/private_offset in + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to + guest. Userspace should guarantee not to map the same pfn indicated by + private_fd/private_offset to different gfns with multiple memslots. Failed to + do this may result undefined behavior. When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. 
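Editorial illustration (not part of the patch): a hypothetical VMM snippet registering a private memslot with the extended structure proposed here. KVM_MEM_PRIVATE and kvm_userspace_memory_region_ext come from this series and are not in released headers; vm_fd, priv_fd and shared_hva are assumed to already exist in the VMM, and includes (<sys/ioctl.h>, <linux/kvm.h>) are omitted.

	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot            = 0,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = 0x100000000ULL,
			.memory_size     = 512UL << 20,
			.userspace_addr  = (__u64)shared_hva,	/* shared pages */
		},
		.private_fd     = priv_fd,	/* MFD_INACCESSIBLE memfd */
		.private_offset = 0,
	};

	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext) < 0)
		perror("KVM_SET_USER_MEMORY_REGION");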
For example, an diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 9536ffa0473b..3be8116079d4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -563,8 +563,15 @@ struct kvm_memory_slot { u32 flags; short id; u16 as_id; + struct file *private_file; + loff_t private_offset; }; +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) +{ + return slot && (slot->flags & KVM_MEM_PRIVATE); +} + static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) { return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 91a6fe4e02c0..a523d834efc8 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -103,6 +103,13 @@ struct kvm_userspace_memory_region { __u64 userspace_addr; /* start of the userspace allocated memory */ }; +struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 padding[5]; +}; + /* * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, * other bits are reserved for kvm internal use which are defined in @@ -110,6 +117,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_PRIVATE (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { From patchwork Thu Mar 10 14:09:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776422 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id EE67CC433F5 for ; Thu, 10 Mar 2022 14:10:52 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242868AbiCJOLv (ORCPT ); Thu, 10 Mar 2022 09:11:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51062 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242760AbiCJOL1 (ORCPT ); Thu, 10 Mar 2022 09:11:27 -0500 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4C66F9A4D4; Thu, 10 Mar 2022 06:10:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921425; x=1678457425; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=aXDmGBTRlSAIbebZ1eKgmeAE+5/9H+lFwtsMHbxfleU=; b=Z4VC3B9yaIDESl8IUUm7CWKeW5VI6Gy+C9mvHguhOTBm2d2Y6YcToDCJ ukrwNLoy9+V4A23zgQZnbKdp2JBre5P1r45C8i0nTL4AiEAnYjImlPzxF j2RZCdeatisICJLy3cpI26A4rMeQ6DwnbPiRq4mS3wGZfqYSPvC9n3+G0 yKrphlTd8LiScJNN79C1qGinNO5Vse3NVur9/27lp/O+yw82haIuLzLIX YcD+PkJhk9+p4yh97XyfxSnFoYJvJb3NjWIKgYOOTodBVUwq5t+qN0bqm 9wW0CevEEJ2BsBXsWwVT/01Lm0XhXVWIdXTk+jQUS3eWMHgHgNWk2l/bR w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="252823460" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="252823460" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:24 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655000" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:16 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, 
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext Date: Thu, 10 Mar 2022 22:09:04 +0800 Message-Id: <20220310140911.50924-7-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Use the new extended memslot structure kvm_userspace_memory_region_ext. The extended part (private_fd/ private_offset) will be copied from userspace only when KVM_MEM_PRIVATE is set. Internally old kvm_userspace_memory_region will still be used for places where the extended fields are not needed. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/x86.c | 12 ++++++------ include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 30 ++++++++++++++++++++---------- 3 files changed, 28 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8c06b8204fca..1d9dbef67715 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11757,13 +11757,13 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, } for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - struct kvm_userspace_memory_region m; + struct kvm_userspace_memory_region_ext m; - m.slot = id | (i << 16); - m.flags = 0; - m.guest_phys_addr = gpa; - m.userspace_addr = hva; - m.memory_size = size; + m.region.slot = id | (i << 16); + m.region.flags = 0; + m.region.guest_phys_addr = gpa; + m.region.userspace_addr = hva; + m.region.memory_size = size; r = __kvm_set_memory_region(kvm, &m); if (r < 0) return ERR_PTR_USR(r); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3be8116079d4..c92c70174248 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1082,9 +1082,9 @@ enum kvm_mr_change { }; int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_userspace_memory_region_ext *region_ext); int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_userspace_memory_region_ext *region_ext); void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); int kvm_arch_prepare_memory_region(struct kvm *kvm, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 69c318fdff61..d11a2628b548 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1809,8 +1809,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, * Must be called holding kvm->slots_lock for write. 
*/ int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_userspace_memory_region_ext *region_ext) { + const struct kvm_userspace_memory_region *mem = ®ion_ext->region; struct kvm_memory_slot *old, *new; struct kvm_memslots *slots; enum kvm_mr_change change; @@ -1913,24 +1914,24 @@ int __kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(__kvm_set_memory_region); int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_userspace_memory_region_ext *region_ext) { int r; mutex_lock(&kvm->slots_lock); - r = __kvm_set_memory_region(kvm, mem); + r = __kvm_set_memory_region(kvm, region_ext); mutex_unlock(&kvm->slots_lock); return r; } EXPORT_SYMBOL_GPL(kvm_set_memory_region); static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, - struct kvm_userspace_memory_region *mem) + struct kvm_userspace_memory_region_ext *region_ext) { - if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) + if ((u16)region_ext->region.slot >= KVM_USER_MEM_SLOTS) return -EINVAL; - return kvm_set_memory_region(kvm, mem); + return kvm_set_memory_region(kvm, region_ext); } #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT @@ -4476,14 +4477,23 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: { - struct kvm_userspace_memory_region kvm_userspace_mem; + struct kvm_userspace_memory_region_ext region_ext; r = -EFAULT; - if (copy_from_user(&kvm_userspace_mem, argp, - sizeof(kvm_userspace_mem))) + if (copy_from_user(®ion_ext, argp, + sizeof(struct kvm_userspace_memory_region))) goto out; + if (region_ext.region.flags & KVM_MEM_PRIVATE) { + int offset = offsetof( + struct kvm_userspace_memory_region_ext, + private_offset); + if (copy_from_user(®ion_ext.private_offset, + argp + offset, + sizeof(region_ext) - offset)) + goto out; + } - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); + r = kvm_vm_ioctl_set_memory_region(kvm, ®ion_ext); break; } case KVM_GET_DIRTY_LOG: { From patchwork Thu Mar 10 14:09:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776423 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7B061C433FE for ; Thu, 10 Mar 2022 14:11:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242825AbiCJOMD (ORCPT ); Thu, 10 Mar 2022 09:12:03 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51730 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242009AbiCJOLh (ORCPT ); Thu, 10 Mar 2022 09:11:37 -0500 Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 88B38151C74; Thu, 10 Mar 2022 06:10:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921433; x=1678457433; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=Z4vQdKuwJGnUEOHT5u0YKIBAyF/sX/jaF7InK/l4l/o=; b=a//0wszSMmmgEDRiNXJlX0hrN8yPbfqmcHc1/LB5ZZ0bNqNi0PxAobKJ hPht3jZ+OsAmhn5kISR0FfnHblo8wQ5ll9TY42QGel7FbaJOdGagJEn6u X+0F14BeNdX8UVg6bYDLV8gs4+MW1YLXc90nX4/FArevUQh4btsUhtrEo jM6m4e0gkawwQszQJaYsNHElFFiySS4Fk86yjy8JtlSLm4BLMyemTUcwu ShtYmhgfaCrzJBr0MfKddByffI9J/oE5TrO7hT47YwtLzYcnMQJqpsQJ1 
KjjoObJMY5U3z2h65QchwqQ23RnYvejub3UEsLv0j4MyV7qyPmhgAp2gc Q==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="255448395" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="255448395" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:33 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655053" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:24 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit Date: Thu, 10 Mar 2022 22:09:05 +0800 Message-Id: <20220310140911.50924-8-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This new KVM exit allows userspace to handle memory-related errors. It indicates an error happens in KVM at guest memory range [gpa, gpa+size). The flags includes additional information for userspace to handle the error. Currently bit 0 is defined as 'private memory' where '1' indicates error happens due to private memory access and '0' indicates error happens due to shared memory access. After private memory is enabled, this new exit will be used for KVM to exit to userspace for shared memory <-> private memory conversion in memory encryption usage. In such usage, typically there are two kind of memory conversions: - explicit conversion: happens when guest explicitly calls into KVM to map a range (as private or shared), KVM then exits to userspace to do the map/unmap operations. - implicit conversion: happens in KVM page fault handler. * if the fault is due to a private memory access then causes a userspace exit for a shared->private conversion request when the page has not been allocated in the private memory backend. * If the fault is due to a shared memory access then causes a userspace exit for a private->shared conversion request when the page has already been allocated in the private memory backend. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++ include/uapi/linux/kvm.h | 9 +++++++++ 2 files changed, 31 insertions(+) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index f76ac598606c..bad550c2212b 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6216,6 +6216,28 @@ array field represents return values. The userspace should update the return values of SBI call before resuming the VCPU. For more details on RISC-V SBI spec refer, https://github.com/riscv/riscv-sbi-doc. 
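Illustration (not part of the patch): one way a VMM could react to this exit, assuming the uapi proposed here. gpa_to_private_offset() is a hypothetical VMM helper, and fallocate()/hole-punch on the private fd is just one possible conversion policy.

	static int handle_memory_error(int private_fd, struct kvm_run *run)
	{
		__u64 off = gpa_to_private_offset(run->memory.gpa);	/* hypothetical helper */

		if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
			/* fault on private memory: allocate backing in the private fd */
			return fallocate(private_fd, 0, off, run->memory.size);

		/* fault on shared memory: release the private backing for the range */
		return fallocate(private_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 off, run->memory.size);
	}

After the conversion the VMM re-enters the guest with KVM_RUN so the faulting access can be retried.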
+:: + + /* KVM_EXIT_MEMORY_ERROR */ + struct { + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; +If exit reason is KVM_EXIT_MEMORY_ERROR then it indicates that the VCPU has +encountered a memory error which is not handled by KVM kernel module and +userspace may choose to handle it. The 'flags' field indicates the memory +properties of the exit. + + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by + private memory access when the bit is set otherwise the memory error is + caused by shared memory access when the bit is clear. + +'gpa' and 'size' indicate the memory range the error occurs at. The userspace +may handle the error and return to KVM to retry the previous memory access. + :: /* Fix the size of the union. */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index a523d834efc8..9ad0c8aa0263 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -278,6 +278,7 @@ struct kvm_xen_exit { #define KVM_EXIT_X86_BUS_LOCK 33 #define KVM_EXIT_XEN 34 #define KVM_EXIT_RISCV_SBI 35 +#define KVM_EXIT_MEMORY_ERROR 36 /* For KVM_EXIT_INTERNAL_ERROR */ /* Emulate instruction failed. */ @@ -495,6 +496,14 @@ struct kvm_run { unsigned long args[6]; unsigned long ret[2]; } riscv_sbi; + /* KVM_EXIT_MEMORY_ERROR */ + struct { +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; /* Fix the size of the union. */ char padding[256]; }; From patchwork Thu Mar 10 14:09:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776426 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4A24FC43217 for ; Thu, 10 Mar 2022 14:13:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243082AbiCJOON (ORCPT ); Thu, 10 Mar 2022 09:14:13 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50554 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243043AbiCJONj (ORCPT ); Thu, 10 Mar 2022 09:13:39 -0500 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 33BB9158781; Thu, 10 Mar 2022 06:11:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921494; x=1678457494; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=/0lWU+5EyoSjlOB1k9o55wX3sxsJiR5VJJRuHs8r7qY=; b=kiYvPnW9Gj7SQrP24WbI06ECFPfV2PqyoFFes50+7S86EZlVoYo7g0Jx jYL0+D2UIylthTW7Aw3zTuAs5o660VG2VDDGcDryyAfhXmhIgvJdPAl8A aJojElt2ht3053BNck435zMaBVB5t/QxOWfBb0sZdduj+iEuH0TMcdlRo dVbqlVuSo/9IqGNcc3okXn3h5UMvLItMzfvWTn2Scq/5Nr52PKDo1uC9W P6etffT2Y22yZqN30E3aWg1ihKLdCQ/qKLfi8FDaxZ1vWTc164ge7T8NK qcBa8KAk/DXSfNwUsgYLGdAYpV2XN25Ht9BQivzLaDSplSTk2a1R8hdO+ A==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="242702683" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="242702683" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:40 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655084" Received: from chaop.bj.intel.com ([10.240.192.101]) by 
orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:32 -0800
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org
Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com
Subject: [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages
Date: Thu, 10 Mar 2022 22:09:06 +0800
Message-Id: <20220310140911.50924-9-chao.p.peng@linux.intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
Precedence: bulk
List-ID: X-Mailing-List: kvm@vger.kernel.org

Private pages are not mmapped into userspace, so KVM cannot rely on
get_user_pages() to obtain the pfn. Instead we add a memfile_pfn_ops
pointer, pfn_ops, to each private memslot and use it to obtain the pfn for
a gfn. To do that, KVM converts the gfn to an offset into the fd and then
calls the get_lock_pfn callback. Once KVM completes its job it calls
put_unlock_pfn to unlock the pfn. Note the pfn (page) stays locked between
get_lock_pfn/put_unlock_pfn to ensure the pfn remains valid while KVM uses
it to establish the mapping in the secondary MMU page table.

The pfn_ops is initialized via memfile_register_notifier from the memory
backing store that provided the private_fd.

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 arch/x86/kvm/Kconfig     |  1 +
 include/linux/kvm_host.h | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd7706136..ca7b2a6a452a 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -48,6 +48,7 @@ config KVM select SRCU select INTERVAL_TREE select HAVE_KVM_PM_NOTIFIER if PM + select MEMFILE_NOTIFIER help Support hosting fully virtualized guest machines using hardware virtualization extensions.
You will need a fairly recent diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c92c70174248..6e1d770d6bf8 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -44,6 +44,7 @@ #include #include +#include #ifndef KVM_MAX_VCPU_IDS #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS @@ -565,6 +566,7 @@ struct kvm_memory_slot { u16 as_id; struct file *private_file; loff_t private_offset; + struct memfile_pfn_ops *pfn_ops; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) @@ -915,6 +917,7 @@ static inline void kvm_irqfd_exit(void) { } #endif + int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align, struct module *module); void kvm_exit(void); @@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu) /* Max number of entries allowed for each kvm dirty ring */ #define KVM_DIRTY_RING_MAX_ENTRIES 65536 +#ifdef CONFIG_MEMFILE_NOTIFIER +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, + int *order) +{ + pgoff_t index = gfn - slot->base_gfn + + (slot->private_offset >> PAGE_SHIFT); + + return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file), + index, order); +} + +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ + slot->pfn_ops->put_unlock_pfn(pfn); +} + +#else +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, + int *order) +{ + return -1; +} + +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ +} +#endif /* CONFIG_MEMFILE_NOTIFIER */ + #endif From patchwork Thu Mar 10 14:09:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776424 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 78D05C43217 for ; Thu, 10 Mar 2022 14:12:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S237282AbiCJONB (ORCPT ); Thu, 10 Mar 2022 09:13:01 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:50884 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242937AbiCJOMg (ORCPT ); Thu, 10 Mar 2022 09:12:36 -0500 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6AA6414A6D8; Thu, 10 Mar 2022 06:10:49 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921449; x=1678457449; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=uGzkzAL8Dy+iBgRDUMsNImudpdJ03OH3v6JTl1qmo+Y=; b=edGhZH26HZ23srmHhGqL9ZTiw9+LcvTQlWQDYLS0sYqi4kK/fczuekCy VtzOhzbvcMglkLZLnH69Xz0tb/6SmZmWXjxi2MKs/B78RsML1KhKOA6Pf Tz5U70ZWNb07Cbuec56y2WdTTMq9ncB5yqRzMgadqrlnNHX/V+o3hrtos y0L266Vo1jD+u/xhLADMRklV+Qqo0kYJl35lmkaW1aaHoGBoMR6gTZKSQ Ik2CJxwGB2ELVKWpNbGHAsvVkhgqo8lf1nqmTC24G+x2NzdRac/jI68jd 1OXa7aW/Jwou7HhyWB1W7NgAzFr+chiv6E80TK85IJIAHUTwwzxHUzml0 A==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="254085038" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="254085038" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:49 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; 
d="scan'208";a="554655113"
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org
Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com
Subject: [PATCH v5 09/13] KVM: Handle page fault for private memory
Date: Thu, 10 Mar 2022 22:09:07 +0800
Message-Id: <20220310140911.50924-10-chao.p.peng@linux.intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
Precedence: bulk
List-ID: X-Mailing-List: kvm@vger.kernel.org

When a page fault happens for a memslot with KVM_MEM_PRIVATE, we use
kvm_memfile_get_pfn(), which further calls into the memfile_pfn_ops
callbacks defined for each memslot, to request the pfn from the memory
backing store. One assumption is that private pages are persistent and
pre-allocated in the private memory fd (backing store), so KVM uses this
information as an indicator of whether a page is private or shared
(i.e. the private fd is the final source of truth as to whether or not a
GPA is private).

Depending on whether the access is private or shared, we take different
paths:
- For private access, KVM checks if the page is already allocated in the
  memory backing store; if yes, KVM establishes the mapping, otherwise it
  exits to userspace to convert a shared page to a private one.
- For shared access, KVM also checks if the page is already allocated in
  the memory backing store; if yes, it exits to userspace to convert a
  private page to a shared one, otherwise the page is treated as
  traditional hva-based shared memory and KVM lets the existing code obtain
  a pfn with get_user_pages() and establish the mapping.

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 arch/x86/kvm/mmu/mmu.c         | 73 ++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h | 11 +++--
 2 files changed, 77 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3b8da8b0745e..f04c823ea09a 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2844,6 +2844,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, if (max_level == PG_LEVEL_4K) return PG_LEVEL_4K; + if (kvm_slot_is_private(slot)) + return max_level; + host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); return min(host_level, max_level); }
@@ -3890,7 +3893,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); } -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r) +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + /* + * At this time private gfn has not been supported yet. Other patch + * that enables it should change this.
+ */ + return false; +} + +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault, + bool *is_private_pfn, int *r) +{ + int order; + unsigned int flags = 0; + struct kvm_memory_slot *slot = fault->slot; + long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order); + + if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) { + if (pfn < 0) + flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE; + else { + fault->pfn = pfn; + if (slot->flags & KVM_MEM_READONLY) + fault->map_writable = false; + else + fault->map_writable = true; + + if (order == 0) + fault->max_level = PG_LEVEL_4K; + *is_private_pfn = true; + *r = RET_PF_FIXED; + return true; + } + } else { + if (pfn < 0) + return false; + + kvm_memfile_put_pfn(slot, pfn); + } + + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR; + vcpu->run->memory.flags = flags; + vcpu->run->memory.padding = 0; + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT; + vcpu->run->memory.size = PAGE_SIZE; + fault->pfn = -1; + *r = -1; + return true; +} + +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + bool *is_private_pfn, int *r) { struct kvm_memory_slot *slot = fault->slot; bool async; @@ -3924,6 +3979,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, } } + if (kvm_slot_is_private(slot) && + kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r)) + return *r == RET_PF_FIXED ? false : true; + async = false; fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async, fault->write, &fault->map_writable, @@ -3984,6 +4043,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu); unsigned long mmu_seq; + bool is_private_pfn = false; int r; fault->gfn = fault->addr >> PAGE_SHIFT; @@ -4003,7 +4063,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r)) return r; if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r)) @@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault else write_lock(&vcpu->kvm->mmu_lock); - if (is_page_fault_stale(vcpu, fault, mmu_seq)) + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; r = make_mmu_pages_available(vcpu); @@ -4033,7 +4093,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault read_unlock(&vcpu->kvm->mmu_lock); else write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + + if (is_private_pfn) + kvm_memfile_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); + return r; } diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 252c77805eb9..6a5736699c0a 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault int r; unsigned long mmu_seq; bool is_self_change_mapping; + bool is_private_pfn = false; + pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code); WARN_ON_ONCE(fault->is_tdp); @@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r)) return r; if 
(handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r)) @@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault r = RET_PF_RETRY; write_lock(&vcpu->kvm->mmu_lock); - if (is_page_fault_stale(vcpu, fault, mmu_seq)) + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; r = make_mmu_pages_available(vcpu); @@ -911,7 +913,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault out_unlock: write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + if (is_private_pfn) + kvm_memfile_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); return r; } From patchwork Thu Mar 10 14:09:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776425 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4833AC43217 for ; Thu, 10 Mar 2022 14:12:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236680AbiCJONm (ORCPT ); Thu, 10 Mar 2022 09:13:42 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52744 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243020AbiCJONC (ORCPT ); Thu, 10 Mar 2022 09:13:02 -0500 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C3AF015694C; Thu, 10 Mar 2022 06:11:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921480; x=1678457480; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=dJciXpEHLFj6Z+EAfDFvGxoZXvWRu/tGM+jdlgCTKlE=; b=Js2cUaBkpKSpZ2f+bQu0e7Lq8W9rIAI3MQoOOEaEh5WQaOMkMXTMoKdJ HRkqWCOWgyt7zDICbp3zC2lArsOpGhv75JsDbR80rFTaY9Pa057PQjVob vYy0CtKv5iJmts/UIK96OKSkEqEqPObvvJ1Xkt6t6ezjnNhxTHCPJA0qC U09Wngd7+C5MTe+bdJwCOngIf77CQrNBRsPUnGgsk8LXtpSmV9PJesf8i q51wtmEurz3TifUvoG6dO8/xTLoj8tlPLF9NqV0BwoyhMjDYJAOXe4vny loSSqhC8NDLAr4t85Ap37dilJ7AfKfu8yQHH4qkChqpFMJCNWQLjomSTc g==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="252823625" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="252823625" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:57 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655136" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:49 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 10/13] KVM: Register private memslot to memory backing store Date: Thu, 10 Mar 2022 22:09:08 +0800 Message-Id: <20220310140911.50924-11-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Add 'notifier' to memslot to make it a memfile_notifier node and then register it to memory backing store via memfile_register_notifier() when memslot gets created. When memslot is deleted, do the reverse with memfile_unregister_notifier(). Note each KVM memslot can be registered to different memory backing stores (or the same backing store but at different offset) independently. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 75 ++++++++++++++++++++++++++++++++++++---- 2 files changed, 70 insertions(+), 6 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6e1d770d6bf8..9b175aeca63f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -567,6 +567,7 @@ struct kvm_memory_slot { struct file *private_file; loff_t private_offset; struct memfile_pfn_ops *pfn_ops; + struct memfile_notifier notifier; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d11a2628b548..67349421eae3 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -840,6 +840,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ +#ifdef CONFIG_MEMFILE_NOTIFIER +static inline int kvm_memfile_register(struct kvm_memory_slot *slot) +{ + return memfile_register_notifier(file_inode(slot->private_file), + &slot->notifier, + &slot->pfn_ops); +} + +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot) +{ + if (slot->private_file) { + memfile_unregister_notifier(file_inode(slot->private_file), + &slot->notifier); + fput(slot->private_file); + slot->private_file = NULL; + } +} + +#else /* !CONFIG_MEMFILE_NOTIFIER */ + +static inline int kvm_memfile_register(struct kvm_memory_slot *slot) +{ + return -EOPNOTSUPP; +} + +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot) +{ +} + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state, @@ -884,6 +915,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot) /* This does not remove the slot from struct kvm_memslots data structures */ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { + if (slot->flags & KVM_MEM_PRIVATE) + kvm_memfile_unregister(slot); + kvm_destroy_dirty_bitmap(slot); kvm_arch_free_memslot(kvm, slot); @@ -1738,6 +1772,12 @@ static int kvm_set_memslot(struct kvm *kvm, kvm_invalidate_memslot(kvm, old, invalid_slot); } + if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) { + r = kvm_memfile_register(new); + if (r) + return r; + } + r = kvm_prepare_memory_region(kvm, old, new, change); if (r) { /* @@ -1752,6 +1792,10 @@ static int kvm_set_memslot(struct kvm *kvm, } else { mutex_unlock(&kvm->slots_arch_lock); } + + if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) + kvm_memfile_unregister(new); + return r; } 
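(Aside, before the remaining hunks of this patch: to make the registration flow above concrete, here is a minimal userspace sketch of creating a private memslot that would take the kvm_memfile_register() path on KVM_MR_CREATE. This is only an illustration against this series: the extended region struct is assumed to be named kvm_userspace_memory_region_ext with the private_fd/private_offset fields referenced as region_ext above, the KVM_MEM_PRIVATE value is a placeholder, and whether the plain KVM_SET_USER_MEMORY_REGION ioctl carries the extended struct is defined elsewhere in the series.)

```c
/*
 * Illustrative sketch only, not part of the patch.  Assumes the series'
 * extended memory region struct is laid out roughly as below and that the
 * series' headers export KVM_MEM_PRIVATE; neither is guaranteed to match
 * the final uAPI.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_MEM_PRIVATE
#define KVM_MEM_PRIVATE (1UL << 2)	/* assumed value */
#endif

struct kvm_userspace_memory_region_ext {	/* assumed layout */
	struct kvm_userspace_memory_region region;
	uint64_t private_offset;
	uint32_t private_fd;
	uint32_t pad[5];
};

/* Create a private memslot backed by an MFD_INACCESSIBLE memfd. */
static int add_private_slot(int vm_fd, uint32_t slot, uint64_t gpa,
			    uint64_t size, int private_fd)
{
	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot            = slot,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size     = size,
			/* userspace_addr may still map the shared variant */
		},
		.private_fd     = private_fd,
		.private_offset = 0,
	};

	/*
	 * On KVM_MR_CREATE, kvm_set_memslot() above registers the new slot
	 * to the backing store via kvm_memfile_register(); deleting the
	 * slot later takes the kvm_memfile_unregister() path.
	 */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}
```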
@@ -1817,6 +1861,7 @@ int __kvm_set_memory_region(struct kvm *kvm, enum kvm_mr_change change; unsigned long npages; gfn_t base_gfn; + struct file *file = NULL; int as_id, id; int r; @@ -1890,14 +1935,24 @@ int __kvm_set_memory_region(struct kvm *kvm, return 0; } + if (mem->flags & KVM_MEM_PRIVATE) { + file = fdget(region_ext->private_fd).file; + if (!file) + return -EINVAL; + } + if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) && - kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) - return -EEXIST; + kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) { + r = -EEXIST; + goto out; + } /* Allocate a slot that will persist in the memslot. */ new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT); - if (!new) - return -ENOMEM; + if (!new) { + r = -ENOMEM; + goto out; + } new->as_id = as_id; new->id = id; @@ -1905,10 +1960,18 @@ int __kvm_set_memory_region(struct kvm *kvm, new->npages = npages; new->flags = mem->flags; new->userspace_addr = mem->userspace_addr; + new->private_file = file; + new->private_offset = mem->flags & KVM_MEM_PRIVATE ? + region_ext->private_offset : 0; r = kvm_set_memslot(kvm, old, new, change); - if (r) - kfree(new); + if (!r) + return r; + + kfree(new); +out: + if (file) + fput(file); return r; } EXPORT_SYMBOL_GPL(__kvm_set_memory_region); From patchwork Thu Mar 10 14:09:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776427 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 192B5C433FE for ; Thu, 10 Mar 2022 14:13:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243097AbiCJOOg (ORCPT ); Thu, 10 Mar 2022 09:14:36 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52048 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243055AbiCJONj (ORCPT ); Thu, 10 Mar 2022 09:13:39 -0500 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6F9869A9B5; Thu, 10 Mar 2022 06:11:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921497; x=1678457497; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=UDNf/tndP7AKz2Uk9JPtbsxmn9QkdDXr+UJSlA/azbE=; b=O+YpcDiy1oXurDZKsyZVFJGI1pPv7DnewXMpBYDVw6YIyq1s4VN9+Ncc wLm5S4V1j2GC0AG8q8eP9PaWBn+4LP5wzVddndRBxK7h66QOH/txLZYZH 4i+DB0Jr5DEQjEhdE5xoR681vqcHUFE4UXS9i/axp+bLcPo2kQ6X032dK GONoT3dk4umm9NRp53s+Wrm46e32mqEKPyiTosJCI/N7dvOge5C9gmnZR I+8FraorLY6y5PKl6V8F1aoT5MgdKpxunySqL3sPEkvY80DJJ1ZQwDwgb tQqxsp1ui5yJB3K7Kz1AGxZoujLjAwWqKqGfZQ34aad4tDkYw/7SD4FiR w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="242702811" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="242702811" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:11:05 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655204" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:56 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, 
qemu-devel@nongnu.org
Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com
Subject: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd
Date: Thu, 10 Mar 2022 22:09:09 +0800
Message-Id: <20220310140911.50924-12-chao.p.peng@linux.intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
Precedence: bulk
List-ID: X-Mailing-List: kvm@vger.kernel.org

KVM gets notified when memory pages are changed in the memory backing
store. When userspace allocates memory with fallocate() or frees memory
with fallocate(FALLOC_FL_PUNCH_HOLE), the memory backing store calls into
KVM's fallocate/invalidate callbacks respectively. To ensure KVM never maps
both the private and shared variants of a GPA into the guest, in the
fallocate callback we zap the existing shared mapping and in the invalidate
callback we zap the existing private mapping. In the callbacks, KVM first
converts the offset range into a gfn_range and then calls the existing
kvm_unmap_gfn_range(), which zaps the shared or private mapping. Both
callbacks pass in a memslot reference, but we also need the 'kvm' pointer,
so add a reference to it in the memslot structure.

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 include/linux/kvm_host.h |  3 ++-
 virt/kvm/kvm_main.c      | 36 ++++++++++++++++++++++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9b175aeca63f..186b9b981a65 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #endif -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER) struct kvm_gfn_range { struct kvm_memory_slot *slot; gfn_t start;
@@ -568,6 +568,7 @@ struct kvm_memory_slot { loff_t private_offset; struct memfile_pfn_ops *pfn_ops; struct memfile_notifier notifier; + struct kvm *kvm; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67349421eae3..52319f49d58a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ #ifdef CONFIG_MEMFILE_NOTIFIER +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end) +{ + int idx; + struct kvm_memory_slot *slot = container_of(notifier, + struct kvm_memory_slot, + notifier); + struct kvm_gfn_range gfn_range = { + .slot = slot, + .start = start - (slot->private_offset >> PAGE_SHIFT), + .end = end - (slot->private_offset >> PAGE_SHIFT), + .may_block = true, + }; + struct kvm *kvm = slot->kvm; + + gfn_range.start = max(gfn_range.start, slot->base_gfn); + gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages); + + if (gfn_range.start >=
gfn_range.end) + return; + + idx = srcu_read_lock(&kvm->srcu); + KVM_MMU_LOCK(kvm); + kvm_unmap_gfn_range(kvm, &gfn_range); + kvm_flush_remote_tlbs(kvm); + KVM_MMU_UNLOCK(kvm); + srcu_read_unlock(&kvm->srcu, idx); +} + +static struct memfile_notifier_ops kvm_memfile_notifier_ops = { + .invalidate = kvm_memfile_notifier_handler, + .fallocate = kvm_memfile_notifier_handler, +}; + static inline int kvm_memfile_register(struct kvm_memory_slot *slot) { + slot->notifier.ops = &kvm_memfile_notifier_ops; return memfile_register_notifier(file_inode(slot->private_file), &slot->notifier, &slot->pfn_ops); @@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm, new->private_file = file; new->private_offset = mem->flags & KVM_MEM_PRIVATE ? region_ext->private_offset : 0; + new->kvm = kvm; r = kvm_set_memslot(kvm, old, new, change); if (!r) From patchwork Thu Mar 10 14:09:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776428 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 333F5C433FE for ; Thu, 10 Mar 2022 14:13:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243074AbiCJOO5 (ORCPT ); Thu, 10 Mar 2022 09:14:57 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51584 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243067AbiCJOOF (ORCPT ); Thu, 10 Mar 2022 09:14:05 -0500 Received: from mga03.intel.com (mga03.intel.com [134.134.136.65]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D41C9154704; Thu, 10 Mar 2022 06:11:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921504; x=1678457504; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=dlixjqkIQMYPAGk8N2BV+CT0PbmeRswbnhn9umPYrd0=; b=hf0NsgxOOE3+V8fImk9HNH6oRLTvXKAwPiB3LcSB1kPw97l1cWdsYeOR 5v7Q2vxcjh2LFDahb+CaIAdmG/kE6a+s2K1LG4VC0C6RP8YPVta9725Aw LEECMnQTcK/3H0aj5g3fxUaC8GszMiM6dyfa54p6tIr8pmSSuTMa8XkWP UWGAtyO3RM4ThbxSwOcBsdDhd2AtzR4Km8FUX5bq/vU2pZfhxoPx7m7wP cfFyLNPKteIalmkjnEZT6IZqtZY1fpb8EPfE8BUz4mvLhQNF/m51gbVz7 Cc+ipkFfP4n3Yc+Lu/+2vusseDxxVeDAdeL+CfHhvpmK/RbFTS+EgIywJ A==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="255203360" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="255203360" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:11:12 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655235" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:11:04 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE Date: Thu, 10 Mar 2022 22:09:10 +0800 Message-Id: <20220310140911.50924-13-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org KVM_MEM_PRIVATE is not exposed by default but architecture code can turn on it by implementing kvm_arch_private_memory_supported(). Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 24 +++++++++++++++++++----- 2 files changed, 20 insertions(+), 5 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 186b9b981a65..0150e952a131 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1432,6 +1432,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); int kvm_arch_create_vm_debugfs(struct kvm *kvm); +bool kvm_arch_private_memory_supported(struct kvm *kvm); #ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 52319f49d58a..df5311755a40 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1485,10 +1485,19 @@ static void kvm_replace_memslot(struct kvm *kvm, } } -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) +bool __weak kvm_arch_private_memory_supported(struct kvm *kvm) +{ + return false; +} + +static int check_memory_region_flags(struct kvm *kvm, + const struct kvm_userspace_memory_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; + if (kvm_arch_private_memory_supported(kvm)) + valid_flags |= KVM_MEM_PRIVATE; + #ifdef __KVM_HAVE_READONLY_MEM valid_flags |= KVM_MEM_READONLY; #endif @@ -1900,7 +1909,7 @@ int __kvm_set_memory_region(struct kvm *kvm, int as_id, id; int r; - r = check_memory_region_flags(mem); + r = check_memory_region_flags(kvm, mem); if (r) return r; @@ -1913,10 +1922,12 @@ int __kvm_set_memory_region(struct kvm *kvm, return -EINVAL; if (mem->guest_phys_addr & (PAGE_SIZE - 1)) return -EINVAL; - /* We can read the guest memory with __xxx_user() later on. */ if ((mem->userspace_addr & (PAGE_SIZE - 1)) || - (mem->userspace_addr != untagged_addr(mem->userspace_addr)) || - !access_ok((void __user *)(unsigned long)mem->userspace_addr, + (mem->userspace_addr != untagged_addr(mem->userspace_addr))) + return -EINVAL; + /* We can read the guest memory with __xxx_user() later on. */ + if (!(mem->flags & KVM_MEM_PRIVATE) && + !access_ok((void __user *)(unsigned long)mem->userspace_addr, mem->memory_size)) return -EINVAL; if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) @@ -1957,6 +1968,9 @@ int __kvm_set_memory_region(struct kvm *kvm, if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages) return -EINVAL; } else { /* Modify an existing slot. */ + /* Private memslots are immutable, they can only be deleted. 
*/ + if (mem->flags & KVM_MEM_PRIVATE) + return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || ((mem->flags ^ old->flags) & KVM_MEM_READONLY)) From patchwork Thu Mar 10 14:09:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776429 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id BF85EC43217 for ; Thu, 10 Mar 2022 14:14:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243025AbiCJOP3 (ORCPT ); Thu, 10 Mar 2022 09:15:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52280 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243104AbiCJOOl (ORCPT ); Thu, 10 Mar 2022 09:14:41 -0500 Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 316851390F8; Thu, 10 Mar 2022 06:11:56 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921516; x=1678457516; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=XcUkjLL+YpWw11P834waxgWm/NgeT+vT/ACdu5U5CAo=; b=F/sUjSpCuWk97h0sO46fD8jRuA5FWxvDd7aR3R7NaO1nE+jWvZXr0D6X 7wV4dJPL83+hSaW5X+NXupMONaBK+cElKsWFL+e8KgMo050tAAXng81hQ lF8vP+Q/dzgF4dcZBWpHsqRa8oGrIPlmFkBhAM+2KckfcgAMk7+7kE5I0 Oizkr9EPNs4pC1KYrd8PnQNMf6493I9na6tX9HBAV1LjccSeiGQ6luqTj ha3tiaEZDTNZK2PaAFfwJaX9xTmZeEBA5np1QiG01hLK02CewHfNNmUOW 7vDoveF13QNAwosp2COiEYa9HBK2oj0CUjxpSmqVBGccZjPPJ5y4ieCqP w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="235206263" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="235206263" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:11:21 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655270" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:11:12 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v5 13/13] memfd_create.2: Describe MFD_INACCESSIBLE flag Date: Thu, 10 Mar 2022 22:09:11 +0800 Message-Id: <20220310140911.50924-14-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Signed-off-by: Chao Peng --- man2/memfd_create.2 | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/man2/memfd_create.2 b/man2/memfd_create.2 index 89e9c4136..2698222ae 100644 --- a/man2/memfd_create.2 +++ b/man2/memfd_create.2 @@ -101,6 +101,19 @@ meaning that no other seals can be set on the file. .\" FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default? .\" Is it worth adding some text explaining this? .TP +.BR MFD_INACCESSIBLE +Disallow userspace access through ordinary MMU accesses via +.BR read (2), +.BR write (2) +and +.BR mmap (2). +The file size cannot be changed once initialized. +This flag cannot coexist with +.B MFD_ALLOW_SEALING +and when this flag is set, the initial set of seals will be +.B F_SEAL_SEAL, +meaning that no other seals can be set on the file. +.TP .BR MFD_HUGETLB " (since Linux 4.14)" .\" commit 749df87bd7bee5a79cef073f5d032ddb2b211de8 The anonymous file will be created in the hugetlbfs filesystem using
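
To tie the series together, the following is a hedged userspace sketch of the intended flow: create guest private memory with memfd_create(MFD_INACCESSIBLE) as described in the man page patch above, and convert ranges between shared and private by allocating or hole-punching in that fd when KVM reports KVM_EXIT_MEMORY_ERROR. It assumes the kernel headers from this series (for the kvm_run 'memory' union member and KVM_MEMORY_EXIT_FLAG_PRIVATE) and a memslot whose private_offset is zero; it is an illustration, not the authoritative usage.

```c
/*
 * Illustrative sketch against this series' headers; not authoritative.
 * Assumes a single private memslot starting at GPA 0 with private_offset 0.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/kvm.h>
#include <linux/memfd.h>

#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U	/* from patch 01 of this series */
#endif

/* Guest private memory: not readable, writable or mappable by the host. */
static int create_private_memfd(size_t size)
{
	int fd = memfd_create("guest-private", MFD_INACCESSIBLE);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {	/* size is fixed once initialized */
		close(fd);
		return -1;
	}
	return fd;
}

/* Handle KVM_EXIT_MEMORY_ERROR by converting the reported range. */
static int handle_memory_error(int private_fd, struct kvm_run *run)
{
	uint64_t offset = run->memory.gpa;	/* private_offset assumed 0 */
	uint64_t len = run->memory.size;

	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		/* Faulted as private but not yet backed: allocate it. */
		return fallocate(private_fd, 0, offset, len);

	/* Faulted as shared while a private copy exists: punch it out. */
	return fallocate(private_fd,
			 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 offset, len);
}
```

Note that the zapping of the now-stale shared or private mapping happens on the kernel side through the fallocate/invalidate notifier callbacks added in patch 11, so userspace only needs to adjust the backing store and re-enter the vCPU.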