From patchwork Thu Mar 10 14:08:59 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776386 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id EFE95C433EF for ; Thu, 10 Mar 2022 14:11:39 +0000 (UTC) Received: from localhost ([::1]:56244 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJVz-0007hy-3s for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:11:39 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35376) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUB-000580-WD for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:09:48 -0500 Received: from mga06.intel.com ([134.134.136.31]:42502) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJU9-0004d4-Ii for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:09:47 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921385; x=1678457385; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=HfTyQumQLz3rZd208DGi7M1QzPzA68CVqOu0FcAUOfs=; b=lEzIRTv+fBuighTO0sO7ivu5ohNf+B39GtwqINaVo/shdsihwAM2p+RD c8GQN2XJd01agHNfM+U9x1n4mf0HnK3TR+4d9VnEpUiz6oDFQKtaGwd3s jfIMrTWArTRalQayEGcJRS2YEaGUy6f+XRsUsVEwZN3lkrS0L7XthrgyX VjJupVvaPetXyo2psoktwJvM3zjHqx29oLEq9JyA1T5lGixDVZ01Ur8fD 7u+xv9Eq8rK2TP5Bsu9ZS8JtVxhgaBQeWJnYesXjTlc43cv150pQm+S+3 KGCN0Z0DrlknXf1cBbr4gyqHn8glja8M8KkewT0wDevb6ZDfyF+iGAbm2 w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="315975592" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="315975592" Received: from 
orsmga008.jf.intel.com ([10.7.209.65]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:09:44 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654769" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:09:36 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 01/13] mm/memfd: Introduce MFD_INACCESSIBLE flag Date: Thu, 10 Mar 2022 22:08:59 +0800 Message-Id: <20220310140911.50924-2-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=134.134.136.31; envelope-from=chao.p.peng@linux.intel.com; helo=mga06.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . 
Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: "Kirill A. Shutemov" Introduce a new memfd_create() flag indicating that the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g. read/write/mmap). The file content can, however, still be accessed indirectly via a different mechanism (e.g. KVM MMU). The flag provides the semantics required for KVM guest private memory support: a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV, but may not be accessible from host userspace. Since page migration/swapping is not yet supported for such usages, these pages are currently marked as UNMOVABLE and UNEVICTABLE, which makes them behave like long-term pinned pages. The flag cannot coexist with MFD_ALLOW_SEALING, and future sealing is also impossible for a memfd created with this flag. At this time only shmem implements this flag. Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng --- include/linux/shmem_fs.h | 7 +++++ include/uapi/linux/memfd.h | 1 + mm/memfd.c | 26 +++++++++++++++-- mm/shmem.c | 57 ++++++++++++++++++++++++++++++++++++++ 4 files changed, 88 insertions(+), 3 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index e65b80ed09e7..2dde843f28ef 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -12,6 +12,9 @@ /* inode in-kernel data */ +/* shmem extended flags */ +#define SHM_F_INACCESSIBLE 0x0001 /* prevent ordinary MMU access (e.g. 
read/write/mmap) to file content */ + struct shmem_inode_info { spinlock_t lock; unsigned int seals; /* shmem seals */ @@ -24,6 +27,7 @@ struct shmem_inode_info { struct shared_policy policy; /* NUMA memory alloc policy */ struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ + unsigned int xflags; /* shmem extended flags */ struct inode vfs_inode; }; @@ -61,6 +65,9 @@ extern struct file *shmem_file_setup(const char *name, loff_t size, unsigned long flags); extern struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned long flags); +extern struct file *shmem_file_setup_xflags(const char *name, loff_t size, + unsigned long flags, + unsigned int xflags); extern struct file *shmem_file_setup_with_mnt(struct vfsmount *mnt, const char *name, loff_t size, unsigned long flags); extern int shmem_zero_setup(struct vm_area_struct *); diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..48750474b904 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,7 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +#define MFD_INACCESSIBLE 0x0008U /* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/mm/memfd.c b/mm/memfd.c index 9f80f162791a..74d45a26cf5d 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -245,16 +245,20 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \ + MFD_INACCESSIBLE) SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) { + struct address_space *mapping; unsigned int *file_seals; + unsigned int xflags; struct file *file; int fd, error; char *name; + 
gfp_t gfp; long len; if (!(flags & MFD_HUGETLB)) { @@ -267,6 +271,10 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; } + /* Disallow sealing when MFD_INACCESSIBLE is set. */ + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING) + return -EINVAL; + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -301,8 +309,11 @@ SYSCALL_DEFINE2(memfd_create, HUGETLB_ANONHUGE_INODE, (flags >> MFD_HUGE_SHIFT) & MFD_HUGE_MASK); - } else - file = shmem_file_setup(name, 0, VM_NORESERVE); + } else { + xflags = flags & MFD_INACCESSIBLE ? SHM_F_INACCESSIBLE : 0; + file = shmem_file_setup_xflags(name, 0, VM_NORESERVE, xflags); + } + if (IS_ERR(file)) { error = PTR_ERR(file); goto err_fd; @@ -313,6 +324,15 @@ SYSCALL_DEFINE2(memfd_create, if (flags & MFD_ALLOW_SEALING) { file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; + } else if (flags & MFD_INACCESSIBLE) { + mapping = file_inode(file)->i_mapping; + gfp = mapping_gfp_mask(mapping); + gfp &= ~__GFP_MOVABLE; + mapping_set_gfp_mask(mapping, gfp); + mapping_set_unevictable(mapping); + + file_seals = memfd_file_seals_ptr(file); + *file_seals = F_SEAL_SEAL; } fd_install(fd, file); diff --git a/mm/shmem.c b/mm/shmem.c index a09b29ec2b45..9b31a7056009 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1084,6 +1084,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns, (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; + if (info->xflags & SHM_F_INACCESSIBLE) { + if (oldsize) + return -EPERM; + if (!PAGE_ALIGNED(newsize)) + return -EINVAL; + } + if (newsize != oldsize) { error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); @@ -1331,6 +1338,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; if (!total_swap_pages) goto redirty; + if (info->xflags & SHM_F_INACCESSIBLE) + goto redirty; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -2228,6 +2237,9 @@ static 
int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; + if (info->xflags & SHM_F_INACCESSIBLE) + return -EPERM; + /* arm64 - allow memory tagging on RAM-based files */ vma->vm_flags |= VM_MTE_ALLOWED; @@ -2433,6 +2445,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping, if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; } + if (unlikely(info->xflags & SHM_F_INACCESSIBLE)) + return -EPERM; ret = shmem_getpage(inode, index, pagep, SGP_WRITE); @@ -2517,6 +2531,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) end_index = i_size >> PAGE_SHIFT; if (index > end_index) break; + + /* + * inode_lock protects setting up seals as well as writes to + * i_size. Setting SHM_F_INACCESSIBLE is only allowed with + * i_size == 0. + * + * Check SHM_F_INACCESSIBLE after i_size. This effectively + * serializes read vs. setting SHM_F_INACCESSIBLE without + * taking inode_lock in the read path. + */ + if (SHMEM_I(inode)->xflags & SHM_F_INACCESSIBLE) { + error = -EPERM; + break; + } + if (index == end_index) { nr = i_size & ~PAGE_MASK; if (nr <= offset) @@ -2648,6 +2677,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->xflags & SHM_F_INACCESSIBLE) && + (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) { + error = -EINVAL; + goto out; + } + shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; @@ -4082,6 +4117,28 @@ struct file *shmem_kernel_file_setup(const char *name, loff_t size, unsigned lon return __shmem_file_setup(shm_mnt, name, size, flags, S_PRIVATE); } +/** + * shmem_file_setup_xflags - get an unlinked file living in tmpfs with + * additional xflags. 
+ * @name: name for dentry (to be seen in /proc//maps + * @size: size to be set for the file + * @flags: VM_NORESERVE suppresses pre-accounting of the entire object size + * @xflags: SHM_F_INACCESSIBLE prevents ordinary MMU access to the file content + */ + +struct file *shmem_file_setup_xflags(const char *name, loff_t size, + unsigned long flags, unsigned int xflags) +{ + struct shmem_inode_info *info; + struct file *res = __shmem_file_setup(shm_mnt, name, size, flags, 0); + + if (!IS_ERR(res)) { + info = SHMEM_I(file_inode(res)); + info->xflags = xflags & SHM_F_INACCESSIBLE; + } + return res; +} + /** * shmem_file_setup - get an unlinked file living in tmpfs * @name: name for dentry (to be seen in /proc//maps From patchwork Thu Mar 10 14:09:00 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776438 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2BB8CC433F5 for ; Thu, 10 Mar 2022 14:22:15 +0000 (UTC) Received: from localhost ([::1]:58052 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJgE-0003GD-0X for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:22:14 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35438) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUM-0005J5-GW for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:09:58 -0500 Received: from mga12.intel.com ([192.55.52.136]:3802) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUI-0004dS-RC for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:09:58 -0500 DKIM-Signature: v=1; 
a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921394; x=1678457394; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=GbR4wRPHIwdHjPFQakoEmpS8z2SMsND5ZXVQtQDRCQY=; b=KZt9Kv+cCc8+gM6D45kuzByzwGG4mvyQWp3BwKL0xW55l+Lg+YCR0Px+ e/lq/FNz/Wtr9dI4k1rGNytSnsP85gZICizmUO1JDVfqO0eVX+LYxSOrq 8+A8C/A6pfPjfWWlooIL4/xj4cc00S2S1SAK9XoCjb549/AxfuGWgQToc 3NA9pWCWRqonm38HEq2Gdck80seS1MvSmboxLmgXobf3YmCllVwshvcX5 5J8DvSsuND9f9NDPS3nSSazp2lX3c6JZ3+tXUrSj4QUeyhvAbLPl8XgOF iI8KfNmmJDCGvwuYuwG6+g6AsPq9MGJUvb084ID5IhC/xyIofnkxPPNa4 A==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="235205944" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="235205944" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:09:52 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654831" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:09:44 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 02/13] mm: Introduce memfile_notifier Date: Thu, 10 Mar 2022 22:09:00 +0800 Message-Id: <20220310140911.50924-3-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=192.55.52.136; envelope-from=chao.p.peng@linux.intel.com; helo=mga12.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_PASS=-0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action 
X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" This patch introduces the memfile_notifier facility so that existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to a third kernel component, allowing it to make use of memory bookmarked in the memory file and to get notified when pages in the memory file are allocated/invalidated. KVM will use it to treat a file descriptor as the guest memory backing store, interacting with memory file subsystems through this memfile_notifier interface. In the future there might be other consumers (e.g. VFIO with encrypted device memory). It consists of two sets of callbacks: - memfile_notifier_ops: callbacks for the memory backing store to notify KVM when memory gets allocated/invalidated. - memfile_pfn_ops: callbacks for KVM to call into the memory backing store to request memory pages for guest private memory. Userspace is in charge of the guest memory lifecycle: it first allocates pages in the memory backing store, then passes the fd to KVM and lets KVM register each memory slot with the memory backing store via memfile_register_notifier. 
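
To make the split between the two callback sets concrete, here is a minimal userspace sketch. It is not kernel code and not the implementation in this series: every demo_* and consumer_* name is invented for illustration, the real interfaces are the memfile_notifier_ops and memfile_pfn_ops structures added by this patch, and the sketch deliberately ignores locking, SRCU and page reference counting.

```c
/* Userspace model only: every demo_* name is hypothetical, not a kernel API. */
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_PAGES 16
struct demo_notifier;

/* Plays the role of memfile_notifier_ops: backing store -> consumer. */
struct demo_notifier_ops {
	void (*invalidate)(struct demo_notifier *n, size_t start, size_t end);
	void (*fallocate)(struct demo_notifier *n, size_t start, size_t end);
};

struct demo_notifier {
	struct demo_notifier *next;
	const struct demo_notifier_ops *ops;
	size_t fallocated;	/* pages reported through ->fallocate */
	size_t invalidated;	/* pages reported through ->invalidate */
};

/* Stand-in for one memory file: an allocation map plus its notifier list. */
struct demo_store {
	unsigned char allocated[DEMO_NR_PAGES];
	struct demo_notifier *notifiers;
};

/* The consumer attaches itself to the file, as KVM would per memory slot. */
static void demo_register(struct demo_store *s, struct demo_notifier *n)
{
	assert(s && n);
	n->next = s->notifiers;
	s->notifiers = n;
}

/* Mark [start, end) allocated, then tell every registered consumer. */
static void demo_fallocate(struct demo_store *s, size_t start, size_t end)
{
	struct demo_notifier *n;
	size_t i;

	assert(start <= end && end <= DEMO_NR_PAGES);
	for (i = start; i < end; i++)
		s->allocated[i] = 1;
	for (n = s->notifiers; n; n = n->next)
		if (n->ops && n->ops->fallocate)
			n->ops->fallocate(n, start, end);
}

/* Tell every registered consumer, then drop [start, end). */
static void demo_punch_hole(struct demo_store *s, size_t start, size_t end)
{
	struct demo_notifier *n;
	size_t i;

	assert(start <= end && end <= DEMO_NR_PAGES);
	for (n = s->notifiers; n; n = n->next)
		if (n->ops && n->ops->invalidate)
			n->ops->invalidate(n, start, end);
	for (i = start; i < end; i++)
		s->allocated[i] = 0;
}

/* Plays the role of memfile_pfn_ops: consumer -> backing store. */
static long demo_get_pfn(const struct demo_store *s, size_t index)
{
	if (index >= DEMO_NR_PAGES || !s->allocated[index])
		return -1;	/* nothing allocated at this offset */
	return (long)index;	/* the model reuses the index as a fake pfn */
}

static void consumer_fallocate(struct demo_notifier *n, size_t start, size_t end)
{
	n->fallocated += end - start;
}

static void consumer_invalidate(struct demo_notifier *n, size_t start, size_t end)
{
	n->invalidated += end - start;
}

static const struct demo_notifier_ops consumer_ops = {
	.invalidate = consumer_invalidate,
	.fallocate = consumer_fallocate,
};

/* Walk the lifecycle once and return the consumer's final bookkeeping. */
struct demo_notifier demo_run(void)
{
	struct demo_store store = { {0}, NULL };
	struct demo_notifier consumer = { NULL, &consumer_ops, 0, 0 };

	demo_fallocate(&store, 0, 4);		/* allocated before registration */
	demo_register(&store, &consumer);
	demo_fallocate(&store, 4, 8);		/* consumer sees 4 pages appear */
	demo_punch_hole(&store, 2, 6);		/* consumer sees 4 pages go away */
	assert(demo_get_pfn(&store, 1) == 1);	/* still allocated */
	assert(demo_get_pfn(&store, 3) == -1);	/* punched out */
	assert(demo_get_pfn(&store, 7) == 7);	/* still allocated */
	return consumer;
}
```

Calling demo_run() from a main() of your own walks the lifecycle once. In this model, pages allocated before registration generate no callback, so the consumer learns about them only by asking through demo_get_pfn(); everything that happens after registration reaches it through the notifier callbacks.
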
A supported memory backing store should maintain a memfile_notifier list and provide a routine through which memfile_notifier can obtain the list head address and the memfile_pfn_ops callbacks for memfile_register_notifier. It should also call memfile_notifier_fallocate/memfile_notifier_invalidate when the bookmarked memory gets allocated/invalidated. Co-developed-by: Kirill A. Shutemov Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng --- include/linux/memfile_notifier.h | 64 +++++++++++++++++ mm/Kconfig | 4 ++ mm/Makefile | 1 + mm/memfile_notifier.c | 114 +++++++++++++++++++++++++++++++ 4 files changed, 183 insertions(+) create mode 100644 include/linux/memfile_notifier.h create mode 100644 mm/memfile_notifier.c diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h new file mode 100644 index 000000000000..e8d400558adb --- /dev/null +++ b/include/linux/memfile_notifier.h @@ -0,0 +1,64 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMFILE_NOTIFIER_H +#define _LINUX_MEMFILE_NOTIFIER_H + +#include +#include +#include +#include + +struct memfile_notifier; + +struct memfile_notifier_ops { + void (*invalidate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); + void (*fallocate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); +}; + +struct memfile_pfn_ops { + long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order); + void (*put_unlock_pfn)(unsigned long pfn); +}; + +struct memfile_notifier { + struct list_head list; + struct memfile_notifier_ops *ops; +}; + +struct memfile_notifier_list { + struct list_head head; + spinlock_t lock; +}; + +struct memfile_backing_store { + struct list_head list; + struct memfile_pfn_ops pfn_ops; + struct memfile_notifier_list *(*get_notifier_list)(struct inode *inode); +}; + +#ifdef CONFIG_MEMFILE_NOTIFIER +/* APIs for backing stores */ +static inline void memfile_notifier_list_init(struct memfile_notifier_list *list) +{ + INIT_LIST_HEAD(&list->head); + 
spin_lock_init(&list->lock); +} + +extern void memfile_notifier_invalidate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end); +extern void memfile_notifier_fallocate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end); +extern void memfile_register_backing_store(struct memfile_backing_store *bs); +extern void memfile_unregister_backing_store(struct memfile_backing_store *bs); + +/* APIs for notifier consumers */ +extern int memfile_register_notifier(struct inode *inode, + struct memfile_notifier *notifier, + struct memfile_pfn_ops **pfn_ops); +extern void memfile_unregister_notifier(struct inode *inode, + struct memfile_notifier *notifier); + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + +#endif /* _LINUX_MEMFILE_NOTIFIER_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 3326ee3903f3..7c6b1ad3dade 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -892,6 +892,10 @@ config ANON_VMA_NAME area from being merged with adjacent virtual memory areas due to the difference in their name. +config MEMFILE_NOTIFIER + bool + select SRCU + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index 70d4309c9ce3..f628256dce0d 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -132,3 +132,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o +obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c new file mode 100644 index 000000000000..a405db56fde2 --- /dev/null +++ b/mm/memfile_notifier.c @@ -0,0 +1,114 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * linux/mm/memfile_notifier.c + * + * Copyright (C) 2022 Intel Corporation. 
+ * Chao Peng + */ + +#include +#include + +DEFINE_STATIC_SRCU(srcu); +static LIST_HEAD(backing_store_list); + +void memfile_notifier_invalidate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&srcu); + list_for_each_entry_srcu(notifier, &list->head, list, + srcu_read_lock_held(&srcu)) { + if (notifier->ops && notifier->ops->invalidate) + notifier->ops->invalidate(notifier, start, end); + } + srcu_read_unlock(&srcu, id); +} + +void memfile_notifier_fallocate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&srcu); + list_for_each_entry_srcu(notifier, &list->head, list, + srcu_read_lock_held(&srcu)) { + if (notifier->ops && notifier->ops->fallocate) + notifier->ops->fallocate(notifier, start, end); + } + srcu_read_unlock(&srcu, id); +} + +void memfile_register_backing_store(struct memfile_backing_store *bs) +{ + BUG_ON(!bs || !bs->get_notifier_list); + + list_add_tail(&bs->list, &backing_store_list); +} + +void memfile_unregister_backing_store(struct memfile_backing_store *bs) +{ + list_del(&bs->list); +} + +static int memfile_get_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops) +{ + struct memfile_backing_store *bs, *iter; + struct memfile_notifier_list *tmp; + + list_for_each_entry_safe(bs, iter, &backing_store_list, list) { + tmp = bs->get_notifier_list(inode); + if (tmp) { + *list = tmp; + if (ops) + *ops = &bs->pfn_ops; + return 0; + } + } + return -EOPNOTSUPP; +} + +int memfile_register_notifier(struct inode *inode, + struct memfile_notifier *notifier, + struct memfile_pfn_ops **pfn_ops) +{ + struct memfile_notifier_list *list; + int ret; + + if (!inode || !notifier || !pfn_ops) + return -EINVAL; + + ret = memfile_get_notifier_info(inode, &list, pfn_ops); + if (ret) + return ret; + + spin_lock(&list->lock); + 
list_add_rcu(&notifier->list, &list->head); + spin_unlock(&list->lock); + + return 0; +} +EXPORT_SYMBOL_GPL(memfile_register_notifier); + +void memfile_unregister_notifier(struct inode *inode, + struct memfile_notifier *notifier) +{ + struct memfile_notifier_list *list; + + if (!inode || !notifier) + return; + + BUG_ON(memfile_get_notifier_info(inode, &list, NULL)); + + spin_lock(&list->lock); + list_del_rcu(&notifier->list); + spin_unlock(&list->lock); + + synchronize_srcu(&srcu); +} +EXPORT_SYMBOL_GPL(memfile_unregister_notifier); From patchwork Thu Mar 10 14:09:01 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776439 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 6F2C9C433FE for ; Thu, 10 Mar 2022 14:23:54 +0000 (UTC) Received: from localhost ([::1]:60800 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJhp-0005tu-DZ for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:23:53 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35480) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUR-0005WI-Ve for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:04 -0500 Received: from mga07.intel.com ([134.134.136.100]:62813) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUP-0004eI-M8 for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:03 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921401; x=1678457401; h=from:to:cc:subject:date:message-id:in-reply-to: references; 
bh=LshKgds1U2bNEdA3l08Zy9WrXI0Fxarkx0V/cuLKyUc=; b=azREEZj2WvSF3w5S7E5ujiwkNVcuTS3IHPy1PcR6ATAXKfa3BWeQ+Nwb qFfbO+iRa5ME16b5s4jVN+a1NQ/EX5ieWQppWEvZTzOKUrFy3EyzZ5sHs YodjDaLpPvbQNSqRc+o+rY76Czi2vwZ+QRXMiZJDAsYK1DZq0K6G1DP2A 7E2yRIIgBdicB24Jcwo+LK3l2E/bDlsplluNzDBY0mvEjNz2WEhGeSLR/ S32Y9mVJUGSf5fMzXu+LhBqNva62LoorwBO6cw3I9cOi1zOnR/fFGQyID mLX2N0jlpxdKMNc3yjxOp5UVNdOYyMjBH35uWfpEGZpl1Pih6juZzAfAA g==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="318479296" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="318479296" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:00 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654872" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:09:52 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 03/13] mm/shmem: Support memfile_notifier Date: Thu, 10 Mar 2022 22:09:01 +0800 Message-Id: <20220310140911.50924-4-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=134.134.136.100; envelope-from=chao.p.peng@linux.intel.com; helo=mga07.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: 
Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" From: "Kirill A. Shutemov" shmem maintains a memfile_notifier list in the shmem_inode_info structure and implements the memfile_pfn_ops callbacks defined by memfile_notifier. It then exposes them to memfile_notifier by registering shmem_backing_store, whose get_notifier_list callback is shmem_get_notifier_list. We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be allocated by userspace for private memory. If there are no pages allocated at the offset, an error is returned so that KVM knows the memory is not private memory. Signed-off-by: Kirill A. 
Shutemov Signed-off-by: Chao Peng --- include/linux/shmem_fs.h | 4 +++ mm/shmem.c | 76 ++++++++++++++++++++++++++++++++++++++++ 2 files changed, 80 insertions(+) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 2dde843f28ef..7bb16f2d2825 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,7 @@ #include #include #include +#include /* inode in-kernel data */ @@ -28,6 +29,9 @@ struct shmem_inode_info { struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ unsigned int xflags; /* shmem extended flags */ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct memfile_notifier_list memfile_notifiers; +#endif struct inode vfs_inode; }; diff --git a/mm/shmem.c b/mm/shmem.c index 9b31a7056009..7b43e274c9a2 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -903,6 +903,28 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index) return page ? page_folio(page) : NULL; } +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end) +{ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct shmem_inode_info *info = SHMEM_I(inode); + + memfile_notifier_fallocate(&info->memfile_notifiers, start, end); +#endif +} + +static void notify_invalidate_page(struct inode *inode, struct folio *folio, + pgoff_t start, pgoff_t end) +{ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct shmem_inode_info *info = SHMEM_I(inode); + + start = max(start, folio->index); + end = min(end, folio->index + folio_nr_pages(folio)); + + memfile_notifier_invalidate(&info->memfile_notifiers, start, end); +#endif +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. 
@@ -946,6 +968,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, } index += folio_nr_pages(folio) - 1; + notify_invalidate_page(inode, folio, start, end); + if (!unfalloc || !folio_test_uptodate(folio)) truncate_inode_folio(mapping, folio); folio_unlock(folio); @@ -1019,6 +1043,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, index--; break; } + + notify_invalidate_page(inode, folio, start, end); + VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); truncate_inode_folio(mapping, folio); @@ -2279,6 +2306,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode info->flags = flags & VM_NORESERVE; INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); +#ifdef CONFIG_MEMFILE_NOTIFIER + memfile_notifier_list_init(&info->memfile_notifiers); +#endif simple_xattrs_init(&info->xattrs); cache_no_acl(inode); mapping_set_large_folios(inode->i_mapping); @@ -2802,6 +2832,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) i_size_write(inode, offset + len); inode->i_ctime = current_time(inode); + notify_fallocate(inode, start, end); undone: spin_lock(&inode->i_lock); inode->i_private = NULL; @@ -3909,6 +3940,47 @@ static struct file_system_type shmem_fs_type = { .fs_flags = FS_USERNS_MOUNT, }; +#ifdef CONFIG_MEMFILE_NOTIFIER +static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order) +{ + struct page *page; + int ret; + + ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC); + if (ret) + return ret; + + *order = thp_order(compound_head(page)); + + return page_to_pfn(page); +} + +static void shmem_put_unlock_pfn(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + + set_page_dirty(page); + unlock_page(page); + put_page(page); +} + +static struct memfile_notifier_list* shmem_get_notifier_list(struct inode 
*inode) +{ + if (!shmem_mapping(inode->i_mapping)) + return NULL; + + return &SHMEM_I(inode)->memfile_notifiers; +} + +static struct memfile_backing_store shmem_backing_store = { + .pfn_ops.get_lock_pfn = shmem_get_lock_pfn, + .pfn_ops.put_unlock_pfn = shmem_put_unlock_pfn, + .get_notifier_list = shmem_get_notifier_list, +}; +#endif /* CONFIG_MEMFILE_NOTIFIER */ + int __init shmem_init(void) { int error; @@ -3934,6 +4006,10 @@ int __init shmem_init(void) else shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ #endif + +#ifdef CONFIG_MEMFILE_NOTIFIER + memfile_register_backing_store(&shmem_backing_store); +#endif return 0; out1: From patchwork Thu Mar 10 14:09:02 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776430 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 8307FC433EF for ; Thu, 10 Mar 2022 14:14:53 +0000 (UTC) Received: from localhost ([::1]:36506 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJZ6-00052f-DH for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:14:52 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35548) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUd-0005xv-8o for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:15 -0500 Received: from mga07.intel.com ([134.134.136.100]:62837) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUb-0004sI-ID for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:14 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; 
s=Intel; t=1646921413; x=1678457413; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=s3KeBsvI7L7pbUPmXrVsTjDsdOJOMmhGXQWiylgk8+Q=; b=c/6HdbyRBf0Psz1ChU4lpMVvt783BPJKrYvLcaLdJykEeXn6ZNnYQn8h iDuw3Ylm3e9l1bQ0M+gMJnE6J0FhLVtGfg7hIlozmy9tLOFIFH8ImalHO iWJY5qPItyA4TM5dyIB+ZMbU+trctW5uc36yt3331/uooim4TEGCLG2bc 8EpW8QKa5jbLiYcL39G5UI36xn48vLElEwi9W4G5CtYSdTb/yCifdY520 LgMS7+C2OELpOexqSpreiVHB1yLvZkNYHhUqF/uV+MJSkEV9QPC0FBAZ7 2hUg1zI6j2kYqQhFUkhoo0bKRrr+989uz1CN/l25iB6k49W8ty5FXmxo8 w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="318479330" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="318479330" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:08 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654936" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:00 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK Date: Thu, 10 Mar 2022 22:09:02 +0800 Message-Id: <20220310140911.50924-5-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=134.134.136.100; envelope-from=chao.p.peng@linux.intel.com; helo=mga07.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org 
X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Since page migration / swapping is not supported yet, MFD_INACCESSIBLE memory behaves like long-term pinned pages and thus should be accounted to mm->pinned_vm and be restricted by RLIMIT_MEMLOCK. Signed-off-by: Chao Peng --- mm/shmem.c | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/mm/shmem.c b/mm/shmem.c index 7b43e274c9a2..ae46fb96494b 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -915,14 +915,17 @@ static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end) static void notify_invalidate_page(struct inode *inode, struct folio *folio, pgoff_t start, pgoff_t end) { -#ifdef CONFIG_MEMFILE_NOTIFIER struct shmem_inode_info *info = SHMEM_I(inode); +#ifdef CONFIG_MEMFILE_NOTIFIER start = max(start, folio->index); end = min(end, folio->index + folio_nr_pages(folio)); memfile_notifier_invalidate(&info->memfile_notifiers, start, end); #endif + + if (info->xflags & SHM_F_INACCESSIBLE) + atomic64_sub(end - start, &current->mm->pinned_vm); } /* @@ -2680,6 +2683,20 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence) return offset; } +static bool memlock_limited(unsigned long npages) +{ + unsigned long lock_limit; + unsigned long pinned; + + lock_limit = rlimit(RLIMIT_MEMLOCK)
>> PAGE_SHIFT; + pinned = atomic64_add_return(npages, &current->mm->pinned_vm); + if (pinned > lock_limit && !capable(CAP_IPC_LOCK)) { + atomic64_sub(npages, &current->mm->pinned_vm); + return true; + } + return false; +} + static long shmem_fallocate(struct file *file, int mode, loff_t offset, loff_t len) { @@ -2753,6 +2770,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->xflags & SHM_F_INACCESSIBLE) && + memlock_limited(end - start)) { + error = -ENOMEM; + goto out; + } + shmem_falloc.waitq = NULL; shmem_falloc.start = start; shmem_falloc.next = start; From patchwork Thu Mar 10 14:09:03 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776441 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AFDEEC433FE for ; Thu, 10 Mar 2022 14:26:27 +0000 (UTC) Received: from localhost ([::1]:34860 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJkI-0000AT-IJ for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:26:26 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35590) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUj-0005zm-Dl for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:22 -0500 Received: from mga09.intel.com ([134.134.136.24]:50575) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUf-0004sp-V3 for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:20 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921418; x=1678457418;
h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=R9N/T0uA9Epd8O0DLQkw3YWNAAFsUHVjLA2F5YkEzAE=; b=EuwlRaypq/ks3zRUGaXN33ON8xP4gcZKD0IaMHr5BHan6vcCPzwqN37y ppd66xjUSqb/Zl7XPzOjK7cqNepWNvHnextgJrajP3DlP+LCMeA7tiEAi 6N9cflqDalrBYatCuPA6fCZdSIPo/3g+pXlpTLrzmWAwmFc/nm+MnBFdR QdegJ0LxaYXLgHbVJLpda+PE6zy+sExXiYgN2fVvER9rhqB2DjpSNIYDQ r6iIe6rOJXZ4qU/fBtG1qb8TIMsqkNpkF3Qe6gI62Lvq/HzQ5NhPMVAxF 0QDrdS4s9U2WFXp3vZzUY0DFEMjifArulfG1ei4NRFH1q892KnyUNbmwj A==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="254994195" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="254994195" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:16 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554654963" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:08 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 05/13] KVM: Extend the memslot to support fd-based private memory Date: Thu, 10 Mar 2022 22:09:03 +0800 Message-Id: <20220310140911.50924-6-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=134.134.136.24; envelope-from=chao.p.peng@linux.intel.com; helo=mga09.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, RCVD_IN_MSPIKE_H3=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: 
qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Extend the memslot definition to provide fd-based private memory support by adding two new fields (private_fd/private_offset). A single memslot can then maintain memory for both shared pages and private pages. Shared pages are provided by the existing userspace_addr (hva) field and private pages are provided through the new private_fd/private_offset fields. Since private memory no longer has an 'hva', we cannot rely on get_user_pages() to get a pfn; instead we use the newly added memfile_notifier to do the same job. This new extension is indicated by a new flag KVM_MEM_PRIVATE. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- Documentation/virt/kvm/api.rst | 37 +++++++++++++++++++++++++++------- include/linux/kvm_host.h | 7 +++++++ include/uapi/linux/kvm.h | 8 ++++++++ 3 files changed, 45 insertions(+), 7 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 3acbf4d263a5..f76ac598606c 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1307,7 +1307,7 @@ yet and must be cleared on entry.
:Capability: KVM_CAP_USER_MEMORY :Architectures: all :Type: vm ioctl -:Parameters: struct kvm_userspace_memory_region (in) +:Parameters: struct kvm_userspace_memory_region(_ext) (in) :Returns: 0 on success, -1 on error :: @@ -1320,9 +1320,17 @@ yet and must be cleared on entry. __u64 userspace_addr; /* start of the userspace allocated memory */ }; + struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 padding[5]; +}; + /* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) + #define KVM_MEM_PRIVATE (1UL << 2) This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of "slot" specify the slot id and this value @@ -1353,12 +1361,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. This allows large pages in the guest to be backed by large pages in the host. -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, -to make a new slot read-only. In this case, writes to this memory will be -posted to userspace as KVM_EXIT_MMIO exits. +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region +fields. It also includes additional fields for some specific features. See +below description of flags field for more information. It's recommended to use +kvm_userspace_memory_region_ext in new userspace code. + +The flags field supports below flags: + +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. 
+ +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to + make a new slot read-only. In this case, writes to this memory will be posted + to userspace as KVM_EXIT_MMIO exits. + +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by + a file descriptor (fd) and the content of the private memory is invisible to + userspace. In this case, userspace should use private_fd/private_offset in + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to + guest. Userspace should guarantee not to map the same pfn indicated by + private_fd/private_offset to different gfns with multiple memslots. Failure to + do so may result in undefined behavior. When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 9536ffa0473b..3be8116079d4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -563,8 +563,15 @@ struct kvm_memory_slot { u32 flags; short id; u16 as_id; + struct file *private_file; + loff_t private_offset; }; +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) +{ + return slot && (slot->flags & KVM_MEM_PRIVATE); +} + static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) { return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 91a6fe4e02c0..a523d834efc8 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -103,6 +103,13 @@ struct kvm_userspace_memory_region { __u64 userspace_addr; /* start of the userspace allocated memory */ }; +struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 padding[5]; +}; + /* * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, * other bits are reserved for
kvm internal use which are defined in @@ -110,6 +117,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_PRIVATE (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { From patchwork Thu Mar 10 14:09:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776387 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 3E86EC433F5 for ; Thu, 10 Mar 2022 14:12:21 +0000 (UTC) Received: from localhost ([::1]:57898 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJWe-0000QF-DS for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:12:20 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35608) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUr-00065H-0D for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:29 -0500 Received: from mga03.intel.com ([134.134.136.65]:35090) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUo-0004xK-O2 for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:28 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921426; x=1678457426; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=aXDmGBTRlSAIbebZ1eKgmeAE+5/9H+lFwtsMHbxfleU=; b=NpAwKIFpSKCxjUC/3BiQkdsatjJed/h3nX90xj7t3osXE64se+4ruujM 5T2H0aVZ+4ZQeaSSlc9d5cNvnhREIviOoDgUdwCSK07qG/Oo3yhR72U7y QNz54Z4H5o5iUUV9R9H1lrdluM+5v5r25FUfTW1eyxRC/Jjtl5DnF880a 6571hpsb5wxBO0nA96w1zj8hJKrwHzi80Dcmm0mtio44l1eioSEYwngyE 
qEFhdtd3IynLdt14WwILoKiD8va0fSPU5Akml1TrixA9N6oAIhqIJQKob rgPKJ9KLr/sYdv0ZOl69fO6AccrLiU2M/hy86n/07U0s0JbyHFWGphGXr w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="255203145" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="255203145" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga103.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:24 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655000" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:16 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 06/13] KVM: Use kvm_userspace_memory_region_ext Date: Thu, 10 Mar 2022 22:09:04 +0800 Message-Id: <20220310140911.50924-7-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=134.134.136.65; envelope-from=chao.p.peng@linux.intel.com; helo=mga03.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . 
Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Use the new extended memslot structure kvm_userspace_memory_region_ext. The extended part (private_fd/ private_offset) will be copied from userspace only when KVM_MEM_PRIVATE is set. Internally old kvm_userspace_memory_region will still be used for places where the extended fields are not needed. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/x86.c | 12 ++++++------ include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 30 ++++++++++++++++++++---------- 3 files changed, 28 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 8c06b8204fca..1d9dbef67715 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11757,13 +11757,13 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, } for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - struct kvm_userspace_memory_region m; + struct kvm_userspace_memory_region_ext m; - m.slot = id | (i << 16); - m.flags = 0; - m.guest_phys_addr = gpa; - m.userspace_addr = hva; - m.memory_size = size; + m.region.slot = id | (i << 16); + m.region.flags = 0; + m.region.guest_phys_addr = gpa; + m.region.userspace_addr = hva; + m.region.memory_size = size; r = __kvm_set_memory_region(kvm, &m); if (r < 0) return ERR_PTR_USR(r); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3be8116079d4..c92c70174248 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1082,9 +1082,9 @@ enum kvm_mr_change { }; int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_userspace_memory_region_ext *region_ext); int __kvm_set_memory_region(struct 
kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_userspace_memory_region_ext *region_ext); void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); int kvm_arch_prepare_memory_region(struct kvm *kvm, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 69c318fdff61..d11a2628b548 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1809,8 +1809,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, * Must be called holding kvm->slots_lock for write. */ int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_userspace_memory_region_ext *region_ext) { + const struct kvm_userspace_memory_region *mem = &region_ext->region; struct kvm_memory_slot *old, *new; struct kvm_memslots *slots; enum kvm_mr_change change; @@ -1913,24 +1914,24 @@ int __kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(__kvm_set_memory_region); int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_userspace_memory_region_ext *region_ext) { int r; mutex_lock(&kvm->slots_lock); - r = __kvm_set_memory_region(kvm, mem); + r = __kvm_set_memory_region(kvm, region_ext); mutex_unlock(&kvm->slots_lock); return r; } EXPORT_SYMBOL_GPL(kvm_set_memory_region); static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, - struct kvm_userspace_memory_region *mem) + struct kvm_userspace_memory_region_ext *region_ext) { - if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) + if ((u16)region_ext->region.slot >= KVM_USER_MEM_SLOTS) return -EINVAL; - return kvm_set_memory_region(kvm, mem); + return kvm_set_memory_region(kvm, region_ext); } #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT @@ -4476,14 +4477,23 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: { - struct kvm_userspace_memory_region kvm_userspace_mem; + struct
kvm_userspace_memory_region_ext region_ext; r = -EFAULT; - if (copy_from_user(&kvm_userspace_mem, argp, - sizeof(kvm_userspace_mem))) + if (copy_from_user(&region_ext, argp, + sizeof(struct kvm_userspace_memory_region))) goto out; + if (region_ext.region.flags & KVM_MEM_PRIVATE) { + int offset = offsetof( + struct kvm_userspace_memory_region_ext, + private_offset); + if (copy_from_user(&region_ext.private_offset, + argp + offset, + sizeof(region_ext) - offset)) + goto out; + } - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); + r = kvm_vm_ioctl_set_memory_region(kvm, &region_ext); break; } case KVM_GET_DIRTY_LOG: { From patchwork Thu Mar 10 14:09:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776432 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id B35F7C433F5 for ; Thu, 10 Mar 2022 14:15:53 +0000 (UTC) Received: from localhost ([::1]:39204 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJa4-0006qd-MY for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:15:52 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35652) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUy-0006CS-Qd for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:37 -0500 Received: from mga11.intel.com ([192.55.52.93]:24075) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJUw-0004z9-EQ for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:36 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921434;
x=1678457434; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=Z4vQdKuwJGnUEOHT5u0YKIBAyF/sX/jaF7InK/l4l/o=; b=ORB6pm1Tg5a2uHP7B9VIRJOuf2Oxa3uuMi7eYg2rw4VM0v3lLtf2WObn GWehlNF5LskzaY9jlpB5czE3cpxo87ZBS53S9jg110OiIT4LJweJX+jFq OtSemtQr0B2p7RiTh4iNQnl5mPmk7PySEYhb9KqNF0t420lidSoNtdGNo 8/yxH1uZSQGFvbCpiWtIEdnShRQHkTwImWz6kM04y9zmPu7m9i3J9yxx4 F3TfldieVFny1B0uXY10c8wZoYoDWswH/3rFnk2objaPwPes2WOfwriMc dvWVmY+LZ/TbpFRmBw/H3g/yyGg1q2gr3RozRn7Jrn/xcIZgSeHGgGoTM Q==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="252823508" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="252823508" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:32 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655053" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:24 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 07/13] KVM: Add KVM_EXIT_MEMORY_ERROR exit Date: Thu, 10 Mar 2022 22:09:05 +0800 Message-Id: <20220310140911.50924-8-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=192.55.52.93; envelope-from=chao.p.peng@linux.intel.com; helo=mga11.intel.com X-Spam_score_int: -70 X-Spam_score: -7.1 X-Spam_bar: ------- X-Spam_report: (-7.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 
Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" This new KVM exit allows userspace to handle memory-related errors. It indicates that an error happened in KVM for the guest memory range [gpa, gpa+size). The flags field carries additional information that userspace needs to handle the error. Currently bit 0 is defined as 'private memory', where '1' indicates the error was caused by a private memory access and '0' indicates it was caused by a shared memory access. After private memory is enabled, KVM will use this new exit to ask userspace for shared memory <-> private memory conversions in memory encryption usage. In such usage, there are typically two kinds of memory conversion: - explicit conversion: happens when the guest explicitly calls into KVM to map a range (as private or shared); KVM then exits to userspace to do the map/unmap operations. - implicit conversion: happens in the KVM page fault handler. * If the fault is due to a private memory access, KVM exits to userspace with a shared->private conversion request when the page has not been allocated in the private memory backend. * If the fault is due to a shared memory access, KVM exits to userspace with a private->shared conversion request when the page has already been allocated in the private memory backend.
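For illustration only, the implicit-conversion rule described above could be consumed by a VMM roughly as follows. This is a minimal sketch and not part of the patch: struct memory_exit and MEMORY_EXIT_FLAG_PRIVATE are local mirrors of the proposed kvm_run.memory layout and KVM_MEMORY_EXIT_FLAG_PRIVATE, and classify_memory_exit() and describe_memory_exit() are hypothetical helper names.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the proposed KVM_MEMORY_EXIT_FLAG_PRIVATE (assumption). */
#define MEMORY_EXIT_FLAG_PRIVATE (1u << 0)

/* Local mirror of the proposed kvm_run.memory payload (assumption). */
struct memory_exit {
	uint32_t flags;
	uint32_t padding;
	uint64_t gpa;
	uint64_t size;
};

enum conversion {
	CONVERT_TO_SHARED,	/* fault came from a shared memory access */
	CONVERT_TO_PRIVATE,	/* fault came from a private memory access */
};

/* Hypothetical helper: map the exit flags to the requested conversion. */
static enum conversion classify_memory_exit(const struct memory_exit *e)
{
	if (e->flags & MEMORY_EXIT_FLAG_PRIVATE)
		return CONVERT_TO_PRIVATE;
	return CONVERT_TO_SHARED;
}

/* Hypothetical helper: format the half-open range [gpa, gpa + size). */
static int describe_memory_exit(const struct memory_exit *e, char *buf,
				size_t len)
{
	const char *target = classify_memory_exit(e) == CONVERT_TO_PRIVATE ?
			     "private" : "shared";

	return snprintf(buf, len,
			"convert [0x%" PRIx64 ", 0x%" PRIx64 ") to %s",
			e->gpa, e->gpa + e->size, target);
}
```

A VMM would presumably call classify_memory_exit() on the payload once KVM_RUN returns with the new exit reason, perform the conversion for the reported range (how it does so is outside this sketch), and then re-enter the guest so the faulting access is retried.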
Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++ include/uapi/linux/kvm.h | 9 +++++++++ 2 files changed, 31 insertions(+) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index f76ac598606c..bad550c2212b 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6216,6 +6216,28 @@ array field represents return values. The userspace should update the return values of SBI call before resuming the VCPU. For more details on RISC-V SBI spec refer, https://github.com/riscv/riscv-sbi-doc. +:: + + /* KVM_EXIT_MEMORY_ERROR */ + struct { + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; +If exit reason is KVM_EXIT_MEMORY_ERROR then it indicates that the VCPU has +encountered a memory error which is not handled by KVM kernel module and +userspace may choose to handle it. The 'flags' field indicates the memory +properties of the exit. + + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by + private memory access when the bit is set otherwise the memory error is + caused by shared memory access when the bit is clear. + +'gpa' and 'size' indicate the memory range the error occurs at. The userspace +may handle the error and return to KVM to retry the previous memory access. + :: /* Fix the size of the union. */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index a523d834efc8..9ad0c8aa0263 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -278,6 +278,7 @@ struct kvm_xen_exit { #define KVM_EXIT_X86_BUS_LOCK 33 #define KVM_EXIT_XEN 34 #define KVM_EXIT_RISCV_SBI 35 +#define KVM_EXIT_MEMORY_ERROR 36 /* For KVM_EXIT_INTERNAL_ERROR */ /* Emulate instruction failed. 
*/ @@ -495,6 +496,14 @@ struct kvm_run { unsigned long args[6]; unsigned long ret[2]; } riscv_sbi; + /* KVM_EXIT_MEMORY_ERROR */ + struct { +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; /* Fix the size of the union. */ char padding[256]; }; From patchwork Thu Mar 10 14:09:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776431 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id AA532C433FE for ; Thu, 10 Mar 2022 14:14:54 +0000 (UTC) Received: from localhost ([::1]:36706 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJZ7-0005AX-Gr for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:14:53 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35700) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJV7-0006b4-D7 for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:45 -0500 Received: from mga04.intel.com ([192.55.52.120]:7825) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJV5-0004zp-EA for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:10:45 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921443; x=1678457443; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=/0lWU+5EyoSjlOB1k9o55wX3sxsJiR5VJJRuHs8r7qY=; b=Qcsgzn9bdwIw/5HfryRjiYS45NGxuFgv+dOLMR7MPs3y36xcH0HQ403T 0dtQ/YvyeDj73XYILGr1dUweMftALEfFYIsPowwqYcdHnyIYGF9RRe++R fKhe45xTw2UcaD6PLkjPhBl1FMMKZLgYgEXACiuE+hKt9QRDRyBMVm/Qp 
xNTJrCAsLpnmMt8x/dQ7CmDAvtAPP4D+mSSmC79w3dL/pBZ9PA41wUbqS VN7rxls19ZgZBO3Ef7HNGwh7ZAaDPfkMoNbbZBothJ2ZgKnYnXCEhGsFB 95L1oRVD+pKkNVKv2qP+jK8Z+aya7+nuzqSWCRjmbi5eOIOLIqBKEmKid w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="254084996" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="254084996" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:41 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655084" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:32 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 08/13] KVM: Use memfile_pfn_ops to obtain pfn for private pages Date: Thu, 10 Mar 2022 22:09:06 +0800 Message-Id: <20220310140911.50924-9-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=192.55.52.120; envelope-from=chao.p.peng@linux.intel.com; helo=mga04.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . 
Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Private pages are not mmap-ed into userspace, so KVM cannot rely on get_user_pages() to obtain the pfn. Instead we add a memfile_pfn_ops pointer, pfn_ops, to each private memslot and use it to obtain the pfn for a gfn. To do that, KVM should convert the gfn to the offset into the fd and then call the get_lock_pfn callback. Once KVM completes its job it should call put_unlock_pfn to unlock the pfn. Note that the pfn (page) is locked between get_lock_pfn and put_unlock_pfn to ensure the pfn is valid while KVM uses it to establish the mapping in the secondary MMU page table. The pfn_ops is initialized via memfile_register_notifier from the memory backing store that provided the private_fd. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/Kconfig | 1 + include/linux/kvm_host.h | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 34 insertions(+) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index e3cbd7706136..ca7b2a6a452a 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -48,6 +48,7 @@ config KVM select SRCU select INTERVAL_TREE select HAVE_KVM_PM_NOTIFIER if PM + select MEMFILE_NOTIFIER help Support hosting fully virtualized guest machines using hardware virtualization extensions.
You will need a fairly recent diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c92c70174248..6e1d770d6bf8 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -44,6 +44,7 @@ #include #include +#include #ifndef KVM_MAX_VCPU_IDS #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS @@ -565,6 +566,7 @@ struct kvm_memory_slot { u16 as_id; struct file *private_file; loff_t private_offset; + struct memfile_pfn_ops *pfn_ops; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) @@ -915,6 +917,7 @@ static inline void kvm_irqfd_exit(void) { } #endif + int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align, struct module *module); void kvm_exit(void); @@ -2217,4 +2220,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu) /* Max number of entries allowed for each kvm dirty ring */ #define KVM_DIRTY_RING_MAX_ENTRIES 65536 +#ifdef CONFIG_MEMFILE_NOTIFIER +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, + int *order) +{ + pgoff_t index = gfn - slot->base_gfn + + (slot->private_offset >> PAGE_SHIFT); + + return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file), + index, order); +} + +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ + slot->pfn_ops->put_unlock_pfn(pfn); +} + +#else +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, + int *order) +{ + return -1; +} + +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ +} +#endif /* CONFIG_MEMFILE_NOTIFIER */ + #endif From patchwork Thu Mar 10 14:09:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776437 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher 
ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 12432C433FE for ; Thu, 10 Mar 2022 14:21:37 +0000 (UTC) Received: from localhost ([::1]:55576 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJfb-0001IN-TT for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:21:35 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35800) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJVb-0008MD-Ru for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:11:15 -0500 Received: from mga11.intel.com ([192.55.52.93]:24116) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJVZ-00053A-UP for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:11:15 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921473; x=1678457473; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=uGzkzAL8Dy+iBgRDUMsNImudpdJ03OH3v6JTl1qmo+Y=; b=OqIOEGBE4VJYWmx5Cnd1/H1LJFa21GWkzq+NCW6zk1amfqJ+HmfYtuEZ eqQ0DCGYVpeYGzHREr0nxeTolDbwLcsNd6kR+05ljp79Qk6cs2qMA4oAw ft9wUJ12B1vIuFGY2kcxtJdhvXSVegEq2FWAdrajv7+dDmXVr0kuDyL5r Vs+QZHB0kywk6ZTMldWG0lETDS/6/ZpSx1+/N3+2sbWJLLJCdTWFH7Lov WUdbHfbg5gCkxk1UlV7U9iStukI5AVSE+cJ1tnWJIqA4iL2AcxBjDjiFy 4Ts31HKq51Lxuy5DZx6fsUzzvqS5AkQAmg+kwvQ/ZTXtD//k29AB2qLNq w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="252823589" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="252823589" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:49 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655113" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:41 -0800 From: Chao Peng To: kvm@vger.kernel.org, 
linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 09/13] KVM: Handle page fault for private memory Date: Thu, 10 Mar 2022 22:09:07 +0800 Message-Id: <20220310140911.50924-10-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=192.55.52.93; envelope-from=chao.p.peng@linux.intel.com; helo=mga11.intel.com X-Spam_score_int: -70 X-Spam_score: -7.1 X-Spam_bar: ------- X-Spam_report: (-7.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_HI=-5, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" When a page fault happens for a memslot with KVM_MEM_PRIVATE, we use kvm_memfile_get_pfn(), which calls into the memfile_pfn_ops callbacks defined for each memslot to request the pfn from the memory backing store.
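The lookup starts by turning the faulting gfn into a page index inside the private fd. A minimal userspace sketch of that arithmetic, mirroring the expression in kvm_memfile_get_pfn() and assuming 4K pages; struct slot_model and slot_gfn_to_index() are hypothetical names used only for illustration, not kernel API:

```c
#include <stdint.h>

/* Hypothetical userspace stand-ins; these are not the real KVM types. */
#define PAGE_SHIFT 12	/* assumption: 4K pages */

typedef uint64_t gfn_t;

struct slot_model {
	gfn_t base_gfn;			/* first guest frame covered by the slot */
	uint64_t npages;		/* number of pages in the slot */
	uint64_t private_offset;	/* byte offset of the slot inside the private fd */
};

/*
 * Page index inside the private fd that backs @gfn. This is the same
 * arithmetic as in kvm_memfile_get_pfn(): the slot-relative page number
 * plus the page index at which the slot starts in the fd.
 */
static uint64_t slot_gfn_to_index(const struct slot_model *slot, gfn_t gfn)
{
	return gfn - slot->base_gfn + (slot->private_offset >> PAGE_SHIFT);
}
```

With base_gfn 0x100 and private_offset 0x4000, the first gfn of the slot maps to page index 4 of the fd, and gfn 0x105 maps to index 9.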
One assumption is that private pages are persistent and pre-allocated in the private memory fd (backing store), so KVM uses this information as an indicator of whether a page is private or shared (i.e. the private fd is the final source of truth as to whether or not a GPA is private). Depending on whether the access is private or shared, we take different paths: - For private access, KVM checks whether the page is already allocated in the memory backing store; if so, KVM establishes the mapping, otherwise it exits to userspace to convert the shared page to a private one. - For shared access, KVM also checks whether the page is already allocated in the memory backing store; if so, it exits to userspace to convert the private page to a shared one, otherwise the page is treated as traditional hva-based shared memory and KVM lets the existing code obtain a pfn with get_user_pages() and establish the mapping. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++-- arch/x86/kvm/mmu/paging_tmpl.h | 11 +++-- 2 files changed, 77 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 3b8da8b0745e..f04c823ea09a 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2844,6 +2844,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, if (max_level == PG_LEVEL_4K) return PG_LEVEL_4K; + if (kvm_slot_is_private(slot)) + return max_level; + host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); return min(host_level, max_level); } @@ -3890,7 +3893,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); } -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r) +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + /* + * At this time private gfn has not been supported yet. Other patch + * that enables it should change this.
+ */ + return false; +} + +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault, + bool *is_private_pfn, int *r) +{ + int order; + unsigned int flags = 0; + struct kvm_memory_slot *slot = fault->slot; + long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order); + + if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) { + if (pfn < 0) + flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE; + else { + fault->pfn = pfn; + if (slot->flags & KVM_MEM_READONLY) + fault->map_writable = false; + else + fault->map_writable = true; + + if (order == 0) + fault->max_level = PG_LEVEL_4K; + *is_private_pfn = true; + *r = RET_PF_FIXED; + return true; + } + } else { + if (pfn < 0) + return false; + + kvm_memfile_put_pfn(slot, pfn); + } + + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR; + vcpu->run->memory.flags = flags; + vcpu->run->memory.padding = 0; + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT; + vcpu->run->memory.size = PAGE_SIZE; + fault->pfn = -1; + *r = -1; + return true; +} + +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + bool *is_private_pfn, int *r) { struct kvm_memory_slot *slot = fault->slot; bool async; @@ -3924,6 +3979,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, } } + if (kvm_slot_is_private(slot) && + kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r)) + return *r == RET_PF_FIXED ? 
false : true; + async = false; fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async, fault->write, &fault->map_writable, @@ -3984,6 +4043,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu); unsigned long mmu_seq; + bool is_private_pfn = false; int r; fault->gfn = fault->addr >> PAGE_SHIFT; @@ -4003,7 +4063,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r)) return r; if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r)) @@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault else write_lock(&vcpu->kvm->mmu_lock); - if (is_page_fault_stale(vcpu, fault, mmu_seq)) + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; r = make_mmu_pages_available(vcpu); @@ -4033,7 +4093,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault read_unlock(&vcpu->kvm->mmu_lock); else write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + + if (is_private_pfn) + kvm_memfile_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); + return r; } diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 252c77805eb9..6a5736699c0a 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault int r; unsigned long mmu_seq; bool is_self_change_mapping; + bool is_private_pfn = false; + pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code); WARN_ON_ONCE(fault->is_tdp); @@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; 
smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r)) return r; if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r)) @@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault r = RET_PF_RETRY; write_lock(&vcpu->kvm->mmu_lock); - if (is_page_fault_stale(vcpu, fault, mmu_seq)) + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; r = make_mmu_pages_available(vcpu); @@ -911,7 +913,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault out_unlock: write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + if (is_private_pfn) + kvm_memfile_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); return r; } From patchwork Thu Mar 10 14:09:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776434 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 2BF2BC433F5 for ; Thu, 10 Mar 2022 14:18:59 +0000 (UTC) Received: from localhost ([::1]:47786 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJd4-0004Kp-2s for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:18:58 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35736) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJVN-0007YC-5L for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:11:01 -0500 Received: from mga07.intel.com ([134.134.136.100]:62907) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 
1nSJVL-00050k-BS for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:11:00 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921459; x=1678457459; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=dJciXpEHLFj6Z+EAfDFvGxoZXvWRu/tGM+jdlgCTKlE=; b=L/jYOUcSHvFQ0TZPweJB+lZSEl5m0ZVVEt62Zi/l9AGSqKjkqG0JSjdn Gzl3znNL9SDvWB89qXW58TantQ952rLqF+qrJENxbxsGyAiNenaIYB55B 1aEkDpWHeXOAilZxE6m/I/fV2+7OyNrbeKPCYfRV7ySe1hn2UmT98bx2R mtKgJsXN4AMUlktpsm5AvoMgOP7qtYV7lm5a9SAYzR/dIVEkEJdGhOy2O 8wf/k+fcZ2saXzQEamZpLiTRj2Sz6a0sNVxtR8HAqhm001v3RJli1P3Nu mR/lB8rtqPGExhMmoWw6CPR4ucIfaMOOi9QCA0mS3gm6yk6e91YGhBuUL w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="318479498" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="318479498" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:10:57 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655136" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:49 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 10/13] KVM: Register private memslot to memory backing store Date: Thu, 10 Mar 2022 22:09:08 +0800 Message-Id: <20220310140911.50924-11-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=134.134.136.100; envelope-from=chao.p.peng@linux.intel.com; helo=mga07.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, 
SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" Add 'notifier' to the memslot to make it a memfile_notifier node, and register it with the memory backing store via memfile_register_notifier() when the memslot is created. When the memslot is deleted, do the reverse with memfile_unregister_notifier(). Note that each KVM memslot can be registered with a different memory backing store (or with the same backing store but at a different offset) independently.
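The register/unregister pairing can be pictured with a small userspace model. This is only a sketch under simplifying assumptions: the backing store is reduced to a singly linked list of notifier nodes, and all names (store_model, memslot_model, slot_register(), slot_unregister(), store_count()) are hypothetical and do not exist in the kernel:

```c
#include <stddef.h>

/* Hypothetical model of a notifier node embedded in a memslot. */
struct notifier_model {
	struct notifier_model *next;
};

/* Hypothetical model of a backing store: just a list of registered nodes. */
struct store_model {
	struct notifier_model *head;
};

struct memslot_model {
	struct store_model *store;	/* NULL while not registered */
	unsigned long private_offset;	/* offset of this slot in the store */
	struct notifier_model notifier;	/* embedded node, as in the patch */
};

/* Called when the slot is created: link the embedded node into the store. */
static void slot_register(struct memslot_model *slot, struct store_model *store,
			  unsigned long offset)
{
	slot->store = store;
	slot->private_offset = offset;
	slot->notifier.next = store->head;
	store->head = &slot->notifier;
}

/* Called when the slot is deleted: unlink the node; a second call is a no-op. */
static void slot_unregister(struct memslot_model *slot)
{
	struct notifier_model **pp;

	if (!slot->store)
		return;

	for (pp = &slot->store->head; *pp; pp = &(*pp)->next) {
		if (*pp == &slot->notifier) {
			*pp = slot->notifier.next;
			break;
		}
	}
	slot->notifier.next = NULL;
	slot->store = NULL;
}

/* Number of slots currently registered with the store. */
static int store_count(const struct store_model *store)
{
	const struct notifier_model *n;
	int count = 0;

	for (n = store->head; n; n = n->next)
		count++;
	return count;
}
```

Two slots can be registered with the same store at different offsets, and deleting one leaves the other registered, which is the independence the note above describes.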
Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 75 ++++++++++++++++++++++++++++++++++++---- 2 files changed, 70 insertions(+), 6 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 6e1d770d6bf8..9b175aeca63f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -567,6 +567,7 @@ struct kvm_memory_slot { struct file *private_file; loff_t private_offset; struct memfile_pfn_ops *pfn_ops; + struct memfile_notifier notifier; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index d11a2628b548..67349421eae3 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -840,6 +840,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ +#ifdef CONFIG_MEMFILE_NOTIFIER +static inline int kvm_memfile_register(struct kvm_memory_slot *slot) +{ + return memfile_register_notifier(file_inode(slot->private_file), + &slot->notifier, + &slot->pfn_ops); +} + +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot) +{ + if (slot->private_file) { + memfile_unregister_notifier(file_inode(slot->private_file), + &slot->notifier); + fput(slot->private_file); + slot->private_file = NULL; + } +} + +#else /* !CONFIG_MEMFILE_NOTIFIER */ + +static inline int kvm_memfile_register(struct kvm_memory_slot *slot) +{ + return -EOPNOTSUPP; +} + +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot) +{ +} + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state, @@ -884,6 +915,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot) /* This does not remove the slot from struct kvm_memslots data structures */ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { + if (slot->flags & 
KVM_MEM_PRIVATE) + kvm_memfile_unregister(slot); + kvm_destroy_dirty_bitmap(slot); kvm_arch_free_memslot(kvm, slot); @@ -1738,6 +1772,12 @@ static int kvm_set_memslot(struct kvm *kvm, kvm_invalidate_memslot(kvm, old, invalid_slot); } + if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) { + r = kvm_memfile_register(new); + if (r) + return r; + } + r = kvm_prepare_memory_region(kvm, old, new, change); if (r) { /* @@ -1752,6 +1792,10 @@ static int kvm_set_memslot(struct kvm *kvm, } else { mutex_unlock(&kvm->slots_arch_lock); } + + if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) + kvm_memfile_unregister(new); + return r; } @@ -1817,6 +1861,7 @@ int __kvm_set_memory_region(struct kvm *kvm, enum kvm_mr_change change; unsigned long npages; gfn_t base_gfn; + struct file *file = NULL; int as_id, id; int r; @@ -1890,14 +1935,24 @@ int __kvm_set_memory_region(struct kvm *kvm, return 0; } + if (mem->flags & KVM_MEM_PRIVATE) { + file = fdget(region_ext->private_fd).file; + if (!file) + return -EINVAL; + } + if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) && - kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) - return -EEXIST; + kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) { + r = -EEXIST; + goto out; + } /* Allocate a slot that will persist in the memslot. */ new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT); - if (!new) - return -ENOMEM; + if (!new) { + r = -ENOMEM; + goto out; + } new->as_id = as_id; new->id = id; @@ -1905,10 +1960,18 @@ int __kvm_set_memory_region(struct kvm *kvm, new->npages = npages; new->flags = mem->flags; new->userspace_addr = mem->userspace_addr; + new->private_file = file; + new->private_offset = mem->flags & KVM_MEM_PRIVATE ? 
+ region_ext->private_offset : 0; r = kvm_set_memslot(kvm, old, new, change); - if (r) - kfree(new); + if (!r) + return r; + + kfree(new); +out: + if (file) + fput(file); return r; } EXPORT_SYMBOL_GPL(__kvm_set_memory_region); From patchwork Thu Mar 10 14:09:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12776442 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id D309EC433F5 for ; Thu, 10 Mar 2022 14:28:06 +0000 (UTC) Received: from localhost ([::1]:37032 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1nSJlt-0001m3-U1 for qemu-devel@archiver.kernel.org; Thu, 10 Mar 2022 09:28:05 -0500 Received: from eggs.gnu.org ([209.51.188.92]:35830) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJVj-0008TB-9z for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:11:24 -0500 Received: from mga17.intel.com ([192.55.52.151]:7593) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1nSJVh-000542-3v for qemu-devel@nongnu.org; Thu, 10 Mar 2022 09:11:23 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1646921481; x=1678457481; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=UDNf/tndP7AKz2Uk9JPtbsxmn9QkdDXr+UJSlA/azbE=; b=G4EwD3KGIAyfHruAFlFkx9e9+pW8hr/MpqfZq8JnQQIddVRAOQcvkPHS I25z18qH9j2vWFfI5QB18Qe1ux+y7Tk1WewAViTANpYHc8PDVh/a5ZIpf HOQ7x/ByB6sLyX/Trtf+wbFvtjRyJATDC3bcfb5yHcBRcIv4x0isl/f4y c+z4v2gPbjp3HQkgQTAjdopGLTsYyT0IxnxRwBAWtyFPL0zbPFP61wKqH 
YiMKlFH45ZKqEMjX5UEFqkm7EtLlhLWBXq98eHV341VAEPVgn0a0NxQyt uD3tgTJUUzVTz57xOIOdpDogenPo3/csCqm3douWT7nXK6DsKNJos+MBw w==; X-IronPort-AV: E=McAfee;i="6200,9189,10281"; a="235862653" X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="235862653" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 10 Mar 2022 06:11:05 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.90,170,1643702400"; d="scan'208";a="554655204" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 10 Mar 2022 06:10:56 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, qemu-devel@nongnu.org Subject: [PATCH v5 11/13] KVM: Zap existing KVM mappings when pages changed in the private fd Date: Thu, 10 Mar 2022 22:09:09 +0800 Message-Id: <20220310140911.50924-12-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com> References: <20220310140911.50924-1-chao.p.peng@linux.intel.com> Received-SPF: none client-ip=192.55.52.151; envelope-from=chao.p.peng@linux.intel.com; helo=mga17.intel.com X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.082, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Wanpeng Li , jun.nakajima@intel.com, david@redhat.com, "J . Bruce Fields" , dave.hansen@intel.com, "H . 
Peter Anvin" , Chao Peng , ak@linux.intel.com, Jonathan Corbet , Joerg Roedel , x86@kernel.org, Hugh Dickins , Steven Price , Ingo Molnar , "Maciej S . Szmigiero" , Borislav Petkov , luto@kernel.org, Thomas Gleixner , Vitaly Kuznetsov , Vlastimil Babka , Jim Mattson , Sean Christopherson , Jeff Layton , Yu Zhang , "Kirill A . Shutemov" , Paolo Bonzini , Andrew Morton , Vishal Annapurve , Mike Rapoport Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" KVM gets notified when memory pages change in the memory backing store. When userspace allocates memory with fallocate() or frees memory with fallocate(FALLOC_FL_PUNCH_HOLE), the memory backing store calls into the KVM fallocate or invalidate callback, respectively. To ensure KVM never maps both the private and shared variants of a GPA into the guest, the fallocate callback should zap the existing shared mapping and the invalidate callback should zap the existing private mapping. In the callbacks, KVM first converts the offset range into a gfn_range and then calls the existing kvm_unmap_gfn_range(), which zaps the shared or private mapping. Both callbacks pass in a memslot reference but we need 'kvm', so add a reference to it in the memslot structure.
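The offset-range to gfn-range step can be sketched in userspace as follows. This sketch treats the conversion as the inverse of the index computation in kvm_memfile_get_pfn(), which is an assumption made here for illustration, and it clamps in file-index space so the unsigned subtraction cannot wrap. 4K pages are assumed, and slot_model, gfn_range_model and slot_index_range_to_gfn_range() are hypothetical names, not kernel API:

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4K pages */

typedef uint64_t gfn_t;

/* Hypothetical userspace stand-ins; these are not the real KVM types. */
struct slot_model {
	gfn_t base_gfn;			/* first guest frame covered by the slot */
	uint64_t npages;		/* number of pages in the slot */
	uint64_t private_offset;	/* byte offset of the slot inside the private fd */
};

struct gfn_range_model {
	gfn_t start;	/* inclusive */
	gfn_t end;	/* exclusive */
};

/*
 * Convert the page-index range [start, end) of the private fd into the
 * gfn range it covers in @slot. Returns false when the range does not
 * overlap the slot, in which case there is nothing to zap.
 */
static bool slot_index_range_to_gfn_range(const struct slot_model *slot,
					   uint64_t start, uint64_t end,
					   struct gfn_range_model *out)
{
	uint64_t base_index = slot->private_offset >> PAGE_SHIFT;
	uint64_t slot_end = base_index + slot->npages;

	/* Clamp to the part of the file that this slot actually maps. */
	if (start < base_index)
		start = base_index;
	if (end > slot_end)
		end = slot_end;
	if (start >= end)
		return false;

	/* Inverse of: index = gfn - base_gfn + (private_offset >> PAGE_SHIFT) */
	out->start = slot->base_gfn + (start - base_index);
	out->end = slot->base_gfn + (end - base_index);
	return true;
}
```

For a slot with base_gfn 0x100, 16 pages and private_offset 0x4000, a notification for file pages [0, 8) yields the gfn range [0x100, 0x104), while a notification for pages [20, 30) lies outside the slot and is ignored.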
Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 3 ++- virt/kvm/kvm_main.c | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 9b175aeca63f..186b9b981a65 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -236,7 +236,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #endif -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER) struct kvm_gfn_range { struct kvm_memory_slot *slot; gfn_t start; @@ -568,6 +568,7 @@ struct kvm_memory_slot { loff_t private_offset; struct memfile_pfn_ops *pfn_ops; struct memfile_notifier notifier; + struct kvm *kvm; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 67349421eae3..52319f49d58a 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -841,8 +841,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ #ifdef CONFIG_MEMFILE_NOTIFIER +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end) +{ + int idx; + struct kvm_memory_slot *slot = container_of(notifier, + struct kvm_memory_slot, + notifier); + struct kvm_gfn_range gfn_range = { + .slot = slot, + .start = start - (slot->private_offset >> PAGE_SHIFT), + .end = end - (slot->private_offset >> PAGE_SHIFT), + .may_block = true, + }; + struct kvm *kvm = slot->kvm; + + gfn_range.start = max(gfn_range.start, slot->base_gfn); + gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages); + + if (gfn_range.start >= gfn_range.end) + return; + + idx = srcu_read_lock(&kvm->srcu); + KVM_MMU_LOCK(kvm); + kvm_unmap_gfn_range(kvm, &gfn_range); + kvm_flush_remote_tlbs(kvm); + KVM_MMU_UNLOCK(kvm); 
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
+static struct memfile_notifier_ops kvm_memfile_notifier_ops = {
+	.invalidate = kvm_memfile_notifier_handler,
+	.fallocate = kvm_memfile_notifier_handler,
+};
+
 static inline int kvm_memfile_register(struct kvm_memory_slot *slot)
 {
+	slot->notifier.ops = &kvm_memfile_notifier_ops;
 	return memfile_register_notifier(file_inode(slot->private_file),
 					 &slot->notifier, &slot->pfn_ops);
@@ -1963,6 +1998,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	new->private_file = file;
 	new->private_offset = mem->flags & KVM_MEM_PRIVATE ?
 			      region_ext->private_offset : 0;
+	new->kvm = kvm;
 
 	r = kvm_set_memslot(kvm, old, new, change);
 	if (!r)

From patchwork Thu Mar 10 14:09:10 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776433
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
 qemu-devel@nongnu.org
Subject: [PATCH v5 12/13] KVM: Expose KVM_MEM_PRIVATE
Date: Thu, 10 Mar 2022 22:09:10 +0800
Message-Id: <20220310140911.50924-13-chao.p.peng@linux.intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
Cc: Wanpeng Li, jun.nakajima@intel.com, david@redhat.com,
 "J . Bruce Fields", dave.hansen@intel.com, "H . Peter Anvin", Chao Peng,
 ak@linux.intel.com, Jonathan Corbet, Joerg Roedel, x86@kernel.org,
 Hugh Dickins, Steven Price, Ingo Molnar, "Maciej S . Szmigiero",
 Borislav Petkov, luto@kernel.org, Thomas Gleixner, Vitaly Kuznetsov,
 Vlastimil Babka, Jim Mattson, Sean Christopherson, Jeff Layton, Yu Zhang,
 "Kirill A . Shutemov", Paolo Bonzini, Andrew Morton, Vishal Annapurve,
 Mike Rapoport

KVM_MEM_PRIVATE is not exposed by default, but architecture code can turn
it on by implementing kvm_arch_private_memory_supported().

Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 186b9b981a65..0150e952a131 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1432,6 +1432,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_private_memory_supported(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 52319f49d58a..df5311755a40 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1485,10 +1485,19 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+bool __weak kvm_arch_private_memory_supported(struct kvm *kvm)
+{
+	return false;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_userspace_memory_region *mem)
 {
 	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (kvm_arch_private_memory_supported(kvm))
+		valid_flags |= KVM_MEM_PRIVATE;
+
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
 #endif
@@ -1900,7 +1909,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
@@ -1913,10 +1922,12 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		return -EINVAL;
 	if (mem->guest_phys_addr & (PAGE_SIZE - 1))
 		return -EINVAL;
-	/* We can read the guest memory with __xxx_user() later on. */
 	if ((mem->userspace_addr & (PAGE_SIZE - 1)) ||
-	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)) ||
-	     !access_ok((void __user *)(unsigned long)mem->userspace_addr,
+	    (mem->userspace_addr != untagged_addr(mem->userspace_addr)))
+		return -EINVAL;
+	/* We can read the guest memory with __xxx_user() later on. */
+	if (!(mem->flags & KVM_MEM_PRIVATE) &&
+	    !access_ok((void __user *)(unsigned long)mem->userspace_addr,
 			mem->memory_size))
 		return -EINVAL;
 	if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM)
@@ -1957,6 +1968,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
 		if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages)
 			return -EINVAL;
 	} else { /* Modify an existing slot. */
+		/* Private memslots are immutable, they can only be deleted. */
+		if (mem->flags & KVM_MEM_PRIVATE)
+			return -EINVAL;
 		if ((mem->userspace_addr != old->userspace_addr) ||
 		    (npages != old->npages) ||
 		    ((mem->flags ^ old->flags) & KVM_MEM_READONLY))

From patchwork Thu Mar 10 14:09:11 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12776436
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
 qemu-devel@nongnu.org
Subject: [PATCH v5 13/13] memfd_create.2: Describe MFD_INACCESSIBLE flag
Date: Thu, 10 Mar 2022 22:09:11 +0800
Message-Id: <20220310140911.50924-14-chao.p.peng@linux.intel.com>
X-Mailer: git-send-email 2.17.1
In-Reply-To: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
Cc: Wanpeng Li, jun.nakajima@intel.com, david@redhat.com,
 "J . Bruce Fields", dave.hansen@intel.com, "H . Peter Anvin", Chao Peng,
 ak@linux.intel.com, Jonathan Corbet, Joerg Roedel, x86@kernel.org,
 Hugh Dickins, Steven Price, Ingo Molnar, "Maciej S . Szmigiero",
 Borislav Petkov, luto@kernel.org, Thomas Gleixner, Vitaly Kuznetsov,
 Vlastimil Babka, Jim Mattson, Sean Christopherson, Jeff Layton, Yu Zhang,
 "Kirill A . Shutemov", Paolo Bonzini, Andrew Morton, Vishal Annapurve,
 Mike Rapoport

Signed-off-by: Chao Peng
---
 man2/memfd_create.2 | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
index 89e9c4136..2698222ae 100644
--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -101,6 +101,19 @@ meaning that no other seals can be set on the file.
 .\" FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default?
 .\" Is it worth adding some text explaining this?
 .TP
+.BR MFD_INACCESSIBLE
+Disallow userspace access through ordinary MMU accesses via
+.BR read (2),
+.BR write (2)
+and
+.BR mmap (2).
+The file size cannot be changed once initialized.
+This flag cannot coexist with
+.B MFD_ALLOW_SEALING
+and when this flag is set, the initial set of seals will be
+.B F_SEAL_SEAL,
+meaning that no other seals can be set on the file.
+.TP
 .BR MFD_HUGETLB " (since Linux 4.14)"
 .\" commit 749df87bd7bee5a79cef073f5d032ddb2b211de8
 The anonymous file will be created in the hugetlbfs filesystem using