From patchwork Tue Jan 18 13:21:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716368 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3B238C433FE for ; Tue, 18 Jan 2022 13:22:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242677AbiARNWI (ORCPT ); Tue, 18 Jan 2022 08:22:08 -0500 Received: from mga06.intel.com ([134.134.136.31]:48436 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242590AbiARNWG (ORCPT ); Tue, 18 Jan 2022 08:22:06 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512126; x=1674048126; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=4KNvuS2T6y3ZLkboDDDFll1RTzkK0ZIAxH/eOgdZ+S8=; b=OxEzfROHTJa4yDj3Q3dVTpqs1laCTYV7KkFT+BRyribEsm8RbWKRbVCt 5vjQ+1RNGsE+x+zOTcgiBRiFbPbrr68ef6cKwaJ5P7LaWDu/Q/BbS3ZfY gj34av002UecWRBRwPN/PqWgS5Fa8dYBlnDbSF1LyAo4KL1l6UNfbnAF0 ngzVrBi2RRxzpop2H6HYdv0ymunGvviI0kdZ9l9TscxY0fMZYZzIoHptL ZYKnODWI+2kwVCYFhHRixqGeTIdzALIX/Ra87KD89d00GLI1ewAESg98E +7TnsKJqxeCkBxYtBlsc6FJCNGX//p8UKX8I3ylLXgEQDG23Fee8m3OW4 Q==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="305545718" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="305545718" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:06 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791632" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:21:59 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 01/12] mm/shmem: Introduce F_SEAL_INACCESSIBLE Date: Tue, 18 Jan 2022 21:21:10 +0800 Message-Id: <20220118132121.31388-2-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: "Kirill A. Shutemov" Introduce a new seal F_SEAL_INACCESSIBLE indicating the content of the file is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this seal set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. At this time only shmem implements this seal. Signed-off-by: Kirill A. 
Shutemov Signed-off-by: Chao Peng --- include/uapi/linux/fcntl.h | 1 + mm/shmem.c | 40 ++++++++++++++++++++++++++++++++++++-- 2 files changed, 39 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..09ef34754dfa 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_INACCESSIBLE 0x0020 /* prevent ordinary MMU access (e.g. read/write/mmap) to file content */ /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/shmem.c b/mm/shmem.c index 18f93c2d68f1..72185630e7c4 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1098,6 +1098,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns, (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; + if (info->seals & F_SEAL_INACCESSIBLE) { + if(i_size_read(inode)) + return -EPERM; + if (newsize & ~PAGE_MASK) + return -EINVAL; + } + if (newsize != oldsize) { error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); @@ -1364,6 +1371,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; if (!total_swap_pages) goto redirty; + if (info->seals & F_SEAL_INACCESSIBLE) + goto redirty; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -2262,6 +2271,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; + if (info->seals & F_SEAL_INACCESSIBLE) + return -EPERM; + /* arm64 - allow memory tagging on RAM-based files */ vma->vm_flags |= VM_MTE_ALLOWED; @@ -2459,12 +2471,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping, pgoff_t index = pos >> PAGE_SHIFT; /* i_rwsem is held by caller */ - if (unlikely(info->seals & (F_SEAL_GROW | - F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) { + if (unlikely(info->seals & (F_SEAL_GROW | F_SEAL_WRITE | + F_SEAL_FUTURE_WRITE | + F_SEAL_INACCESSIBLE))) { if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) return -EPERM; if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; + if (info->seals & F_SEAL_INACCESSIBLE) + return -EPERM; } return shmem_getpage(inode, index, pagep, SGP_WRITE); @@ -2538,6 +2553,21 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) end_index = i_size >> PAGE_SHIFT; if (index > end_index) break; + + /* + * inode_lock protects setting up seals as well as write to + * i_size. Setting F_SEAL_INACCESSIBLE only allowed with + * i_size == 0. + * + * Check F_SEAL_INACCESSIBLE after i_size. It effectively + * serialize read vs. setting F_SEAL_INACCESSIBLE without + * taking inode_lock in read path. 
+ */ + if (SHMEM_I(inode)->seals & F_SEAL_INACCESSIBLE) { + error = -EPERM; + break; + } + if (index == end_index) { nr = i_size & ~PAGE_MASK; if (nr <= offset) @@ -2663,6 +2693,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->seals & F_SEAL_INACCESSIBLE) && + (offset & ~PAGE_MASK || len & ~PAGE_MASK)) { + error = -EINVAL; + goto out; + } + shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; From patchwork Tue Jan 18 13:21:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716369 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 630BAC433EF for ; Tue, 18 Jan 2022 13:22:20 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242818AbiARNWS (ORCPT ); Tue, 18 Jan 2022 08:22:18 -0500 Received: from mga04.intel.com ([192.55.52.120]:18774 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242808AbiARNWO (ORCPT ); Tue, 18 Jan 2022 08:22:14 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512134; x=1674048134; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=tGgrzrdRsYU+NqPfrWvM/m/GPitol5J3LJPXDgJxTB8=; b=I1erD7Q9ENroY+8/OPNtN1yK2NJ56XLm27mOKZ3BhwV5swFcPV6+GZoi bdmTY8iJdQD1EeUSfj7E8qj0WX22Zur1hdCyDHFshZ6f3HSc2M1x4lWxQ KC3W8HobzeaCgQNDcbwJwvNPF3FJg72a7PdDmto1NLjbDRHR93K/ppgMw BsfpP/wgbgYszypK024LjvWcVJePrJPXqPMuDuM3Rv/NVIcNBFDtFT2Aj T5jZ48Yns+AizeQ/gtPxvgdkEgTBzm4ijlB0sOwocDfR3Bc5Rm9xUjFO3 6NMzU2l/WwTiz5tFHE/wigYmyzASOTqxA9tOSlht+Fq+EYg4i4+WTxaC5 A==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="243636251" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="243636251" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:13 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791674" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:06 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 02/12] mm/memfd: Introduce MFD_INACCESSIBLE flag Date: Tue, 18 Jan 2022 21:21:11 +0800 Message-Id: <20220118132121.31388-3-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace. 
It does this by force setting F_SEAL_INACCESSIBLE seal when the file is created. It also set F_SEAL_SEAL to prevent future sealing, which means, it can not coexist with MFD_ALLOW_SEALING. The pages backed by such memfd will be used as guest private memory in confidential computing environments such as Intel TDX/AMD SEV. Since page migration/swapping is not yet supported for such usages so these pages are currently marked as UNMOVABLE and UNEVICTABLE which makes them behave like long-term pinned pages. Signed-off-by: Chao Peng --- include/uapi/linux/memfd.h | 1 + mm/memfd.c | 20 +++++++++++++++++++- 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..48750474b904 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,7 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +#define MFD_INACCESSIBLE 0x0008U /* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/mm/memfd.c b/mm/memfd.c index 9f80f162791a..26998d96dc11 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -245,16 +245,19 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \ + MFD_INACCESSIBLE) SYSCALL_DEFINE2(memfd_create, const char __user *, uname, unsigned int, flags) { + struct address_space *mapping; unsigned int *file_seals; struct file *file; int fd, error; char *name; + gfp_t gfp; long len; if (!(flags & MFD_HUGETLB)) { @@ -267,6 +270,10 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; } + /* Disallow sealing when MFD_INACCESSIBLE is set. 
*/ + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING) + return -EINVAL; + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -315,6 +322,17 @@ SYSCALL_DEFINE2(memfd_create, *file_seals &= ~F_SEAL_SEAL; } + if (flags & MFD_INACCESSIBLE) { + mapping = file_inode(file)->i_mapping; + gfp = mapping_gfp_mask(mapping); + gfp &= ~__GFP_MOVABLE; + mapping_set_gfp_mask(mapping, gfp); + mapping_set_unevictable(mapping); + + file_seals = memfd_file_seals_ptr(file); + *file_seals &= F_SEAL_SEAL | F_SEAL_INACCESSIBLE; + } + fd_install(fd, file); kfree(name); return fd; From patchwork Tue Jan 18 13:21:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716370 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 14428C433EF for ; Tue, 18 Jan 2022 13:22:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242886AbiARNWi (ORCPT ); Tue, 18 Jan 2022 08:22:38 -0500 Received: from mga09.intel.com ([134.134.136.24]:39173 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242770AbiARNWZ (ORCPT ); Tue, 18 Jan 2022 08:22:25 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512145; x=1674048145; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=2dspJZW+Fb695a/O+SFc/x1OjrhtU2wFDKh4IAY6Z+0=; b=Ur9EbbS9kRJk3v/TllxHENI9vfURgz3Avs/eDlnVR0r2RXOKEbYxWMmu 48n/YIFEEgEda8WDrF/S49UrXP/oHYHvcbo3mTRG/4IpDzrvRk/ANH0Tk IRpxw8nrMmyKGflB1wYvpCqs7rgILSPkdr3k8wt7rWrNpoYYnbhDHI1pK pN19u58VVxPcwJkwJQL6Y7Bdd/sHo/H8otp5fzSQxGj29Tzewoom3PNxL lpJ7l+IaNVQPESv4mSD2g57xZrz+/sZ8qzVHOur6ErScKDEyZCWysUF9E LlO8YWnW9PEnOtkxyeTqyGC8IE55Ivqa8Idbb45TEBZni8dFMrCSwboOt g==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="244609329" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="244609329" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:20 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791700" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:13 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 03/12] mm: Introduce memfile_notifier Date: Tue, 18 Jan 2022 21:21:12 +0800 Message-Id: <20220118132121.31388-4-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This patch introduces memfile_notifier facility so existing memory file subsystems (e.g. 
tmpfs/hugetlbfs) can provide memory pages to allow a third kernel component to make use of memory bookmarked in the memory file and gets notified when the pages in the memory file become allocated/invalidated. It will be used for KVM to use a file descriptor as the guest memory backing store and KVM will use this memfile_notifier interface to interact with memory file subsystems. In the future there might be other consumers (e.g. VFIO with encrypted device memory). It consists two sets of callbacks: - memfile_notifier_ops: callbacks for memory backing store to notify KVM when memory gets allocated/invalidated. - memfile_pfn_ops: callbacks for KVM to call into memory backing store to request memory pages for guest private memory. Userspace is in charge of guest memory lifecycle: it first allocates pages in memory backing store and then passes the fd to KVM and lets KVM register each memory slot to memory backing store via memfile_register_notifier. The supported memory backing store should maintain a memfile_notifier list and provide routine for memfile_notifier to get the list head address and memfile_pfn_ops callbacks for memfile_register_notifier. It also should call memfile_notifier_fallocate/memfile_notifier_invalidate when the bookmarked memory gets allocated/invalidated. Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng --- include/linux/memfile_notifier.h | 53 +++++++++++++++++++ mm/Kconfig | 4 ++ mm/Makefile | 1 + mm/memfile_notifier.c | 89 ++++++++++++++++++++++++++++++++ 4 files changed, 147 insertions(+) create mode 100644 include/linux/memfile_notifier.h create mode 100644 mm/memfile_notifier.c diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h new file mode 100644 index 000000000000..a03bebdd1322 --- /dev/null +++ b/include/linux/memfile_notifier.h @@ -0,0 +1,53 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMFILE_NOTIFIER_H +#define _LINUX_MEMFILE_NOTIFIER_H + +#include +#include +#include +#include + +struct memfile_notifier; + +struct memfile_notifier_ops { + void (*invalidate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); + void (*fallocate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); +}; + +struct memfile_pfn_ops { + long (*get_lock_pfn)(struct inode *inode, pgoff_t offset, int *order); + void (*put_unlock_pfn)(unsigned long pfn); +}; + +struct memfile_notifier { + struct list_head list; + struct memfile_notifier_ops *ops; +}; + +struct memfile_notifier_list { + struct list_head head; + spinlock_t lock; +}; + +#ifdef CONFIG_MEMFILE_NOTIFIER +static inline void memfile_notifier_list_init(struct memfile_notifier_list *list) +{ + INIT_LIST_HEAD(&list->head); + spin_lock_init(&list->lock); +} + +extern void memfile_notifier_invalidate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end); +extern void memfile_notifier_fallocate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end); +extern int memfile_register_notifier(struct inode *inode, + struct memfile_notifier *notifier, + struct memfile_pfn_ops **pfn_ops); +extern void memfile_unregister_notifier(struct inode *inode, + struct memfile_notifier *notifier); + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + +#endif /* _LINUX_MEMFILE_NOTIFIER_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 28edafc820ad..fa31eda3c895 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -900,6 +900,10 @@ config IO_MAPPING config SECRETMEM def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED +config MEMFILE_NOTIFIER + bool + select SRCU + source 
"mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index d6c0042e3aa0..80588f7c3bc2 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -130,3 +130,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o +obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c new file mode 100644 index 000000000000..8171d4601a04 --- /dev/null +++ b/mm/memfile_notifier.c @@ -0,0 +1,89 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * linux/mm/memfile_notifier.c + * + * Copyright (C) 2022 Intel Corporation. + * Chao Peng + */ + +#include +#include + +DEFINE_STATIC_SRCU(srcu); + +void memfile_notifier_invalidate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&srcu); + list_for_each_entry_srcu(notifier, &list->head, list, + srcu_read_lock_held(&srcu)) { + if (notifier->ops && notifier->ops->invalidate) + notifier->ops->invalidate(notifier, start, end); + } + srcu_read_unlock(&srcu, id); +} + +void memfile_notifier_fallocate(struct memfile_notifier_list *list, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&srcu); + list_for_each_entry_srcu(notifier, &list->head, list, + srcu_read_lock_held(&srcu)) { + if (notifier->ops && notifier->ops->fallocate) + notifier->ops->fallocate(notifier, start, end); + } + srcu_read_unlock(&srcu, id); +} + +static int memfile_get_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops) +{ + return -EOPNOTSUPP; +} + +int memfile_register_notifier(struct inode *inode, + struct memfile_notifier *notifier, + struct memfile_pfn_ops **pfn_ops) +{ + struct memfile_notifier_list *list; + int ret; + + if (!inode || !notifier | !pfn_ops) + return -EINVAL; + + ret = memfile_get_notifier_info(inode, &list, pfn_ops); + if (ret) + return ret; + + spin_lock(&list->lock); + list_add_rcu(¬ifier->list, &list->head); + spin_unlock(&list->lock); + + return 0; +} +EXPORT_SYMBOL_GPL(memfile_register_notifier); + +void memfile_unregister_notifier(struct inode *inode, + struct memfile_notifier *notifier) +{ + struct memfile_notifier_list *list; + + if (!inode || !notifier) + return; + + BUG_ON(memfile_get_notifier_info(inode, &list, NULL)); + + spin_lock(&list->lock); + list_del_rcu(¬ifier->list); + spin_unlock(&list->lock); + + synchronize_srcu(&srcu); +} +EXPORT_SYMBOL_GPL(memfile_unregister_notifier); From patchwork Tue Jan 18 13:21:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716374 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1E6BAC433EF for ; Tue, 18 Jan 2022 13:23:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243356AbiARNXD (ORCPT ); Tue, 18 Jan 2022 08:23:03 -0500 Received: from mga07.intel.com ([134.134.136.100]:61766 "EHLO mga07.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243250AbiARNWw (ORCPT ); Tue, 18 Jan 2022 08:22:52 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512172; 
x=1674048172; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=HZ8RoYPmQ6m9k6Lgoc+xE+GTQI3+IhGwWjUNoTgoWzo=; b=NmBFGKmpotwKOwBv+D6JKs5/omJF7hoy3fMSgu7H/EtBN0H7ndm2ehsp ozGAojo9sTYfJQHDxVE/rMDtpDciPEKaM8+MjChFLGEKrZqLDDO1kN4GV gEUzV7uJ/6wqDMsDlZhHHctraj2RWwC7b0e/JlbD0ihpOlmivJAXrz1Nm jvnPuv77pf7oFlD7UDQqj2VEr61qPKvj3I4/+98xmJrOLfSg9clLcPiVk EZspeLZ3jLTz50LqEEvJySdi1M/dTDWK6td+vs9/yN5IVTNLNRRWquy2p k0yLbJUHfvQpUy6qPC1U7JvTiQuIgcPfBP94wnd7dsLFMBcdc+rPNs6To A==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="308149127" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="308149127" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:27 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791758" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:20 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 04/12] mm/shmem: Support memfile_notifier Date: Tue, 18 Jan 2022 21:21:13 +0800 Message-Id: <20220118132121.31388-5-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org It maintains a memfile_notifier list in shmem_inode_info structure and implements memfile_pfn_ops callbacks defined by memfile_notifier. It then exposes them to memfile_notifier via shmem_get_memfile_notifier_info. We use SGP_NOALLOC in shmem_get_lock_pfn since the pages should be allocated by userspace for private memory. If there is no pages allocated at the offset then error should be returned so KVM knows that the memory is not private memory. Signed-off-by: Kirill A. 
Shutemov Signed-off-by: Chao Peng --- include/linux/shmem_fs.h | 4 ++ mm/memfile_notifier.c | 12 +++++- mm/shmem.c | 81 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 96 insertions(+), 1 deletion(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 166158b6e917..461633587eaf 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,7 @@ #include #include #include +#include /* inode in-kernel data */ @@ -24,6 +25,9 @@ struct shmem_inode_info { struct shared_policy policy; /* NUMA memory alloc policy */ struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct memfile_notifier_list memfile_notifiers; +#endif struct inode vfs_inode; }; diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c index 8171d4601a04..b4699cbf629e 100644 --- a/mm/memfile_notifier.c +++ b/mm/memfile_notifier.c @@ -41,11 +41,21 @@ void memfile_notifier_fallocate(struct memfile_notifier_list *list, srcu_read_unlock(&srcu, id); } +#ifdef CONFIG_SHMEM +extern int shmem_get_memfile_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops); +#endif + static int memfile_get_notifier_info(struct inode *inode, struct memfile_notifier_list **list, struct memfile_pfn_ops **ops) { - return -EOPNOTSUPP; + int ret = -EOPNOTSUPP; +#ifdef CONFIG_SHMEM + ret = shmem_get_memfile_notifier_info(inode, list, ops); +#endif + return ret; } int memfile_register_notifier(struct inode *inode, diff --git a/mm/shmem.c b/mm/shmem.c index 72185630e7c4..00af869d26ce 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -906,6 +906,28 @@ static bool shmem_punch_compound(struct page *page, pgoff_t start, pgoff_t end) return split_huge_page(page) >= 0; } +static void notify_fallocate(struct inode *inode, pgoff_t start, pgoff_t end) +{ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct shmem_inode_info *info = SHMEM_I(inode); + + memfile_notifier_fallocate(&info->memfile_notifiers, start, end); +#endif +} + +static void notify_invalidate_page(struct inode *inode, struct page *page, + pgoff_t start, pgoff_t end) +{ +#ifdef CONFIG_MEMFILE_NOTIFIER + struct shmem_inode_info *info = SHMEM_I(inode); + + start = max(start, page->index); + end = min(end, page->index + thp_nr_pages(page)); + + memfile_notifier_invalidate(&info->memfile_notifiers, start, end); +#endif +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. 
@@ -949,6 +971,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, } index += thp_nr_pages(page) - 1; + notify_invalidate_page(inode, page, start, end); + if (!unfalloc || !PageUptodate(page)) truncate_inode_page(mapping, page); unlock_page(page); @@ -1025,6 +1049,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, index--; break; } + + notify_invalidate_page(inode, page, start, end); + VM_BUG_ON_PAGE(PageWriteback(page), page); if (shmem_punch_compound(page, start, end)) truncate_inode_page(mapping, page); @@ -2313,6 +2340,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode info->flags = flags & VM_NORESERVE; INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); +#ifdef CONFIG_MEMFILE_NOTIFIER + memfile_notifier_list_init(&info->memfile_notifiers); +#endif simple_xattrs_init(&info->xattrs); cache_no_acl(inode); mapping_set_large_folios(inode->i_mapping); @@ -2818,6 +2848,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, if (!(mode & FALLOC_FL_KEEP_SIZE) && offset + len > inode->i_size) i_size_write(inode, offset + len); inode->i_ctime = current_time(inode); + notify_fallocate(inode, start, end); undone: spin_lock(&inode->i_lock); inode->i_private = NULL; @@ -4002,6 +4033,56 @@ struct kobj_attribute shmem_enabled_attr = __ATTR(shmem_enabled, 0644, shmem_enabled_show, shmem_enabled_store); #endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSFS */ +#ifdef CONFIG_MEMFILE_NOTIFIER +static long shmem_get_lock_pfn(struct inode *inode, pgoff_t offset, int *order) +{ + struct page *page; + int ret; + + ret = shmem_getpage(inode, offset, &page, SGP_NOALLOC); + if (ret) + return ret; + + *order = thp_order(compound_head(page)); + + return page_to_pfn(page); +} + +static void shmem_put_unlock_pfn(unsigned long pfn) +{ + struct page *page = pfn_to_page(pfn); + + VM_BUG_ON_PAGE(!PageLocked(page), page); + + set_page_dirty(page); + unlock_page(page); + put_page(page); +} + +static struct memfile_pfn_ops shmem_pfn_ops = { + .get_lock_pfn = shmem_get_lock_pfn, + .put_unlock_pfn = shmem_put_unlock_pfn, +}; + +int shmem_get_memfile_notifier_info(struct inode *inode, + struct memfile_notifier_list **list, + struct memfile_pfn_ops **ops) +{ + struct shmem_inode_info *info; + + if (!shmem_mapping(inode->i_mapping)) + return -EINVAL; + + info = SHMEM_I(inode); + *list = &info->memfile_notifiers; + if (ops) + *ops = &shmem_pfn_ops; + + return 0; +} + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + #else /* !CONFIG_SHMEM */ /* From patchwork Tue Jan 18 13:21:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716371 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A536DC433FE for ; Tue, 18 Jan 2022 13:22:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S242923AbiARNWp (ORCPT ); Tue, 18 Jan 2022 08:22:45 -0500 Received: from mga12.intel.com ([192.55.52.136]:1542 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243067AbiARNWf (ORCPT ); Tue, 18 Jan 2022 08:22:35 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512155; x=1674048155; h=from:to:cc:subject:date:message-id:in-reply-to: 
references; bh=7+c1sXkb+YZ1mlaZZMzVLTAk9/05+2xQQ6wigpDTqW0=; b=P7Fz1SZWXfY7pJCjSoYMJzwT7zc/N5iFarJDQnJMx4E52hnyV9OCJDh+ jXtW9643sl+ckOXcH4EObUEcRFnjLmhX8oiivYMycSBIFO72DUY3d5jWw sIFiIfyofPtGd3/xecpWwf70SN03VRmUFdCbqd/OFlA90wfmOwkVN+5cB xj1I7GkZez/Filt2NCMzJ2Zkl1Yo5O31O8c7mbXqLoZLLqwGM63gIhDub HOgVCTf2b7auiWM0JHA42LTW0OobQttrxFAzQP2ewvTagakWyEptyFZ1k v/AlUgLkunqk1RXbyBmr84NpEMriWrlaAgYHgopqTudTqobUwXpCMM4kk w==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="224791034" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="224791034" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:34 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791785" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:27 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 05/12] KVM: Extend the memslot to support fd-based private memory Date: Tue, 18 Jan 2022 21:21:14 +0800 Message-Id: <20220118132121.31388-6-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Extend the memslot definition to provide fd-based private memory support by adding two new fields (private_fd/private_offset). The memslot then can maintain memory for both shared pages and private pages in a single memslot. Shared pages are provided by existing userspace_addr(hva) field and private pages are provided through the new private_fd/private_offset fields. Since there is no 'hva' concept anymore for private memory so we cannot rely on get_user_pages() to get a pfn, instead we use the newly added memfile_notifier to complete the same job. This new extension is indicated by a new flag KVM_MEM_PRIVATE. 
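For illustration, a minimal userspace sketch of creating such a private memslot could look as follows. This is not part of the series: it assumes uapi headers carrying MFD_INACCESSIBLE (patch 02), KVM_MEM_PRIVATE and struct kvm_userspace_memory_region_ext (patches 05/06), and the vm_fd/slot/layout values are placeholders.

#define _GNU_SOURCE
#include <linux/kvm.h>
#include <linux/memfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Hypothetical helper: back a gpa range with a private (inaccessible) memfd
 * plus an ordinary hva for the shared pages of the same slot. */
static int add_private_slot(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
			    void *shared_hva)
{
	int priv_fd = memfd_create("guest-private",
				   MFD_CLOEXEC | MFD_INACCESSIBLE);
	if (priv_fd < 0)
		return -1;

	struct kvm_userspace_memory_region_ext ext = {
		.region = {
			.slot            = slot,
			.flags           = KVM_MEM_PRIVATE,
			.guest_phys_addr = gpa,
			.memory_size     = size,
			/* shared pages still come from an hva */
			.userspace_addr  = (__u64)shared_hva,
		},
		.private_fd     = priv_fd,
		.private_offset = 0,
	};

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}

Note that KVM_SET_USER_MEMORY_REGION itself is reused; patch 06 only copies the extended fields from userspace when KVM_MEM_PRIVATE is set.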
Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 7 +++++++ include/uapi/linux/kvm.h | 8 ++++++++ 2 files changed, 15 insertions(+) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f079820f52b5..5011ac35bc50 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -458,8 +458,15 @@ struct kvm_memory_slot { u32 flags; short id; u16 as_id; + struct file *private_file; + loff_t private_offset; }; +static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) +{ + return slot && (slot->flags & KVM_MEM_PRIVATE); +} + static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) { return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index fbfd70d965c6..5d6dceb1b93e 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -103,6 +103,13 @@ struct kvm_userspace_memory_region { __u64 userspace_addr; /* start of the userspace allocated memory */ }; +struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 padding[5]; +}; + /* * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, * other bits are reserved for kvm internal use which are defined in @@ -110,6 +117,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_PRIVATE (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { From patchwork Tue Jan 18 13:21:15 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716372 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4A8BFC433FE for ; Tue, 18 Jan 2022 13:22:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243292AbiARNWz (ORCPT ); Tue, 18 Jan 2022 08:22:55 -0500 Received: from mga01.intel.com ([192.55.52.88]:23096 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243074AbiARNWo (ORCPT ); Tue, 18 Jan 2022 08:22:44 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512164; x=1674048164; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=sOy+oj7x4sTMI5lpkFFnOHPfnLg7BWRRkoBuCyEaorQ=; b=Np55LppTq3hrMiib1ebcFrdgnNG6ARsO+QT8TrGwq2Pd50apOenwWbC9 9qM6yJ8n/sk+4uxJY8L6PgjBrq74KH4Lwqyz5VSvbLqEct0JUo1vISQ7d A/mpKy0/WF10OGJqcOK8cAN90lkeJBeZsWB9YQaOP68tbsLzLAKsTAAqa kQs0LpxoU/9XpCsXL8M7o2YKT35dnHxEsuVeuefifDydZYhDNWcEZnRZ6 gFkhH2G/sYLL2wrqTg1uiSxvMqb/g0cRNQQ07jEnmHParMyCC/WaKWVne JtEdBrNXkNbLiZiALi/sJ1nJXcXCcKThx6eoS30D5rgA4sF6yWuJ8W6j8 w==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="269193667" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="269193667" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:42 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791806" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:35 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, 
linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 06/12] KVM: Use kvm_userspace_memory_region_ext Date: Tue, 18 Jan 2022 21:21:15 +0800 Message-Id: <20220118132121.31388-7-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Use the new extended memslot structure kvm_userspace_memory_region_ext. The extended part (private_fd/ private_offset) will be copied from userspace only when KVM_MEM_PRIVATE is set. Internally old kvm_userspace_memory_region will still be used for places where the extended fields are not needed. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/x86.c | 12 ++++++------ include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 30 ++++++++++++++++++++---------- 3 files changed, 28 insertions(+), 18 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index c194a8cbd25f..7f8d87463391 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11572,13 +11572,13 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, } for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - struct kvm_userspace_memory_region m; + struct kvm_userspace_memory_region_ext m; - m.slot = id | (i << 16); - m.flags = 0; - m.guest_phys_addr = gpa; - m.userspace_addr = hva; - m.memory_size = size; + m.region.slot = id | (i << 16); + m.region.flags = 0; + m.region.guest_phys_addr = gpa; + m.region.userspace_addr = hva; + m.region.memory_size = size; r = __kvm_set_memory_region(kvm, &m); if (r < 0) return ERR_PTR_USR(r); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 5011ac35bc50..26118a45f0bb 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -977,9 +977,9 @@ enum kvm_mr_change { }; int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_userspace_memory_region_ext *region_ext); int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_userspace_memory_region_ext *region_ext); void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); int kvm_arch_prepare_memory_region(struct kvm *kvm, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 168d0ab93c88..ecf94e2548f7 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1815,8 +1815,9 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, * Must be called holding kvm->slots_lock for write. 
*/ int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_userspace_memory_region_ext *region_ext) { + const struct kvm_userspace_memory_region *mem = ®ion_ext->region; struct kvm_memory_slot *old, *new; struct kvm_memslots *slots; enum kvm_mr_change change; @@ -1919,24 +1920,24 @@ int __kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(__kvm_set_memory_region); int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_userspace_memory_region_ext *region_ext) { int r; mutex_lock(&kvm->slots_lock); - r = __kvm_set_memory_region(kvm, mem); + r = __kvm_set_memory_region(kvm, region_ext); mutex_unlock(&kvm->slots_lock); return r; } EXPORT_SYMBOL_GPL(kvm_set_memory_region); static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, - struct kvm_userspace_memory_region *mem) + struct kvm_userspace_memory_region_ext *region_ext) { - if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) + if ((u16)region_ext->region.slot >= KVM_USER_MEM_SLOTS) return -EINVAL; - return kvm_set_memory_region(kvm, mem); + return kvm_set_memory_region(kvm, region_ext); } #ifndef CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT @@ -4482,14 +4483,23 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: { - struct kvm_userspace_memory_region kvm_userspace_mem; + struct kvm_userspace_memory_region_ext region_ext; r = -EFAULT; - if (copy_from_user(&kvm_userspace_mem, argp, - sizeof(kvm_userspace_mem))) + if (copy_from_user(®ion_ext, argp, + sizeof(struct kvm_userspace_memory_region))) goto out; + if (region_ext.region.flags & KVM_MEM_PRIVATE) { + int offset = offsetof( + struct kvm_userspace_memory_region_ext, + private_offset); + if (copy_from_user(®ion_ext.private_offset, + argp + offset, + sizeof(region_ext) - offset)) + goto out; + } - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); + r = kvm_vm_ioctl_set_memory_region(kvm, ®ion_ext); break; } case KVM_GET_DIRTY_LOG: { From patchwork Tue Jan 18 13:21:16 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716375 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21233C433F5 for ; Tue, 18 Jan 2022 13:23:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243591AbiARNXF (ORCPT ); Tue, 18 Jan 2022 08:23:05 -0500 Received: from mga06.intel.com ([134.134.136.31]:48497 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243192AbiARNWt (ORCPT ); Tue, 18 Jan 2022 08:22:49 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512169; x=1674048169; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=wc/HrL4JpEiPy0hOw1D7+yU1U6huBzVxxSt7KzpRO+o=; b=B04TtwMI8Z6Q0OSb5WaB8avjVN4KkBlKNKP8mKDlAZs/VIJNh2sM7HzA 3D53A5/bmI9KxAdg0mRf/kgVQ5lJhqjTcpDztbh4fSV8ag4T+r4PFGElC 1JWdamiGyqyQfLVIoWToGuZSyKF/BJred/xV655tgtt7xx3QON44ATjPo J2Jqx9oTpP31Xn58oLEqekiR9++9GQpkwlrXCDLEUWpqn07d3P4nCVQVy AU2WlkOCwOoJaoWP5wV2LH5MM9VlLMCHeWbKCGAp2hWLG/l25RVnQiivE kSMfrj6ktoxYUCCrnsu4Xe4TWqjQ/IpjPe5GZez88B4e3J+hrSEMBFgta w==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="305545836" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="305545836" 
Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:49 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791841" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:42 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 07/12] KVM: Add KVM_EXIT_MEMORY_ERROR exit Date: Tue, 18 Jan 2022 21:21:16 +0800 Message-Id: <20220118132121.31388-8-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org This new KVM exit allows userspace to handle memory-related errors. It indicates an error happens in KVM at guest memory range [gpa, gpa+size). The flags includes additional information for userspace to handle the error. Currently bit 0 is defined as 'private memory' where '1' indicates error happens due to private memory access and '0' indicates error happens due to shared memory access. After private memory is enabled, this new exit will be used for KVM to exit to userspace for shared memory <-> private memory conversion in memory encryption usage. In such usage, typically there are two kind of memory conversions: - explicit conversion: happens when guest explicitly calls into KVM to map a range (as private or shared), KVM then exits to userspace to do the map/unmap operations. - implicit conversion: happens in KVM page fault handler. * if the fault is due to a private memory access then causes a userspace exit for a shared->private conversion request when the page has not been allocated in the private memory backend. * If the fault is due to a shared memory access then causes a userspace exit for a private->shared conversion request when the page has already been allocated in the private memory backend. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/uapi/linux/kvm.h | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 5d6dceb1b93e..52d8938a4ba1 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -278,6 +278,7 @@ struct kvm_xen_exit { #define KVM_EXIT_X86_BUS_LOCK 33 #define KVM_EXIT_XEN 34 #define KVM_EXIT_RISCV_SBI 35 +#define KVM_EXIT_MEMORY_ERROR 36 /* For KVM_EXIT_INTERNAL_ERROR */ /* Emulate instruction failed. */ @@ -495,6 +496,14 @@ struct kvm_run { unsigned long args[6]; unsigned long ret[2]; } riscv_sbi; + /* KVM_EXIT_MEMORY_ERROR */ + struct { +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; /* Fix the size of the union. 
*/ char padding[256]; }; From patchwork Tue Jan 18 13:21:17 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716373 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0464AC433F5 for ; Tue, 18 Jan 2022 13:23:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243229AbiARNXA (ORCPT ); Tue, 18 Jan 2022 08:23:00 -0500 Received: from mga04.intel.com ([192.55.52.120]:18818 "EHLO mga04.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S242933AbiARNW5 (ORCPT ); Tue, 18 Jan 2022 08:22:57 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512177; x=1674048177; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=Sxmzc3HCYf+A9G7nAq2yyNQIgWHQdtSxzJnS7/C1WdQ=; b=hL2+HxVZricn2zr/dzQ8GZAwbTkDqoyCW53TXzoIePdEQiKN4PtiJpGR /p80VcPvt2HkDJdtsqIZ0JQvWqPRY3Au6VIabRQx+n4yM1UFiFBLBFhbq 8OPryc9+ozWA5IfPDuwWaanPfLjJFWc/mYSDzZo4Z5LubpKyEU9hwmuPd I5o+tlR/qdzTb7wvBzYhZJCbi/ZWD+FG8tI8P83g+97jMo0cxfubjofcy quVx2DtH8YnTMwFoD50ebNg6uIQ6nHU+AkD0V6Y6yLfWBZTYlDs70TMyd WX6sernqwBeUuI6ep4If7+BxoNZxDPSAG/paKH54OjyqWcIa3cCz2M7vo w==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="243636356" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="243636356" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:22:56 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791861" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:49 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 08/12] KVM: Use memfile_pfn_ops to obtain pfn for private pages Date: Tue, 18 Jan 2022 21:21:17 +0800 Message-Id: <20220118132121.31388-9-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Private pages are not mmap-ed into userspace so can not reply on get_user_pages() to obtain the pfn. Instead we add a memfile_pfn_ops pointer pfn_ops in each private memslot and use it to obtain the pfn for a gfn. To do that, KVM should convert the gfn to the offset into the fd and then call get_lock_pfn callback. Once KVM completes its job it should call put_unlock_pfn to unlock the pfn. Note the pfn(page) is locked between get_lock_pfn/put_unlock_pfn to ensure pfn is valid when KVM uses it to establish the mapping in the secondary MMU page table. 
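The calling pattern is roughly the following. This is a simplified kernel-side sketch of the contract only, mirroring the kvm_memfile_get_pfn()/kvm_memfile_put_pfn() helpers added below; the function name is illustrative and the exit-to-userspace handling (patch 09) is omitted.

#include <linux/kvm_host.h>

static long example_map_private_gfn(struct kvm_memory_slot *slot, gfn_t gfn)
{
	pgoff_t index = gfn - slot->base_gfn +
			(slot->private_offset >> PAGE_SHIFT);
	int order;
	long pfn;

	pfn = slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file),
					  index, &order);
	if (pfn < 0)
		return pfn;	/* not allocated in the private fd */

	/* The page stays locked here, so the pfn remains valid while it is
	 * installed in the secondary MMU; order bounds the mapping level. */

	slot->pfn_ops->put_unlock_pfn(pfn);
	return pfn;
}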
The pfn_ops is initialized via memfile_register_notifier from the memory backing store that provided the private_fd. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/Kconfig | 1 + include/linux/kvm_host.h | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 34 insertions(+) diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index ebc8ce9ec917..5d5bebaad9e7 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -47,6 +47,7 @@ config KVM select SRCU select INTERVAL_TREE select HAVE_KVM_PM_NOTIFIER if PM + select MEMFILE_NOTIFIER help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 26118a45f0bb..927e7f44a02a 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -42,6 +42,7 @@ #include #include +#include #ifndef KVM_MAX_VCPU_IDS #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS @@ -460,6 +461,7 @@ struct kvm_memory_slot { u16 as_id; struct file *private_file; loff_t private_offset; + struct memfile_pfn_ops *pfn_ops; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) @@ -810,6 +812,7 @@ static inline void kvm_irqfd_exit(void) { } #endif + int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align, struct module *module); void kvm_exit(void); @@ -2103,4 +2106,34 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu) /* Max number of entries allowed for each kvm dirty ring */ #define KVM_DIRTY_RING_MAX_ENTRIES 65536 +#ifdef CONFIG_MEMFILE_NOTIFIER +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, + int *order) +{ + pgoff_t index = gfn - slot->base_gfn + + (slot->private_offset >> PAGE_SHIFT); + + return slot->pfn_ops->get_lock_pfn(file_inode(slot->private_file), + index, order); +} + +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ + slot->pfn_ops->put_unlock_pfn(pfn); +} + +#else +static inline long kvm_memfile_get_pfn(struct kvm_memory_slot *slot, gfn_t gfn, + int *order) +{ + return -1; +} + +static inline void kvm_memfile_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ +} +#endif /* CONFIG_MEMFILE_NOTIFIER */ + #endif From patchwork Tue Jan 18 13:21:18 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716382 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B9689C433EF for ; Tue, 18 Jan 2022 13:24:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243851AbiARNXR (ORCPT ); Tue, 18 Jan 2022 08:23:17 -0500 Received: from mga14.intel.com ([192.55.52.115]:49861 "EHLO mga14.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243476AbiARNXE (ORCPT ); Tue, 18 Jan 2022 08:23:04 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512184; x=1674048184; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=0KmAtoYc2KWHMVvjIOKZFm1pJUtWiDFa3f+h5iULyg8=; b=B/3iZAbYC0myLNjGsCEtWlXSji+ILdzMMSxbDyP9//mDDIjYJ36g733v IR+Gcqt0N2DLgN1hA0VStsFhxkVw4IB2ClyqzJagWktwQ6rkw5pMve7kv gTu5FJn5G1ATaFiq90xcI4XP4kRzOdd4LjaLbjJ9O9YFEWJloJ0XaFRsY UvTIp+fIaJQhjugU2k/DgpNe9snNl2f3GhFAe8jGxCivkSLdv+ZuJQjPz 
SsT2fu37ben6etOX/DpJ5dfwg6OzhXqIPO4iAJyKdSk2zlPuuc+ia5q2h EFCpXxg4p+ZcBaFDif6cGmByKsQ2XCn92YGqKDK4dDdchI5YbD/ajIb+Q Q==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="245007968" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="245007968" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga103.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:23:04 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791885" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:22:56 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 09/12] KVM: Handle page fault for private memory Date: Tue, 18 Jan 2022 21:21:18 +0800 Message-Id: <20220118132121.31388-10-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org When page fault happens for a memslot with KVM_MEM_PRIVATE, we use kvm_memfile_get_pfn() which further calls into memfile_pfn_ops callbacks defined for each memslot to request the pfn from the memory backing store. One assumption is that private pages are persistent and pre-allocated in the private memory fd (backing store) so KVM uses this information as an indicator for a page is private or shared (i.e. the private fd is the final source of truth as to whether or not a GPA is private). Depending on the access is private or shared, we go different paths: - For private access, KVM checks if the page is already allocated in the memory backing store, if yes KVM establishes the mapping, otherwise exits to userspace to convert a shared page to private one. - For shared access, KVM also checks if the page is already allocated in the memory backing store, if yes then exit to userspace to convert a private page to shared one, otherwise it's treated as a traditional hva-based shared memory, KVM lets existing code to obtain a pfn with get_user_pages() and establish the mapping. 
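In both exit cases userspace is expected to perform the conversion and re-enter the guest. A VMM-side sketch of that handling could look like the following; it is illustrative only: the function name, the caller-supplied slot information and the use of fallocate()/hole-punch on the private fd as the conversion mechanism are assumptions based on the design described above, not code from this series.

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/kvm.h>

/* Illustrative handler for KVM_EXIT_MEMORY_ERROR (see patch 07).  The
 * caller is assumed to have looked up the private memslot covering
 * run->memory.gpa and to pass its base gpa, private fd and private offset. */
static int handle_memory_error(struct kvm_run *run, int priv_fd,
			       __u64 slot_base_gpa, off_t priv_off)
{
	off_t off = priv_off + (run->memory.gpa - slot_base_gpa);

	if (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE)
		/* shared -> private: allocate backing in the private fd */
		return fallocate(priv_fd, 0, off, run->memory.size);

	/* private -> shared: punch a hole so the range is no longer private */
	return fallocate(priv_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 off, run->memory.size);
}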
Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/mmu/mmu.c | 73 ++++++++++++++++++++++++++++++++-- arch/x86/kvm/mmu/paging_tmpl.h | 11 +++-- 2 files changed, 77 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 1d275e9d76b5..df526ab7e657 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2873,6 +2873,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, if (max_level == PG_LEVEL_4K) return PG_LEVEL_4K; + if (kvm_slot_is_private(slot)) + return max_level; + host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); return min(host_level, max_level); } @@ -3903,7 +3906,59 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch); } -static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, int *r) +static bool kvm_vcpu_is_private_gfn(struct kvm_vcpu *vcpu, gfn_t gfn) +{ + /* + * At this time private gfn has not been supported yet. Other patch + * that enables it should change this. + */ + return false; +} + +static bool kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault, + bool *is_private_pfn, int *r) +{ + int order; + unsigned int flags = 0; + struct kvm_memory_slot *slot = fault->slot; + long pfn = kvm_memfile_get_pfn(slot, fault->gfn, &order); + + if (kvm_vcpu_is_private_gfn(vcpu, fault->addr >> PAGE_SHIFT)) { + if (pfn < 0) + flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE; + else { + fault->pfn = pfn; + if (slot->flags & KVM_MEM_READONLY) + fault->map_writable = false; + else + fault->map_writable = true; + + if (order == 0) + fault->max_level = PG_LEVEL_4K; + *is_private_pfn = true; + *r = RET_PF_FIXED; + return true; + } + } else { + if (pfn < 0) + return false; + + kvm_memfile_put_pfn(slot, pfn); + } + + vcpu->run->exit_reason = KVM_EXIT_MEMORY_ERROR; + vcpu->run->memory.flags = flags; + vcpu->run->memory.padding = 0; + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT; + vcpu->run->memory.size = PAGE_SIZE; + fault->pfn = -1; + *r = -1; + return true; +} + +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, + bool *is_private_pfn, int *r) { struct kvm_memory_slot *slot = fault->slot; bool async; @@ -3937,6 +3992,10 @@ static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault, } } + if (kvm_slot_is_private(slot) && + kvm_faultin_pfn_private(vcpu, fault, is_private_pfn, r)) + return *r == RET_PF_FIXED ? 
false : true; + async = false; fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async, fault->write, &fault->map_writable, @@ -3997,6 +4056,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault bool is_tdp_mmu_fault = is_tdp_mmu(vcpu->arch.mmu); unsigned long mmu_seq; + bool is_private_pfn = false; int r; fault->gfn = fault->addr >> PAGE_SHIFT; @@ -4016,7 +4076,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r)) return r; if (handle_abnormal_pfn(vcpu, fault, ACC_ALL, &r)) @@ -4029,7 +4089,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault else write_lock(&vcpu->kvm->mmu_lock); - if (is_page_fault_stale(vcpu, fault, mmu_seq)) + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; r = make_mmu_pages_available(vcpu); @@ -4046,7 +4106,12 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault read_unlock(&vcpu->kvm->mmu_lock); else write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + + if (is_private_pfn) + kvm_memfile_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); + return r; } diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 5b5bdac97c7b..a1d26b50a5ec 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -825,6 +825,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault int r; unsigned long mmu_seq; bool is_self_change_mapping; + bool is_private_pfn = false; + pgprintk("%s: addr %lx err %x\n", __func__, fault->addr, fault->error_code); WARN_ON_ONCE(fault->is_tdp); @@ -873,7 +875,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault mmu_seq = vcpu->kvm->mmu_notifier_seq; smp_rmb(); - if (kvm_faultin_pfn(vcpu, fault, &r)) + if (kvm_faultin_pfn(vcpu, fault, &is_private_pfn, &r)) return r; if (handle_abnormal_pfn(vcpu, fault, walker.pte_access, &r)) @@ -901,7 +903,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault r = RET_PF_RETRY; write_lock(&vcpu->kvm->mmu_lock); - if (is_page_fault_stale(vcpu, fault, mmu_seq)) + if (!is_private_pfn && is_page_fault_stale(vcpu, fault, mmu_seq)) goto out_unlock; kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT); @@ -913,7 +915,10 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault out_unlock: write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + if (is_private_pfn) + kvm_memfile_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); return r; } From patchwork Tue Jan 18 13:21:19 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716376 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AC785C433F5 for ; Tue, 18 Jan 2022 13:23:41 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243507AbiARNXj (ORCPT ); Tue, 18 Jan 2022 08:23:39 -0500 Received: from mga02.intel.com ([134.134.136.20]:57296 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP 
id S234907AbiARNX1 (ORCPT ); Tue, 18 Jan 2022 08:23:27 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512207; x=1674048207; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=EeAqQ9F4s9hLs6erZUPGg6L9PNHtGyg9c724TwmrBc4=; b=mdinzFsq+W20Dg0jbxVVwfpOX8filBTzKM6zox39NcWKLGdbW1bfPAu3 SqGxfVEyWHsxtZ/NEIxuHG/hUO3SSnED2/51f843XEywSNfWp1FsX6I+8 HU0mFwnNBmrc/U4qcchX84LIdP7GLcHMXvz1ozLSPb3dxCPOp33UEsQY6 Z4v6ejA0Fv/0uMwQnTWGCCiv6SbYhzqflXj5AmKPNms0LkJRwu/Y1Bycu /3RJcQTsNa4CI1LL/MXSwef3Imm9+ytrYyMtkgd9qupWVy4rAJAeoCevI wUQ691jwTv1a0Nu5QsMequjLgcDdjsOLS7QKKI1Ig/hRTq5+Zw3oHzkzt g==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="232172020" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="232172020" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:23:11 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791917" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:23:04 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 10/12] KVM: Register private memslot to memory backing store Date: Tue, 18 Jan 2022 21:21:19 +0800 Message-Id: <20220118132121.31388-11-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org Add 'notifier' to memslot to make it a memfile_notifier node and then register it to memory backing store via memfile_register_notifier() when memslot gets created. When memslot is deleted, do the reverse with memfile_unregister_notifier(). Note each KVM memslot can be registered to different memory backing stores (or the same backing store but at different offset) independently. 
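For context, here is a hypothetical VMM-side sketch of the sequence that triggers this registration: create a memfd, seal it with F_SEAL_INACCESSIBLE (patch 01), and hand it to KVM as a private memslot. The layout of struct kvm_userspace_memory_region_ext (region + private_offset + private_fd) and the KVM_MEM_PRIVATE flag are assumed to come from the patched uapi headers of this series and may differ in detail; error handling is omitted.

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/kvm.h>

#ifndef F_SEAL_INACCESSIBLE
#define F_SEAL_INACCESSIBLE 0x0020	/* value from patch 01 of this series */
#endif

static int add_private_slot(int vm_fd, __u64 gpa, __u64 size)
{
	struct kvm_userspace_memory_region_ext ext;	/* assumed from this series */
	int memfd = memfd_create("guest-private-mem", MFD_ALLOW_SEALING);

	/*
	 * Per patch 01 a sealed file can only be sized once, from 0 to a
	 * page-aligned size, so seal first and then grow it.
	 */
	fcntl(memfd, F_ADD_SEALS, F_SEAL_INACCESSIBLE);
	ftruncate(memfd, size);

	memset(&ext, 0, sizeof(ext));
	ext.region.slot = 0;
	ext.region.flags = KVM_MEM_PRIVATE;	/* exposed by patch 12 */
	ext.region.guest_phys_addr = gpa;
	ext.region.memory_size = size;
	ext.private_fd = memfd;
	ext.private_offset = 0;

	/*
	 * KVM fdgets private_fd and, on KVM_MR_CREATE, registers the slot
	 * with the backing store via memfile_register_notifier().
	 */
	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &ext);
}

Deleting the slot later takes the reverse path through kvm_memfile_unregister(), which drops the notifier and the file reference.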
Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 75 ++++++++++++++++++++++++++++++++++++---- 2 files changed, 70 insertions(+), 6 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 927e7f44a02a..667efe839767 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -462,6 +462,7 @@ struct kvm_memory_slot { struct file *private_file; loff_t private_offset; struct memfile_pfn_ops *pfn_ops; + struct memfile_notifier notifier; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ecf94e2548f7..6b78ddef7880 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -846,6 +846,37 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ +#ifdef CONFIG_MEMFILE_NOTIFIER +static inline int kvm_memfile_register(struct kvm_memory_slot *slot) +{ + return memfile_register_notifier(file_inode(slot->private_file), + &slot->notifier, + &slot->pfn_ops); +} + +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot) +{ + if (slot->private_file) { + memfile_unregister_notifier(file_inode(slot->private_file), + &slot->notifier); + fput(slot->private_file); + slot->private_file = NULL; + } +} + +#else /* !CONFIG_MEMFILE_NOTIFIER */ + +static inline int kvm_memfile_register(struct kvm_memory_slot *slot) +{ + return -EOPNOTSUPP; +} + +static inline void kvm_memfile_unregister(struct kvm_memory_slot *slot) +{ +} + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state, @@ -890,6 +921,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot) /* This does not remove the slot from struct kvm_memslots data structures */ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { + if (slot->flags & KVM_MEM_PRIVATE) + kvm_memfile_unregister(slot); + kvm_destroy_dirty_bitmap(slot); kvm_arch_free_memslot(kvm, slot); @@ -1744,6 +1778,12 @@ static int kvm_set_memslot(struct kvm *kvm, kvm_invalidate_memslot(kvm, old, invalid_slot); } + if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) { + r = kvm_memfile_register(new); + if (r) + return r; + } + r = kvm_prepare_memory_region(kvm, old, new, change); if (r) { /* @@ -1758,6 +1798,10 @@ static int kvm_set_memslot(struct kvm *kvm, } else { mutex_unlock(&kvm->slots_arch_lock); } + + if (new->flags & KVM_MEM_PRIVATE && change == KVM_MR_CREATE) + kvm_memfile_unregister(new); + return r; } @@ -1823,6 +1867,7 @@ int __kvm_set_memory_region(struct kvm *kvm, enum kvm_mr_change change; unsigned long npages; gfn_t base_gfn; + struct file *file = NULL; int as_id, id; int r; @@ -1896,14 +1941,24 @@ int __kvm_set_memory_region(struct kvm *kvm, return 0; } + if (mem->flags & KVM_MEM_PRIVATE) { + file = fdget(region_ext->private_fd).file; + if (!file) + return -EINVAL; + } + if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) && - kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) - return -EEXIST; + kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) { + r = -EEXIST; + goto out; + } /* Allocate a slot that will persist in the memslot. 
*/ new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT); - if (!new) - return -ENOMEM; + if (!new) { + r = -ENOMEM; + goto out; + } new->as_id = as_id; new->id = id; @@ -1911,10 +1966,18 @@ int __kvm_set_memory_region(struct kvm *kvm, new->npages = npages; new->flags = mem->flags; new->userspace_addr = mem->userspace_addr; + new->private_file = file; + new->private_offset = mem->flags & KVM_MEM_PRIVATE ? + region_ext->private_offset : 0; r = kvm_set_memslot(kvm, old, new, change); - if (r) - kfree(new); + if (!r) + return r; + + kfree(new); +out: + if (file) + fput(file); return r; } EXPORT_SYMBOL_GPL(__kvm_set_memory_region); From patchwork Tue Jan 18 13:21:20 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716377 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id CD70BC433FE for ; Tue, 18 Jan 2022 13:23:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243240AbiARNXm (ORCPT ); Tue, 18 Jan 2022 08:23:42 -0500 Received: from mga01.intel.com ([192.55.52.88]:23105 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234488AbiARNXg (ORCPT ); Tue, 18 Jan 2022 08:23:36 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512216; x=1674048216; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=Doiq2R3hAItBU5LQnTBxQCkUWyVHPsvtgI1nQR7ap4A=; b=VTH0UeXkWQffOiJ85URJghn7A7SpieAcgukEl1XFuVfQkjVbxiErW/Dg 9q9rVCE/MDD2fwQqPPmnbaD+iHy7zBYJz/HnKK9S5uQA0ATOXBYMLZoVw qWfJNqZo2U9cP9XLm4GHReZ23HErQpM+KQDNDfsXU6TeUjD7a8P47rlr/ CYuaR9N/ejA2Gp7wXnU3auM9bph6KLg/gG7S94qUN8Wnncp+tFsM/nWlK 2tnZ/h5EQIqit0W1PVM2sMFc4tOYGpAkqQRwCxgDx27rvPEygOLN66UbK AYQEoJWFKSCYRYL7uHyWtyoxhGB59m2Y9GarvIQLULau6MU65YnA3SJCu Q==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="269193757" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="269193757" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:23:19 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791936" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:23:11 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 11/12] KVM: Zap existing KVM mappings when pages changed in the private fd Date: Tue, 18 Jan 2022 21:21:20 +0800 Message-Id: <20220118132121.31388-12-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org KVM gets notified when memory pages changed in the memory backing store. When userspace allocates the memory with fallocate() or frees memory with fallocate(FALLOC_FL_PUNCH_HOLE), memory backing store calls into KVM fallocate/invalidate callbacks respectively. To ensure KVM never maps both the private and shared variants of a GPA into the guest, in the fallocate callback, we should zap the existing shared mapping and in the invalidate callback we should zap the existing private mapping. In the callbacks, KVM firstly converts the offset range into the gfn_range and then calls existing kvm_unmap_gfn_range() which will zap the shared or private mapping. Both callbacks pass in a memslot reference but we need 'kvm' so add a reference in memslot structure. Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 3 ++- virt/kvm/kvm_main.c | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+), 1 deletion(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 667efe839767..117cf0da9c5e 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -235,7 +235,7 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #endif -#ifdef KVM_ARCH_WANT_MMU_NOTIFIER +#if defined(KVM_ARCH_WANT_MMU_NOTIFIER) || defined(CONFIG_MEMFILE_NOTIFIER) struct kvm_gfn_range { struct kvm_memory_slot *slot; gfn_t start; @@ -463,6 +463,7 @@ struct kvm_memory_slot { loff_t private_offset; struct memfile_pfn_ops *pfn_ops; struct memfile_notifier notifier; + struct kvm *kvm; }; static inline bool kvm_slot_is_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 6b78ddef7880..10e553215618 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -847,8 +847,43 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ #ifdef CONFIG_MEMFILE_NOTIFIER +static void kvm_memfile_notifier_handler(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end) +{ + int idx; + struct kvm_memory_slot *slot = container_of(notifier, + struct kvm_memory_slot, + notifier); + struct kvm_gfn_range gfn_range = { + .slot = slot, + .start = start - (slot->private_offset >> PAGE_SHIFT), + .end = end - (slot->private_offset >> PAGE_SHIFT), + .may_block = true, + }; + struct kvm *kvm = slot->kvm; + + gfn_range.start = max(gfn_range.start, slot->base_gfn); + gfn_range.end = min(gfn_range.end, slot->base_gfn + slot->npages); + + if (gfn_range.start >= gfn_range.end) + return; + + idx = srcu_read_lock(&kvm->srcu); + KVM_MMU_LOCK(kvm); + kvm_unmap_gfn_range(kvm, &gfn_range); + kvm_flush_remote_tlbs(kvm); + KVM_MMU_UNLOCK(kvm); + srcu_read_unlock(&kvm->srcu, idx); +} + +static struct memfile_notifier_ops kvm_memfile_notifier_ops = { + .invalidate = kvm_memfile_notifier_handler, + .fallocate = kvm_memfile_notifier_handler, +}; + static inline int kvm_memfile_register(struct 
kvm_memory_slot *slot) { + slot->notifier.ops = &kvm_memfile_notifier_ops; return memfile_register_notifier(file_inode(slot->private_file), &slot->notifier, &slot->pfn_ops); @@ -1969,6 +2004,7 @@ int __kvm_set_memory_region(struct kvm *kvm, new->private_file = file; new->private_offset = mem->flags & KVM_MEM_PRIVATE ? region_ext->private_offset : 0; + new->kvm = kvm; r = kvm_set_memslot(kvm, old, new, change); if (!r) From patchwork Tue Jan 18 13:21:21 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12716378 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D18AEC433F5 for ; Tue, 18 Jan 2022 13:23:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243830AbiARNXy (ORCPT ); Tue, 18 Jan 2022 08:23:54 -0500 Received: from mga01.intel.com ([192.55.52.88]:23096 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239194AbiARNXh (ORCPT ); Tue, 18 Jan 2022 08:23:37 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1642512217; x=1674048217; h=from:to:cc:subject:date:message-id:in-reply-to: references; bh=WMNkcnSDaEvDrlJP1KgKm0jYnNbblXf6Dmx4W+l5lNg=; b=WI9lcyXo8WncTmLnkUnHTCisu7JSQHnqeuG6Ffr+LZDs1lWVwrZgwxgb ee0IYfE2QazcVb2LdkCLCrt3mktDX9W5I3VI51XpEaC9I+/0MGeVak7/B 2JH2oCOOcfV5vD1HINH49rG2eKLxzfpFwUFfKV8j2EPylMOGuLDSSdIrD wr4ipODnjsJ5uQMPn5d3uRc578Z6OULTE029kx0KiZTxGGbeMFiztkH1q jQCQdSAfb1p4bE+g+goINKSjCAJewGco7eTwm2O6tFuPSvRKodkS1tBbe Pq8OBeLy9Wa96OofRq/MZ2xrmR4tyjtN5nNRsIMgul4e82Bp7Dw0f/eJE g==; X-IronPort-AV: E=McAfee;i="6200,9189,10230"; a="269193767" X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="269193767" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by fmsmga101.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 18 Jan 2022 05:23:26 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.88,297,1635231600"; d="scan'208";a="531791967" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga008.jf.intel.com with ESMTP; 18 Jan 2022 05:23:19 -0800 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, qemu-devel@nongnu.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com Subject: [PATCH v4 12/12] KVM: Expose KVM_MEM_PRIVATE Date: Tue, 18 Jan 2022 21:21:21 +0800 Message-Id: <20220118132121.31388-13-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: <20220118132121.31388-1-chao.p.peng@linux.intel.com> References: <20220118132121.31388-1-chao.p.peng@linux.intel.com> Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org KVM_MEM_PRIVATE is not exposed by default; architecture code can turn it on by implementing kvm_arch_private_memory_supported(). Also, a private memslot cannot be moved, and the same file+offset cannot be mapped into different GFNs.
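As a concrete illustration, an architecture opts in by overriding the __weak helper added by this patch; without such an override, check_memory_region_flags() keeps rejecting KVM_MEM_PRIVATE as an invalid flag. The predicate below is hypothetical (roughly what a TDX/SEV-style arch might check), not part of this series.

bool kvm_arch_private_memory_supported(struct kvm *kvm)
{
	/* hypothetical: only confidential VMs are allowed KVM_MEM_PRIVATE */
	return kvm_is_confidential_vm(kvm);
}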
Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 49 ++++++++++++++++++++++++++++++++++------ 2 files changed, 43 insertions(+), 7 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 117cf0da9c5e..444b390261c0 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -1328,6 +1328,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); int kvm_arch_create_vm_debugfs(struct kvm *kvm); +bool kvm_arch_private_memory_supported(struct kvm *kvm); #ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 10e553215618..51d0f08a8601 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1491,10 +1491,19 @@ static void kvm_replace_memslot(struct kvm *kvm, } } -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) +bool __weak kvm_arch_private_memory_supported(struct kvm *kvm) +{ + return false; +} + +static int check_memory_region_flags(struct kvm *kvm, + const struct kvm_userspace_memory_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; + if (kvm_arch_private_memory_supported(kvm)) + valid_flags |= KVM_MEM_PRIVATE; + #ifdef __KVM_HAVE_READONLY_MEM valid_flags |= KVM_MEM_READONLY; #endif @@ -1873,15 +1882,32 @@ static int kvm_set_memslot(struct kvm *kvm, } static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, - gfn_t start, gfn_t end) + struct file *file, + gfn_t start, gfn_t end, + loff_t start_off, loff_t end_off) { struct kvm_memslot_iter iter; + struct kvm_memory_slot *slot; + struct inode *inode; + int bkt; kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) { if (iter.slot->id != id) return true; } + /* Disallow mapping the same file+offset into multiple gfns. */ + if (file) { + inode = file_inode(file); + kvm_for_each_memslot(slot, bkt, slots) { + if (slot->private_file && + file_inode(slot->private_file) == inode && + !(end_off <= slot->private_offset || + start_off >= slot->private_offset + + (slot->npages >> PAGE_SHIFT))) + return true; + } + } return false; } @@ -1906,7 +1932,7 @@ int __kvm_set_memory_region(struct kvm *kvm, int as_id, id; int r; - r = check_memory_region_flags(mem); + r = check_memory_region_flags(kvm, mem); if (r) return r; @@ -1919,10 +1945,12 @@ int __kvm_set_memory_region(struct kvm *kvm, return -EINVAL; if (mem->guest_phys_addr & (PAGE_SIZE - 1)) return -EINVAL; - /* We can read the guest memory with __xxx_user() later on. */ if ((mem->userspace_addr & (PAGE_SIZE - 1)) || - (mem->userspace_addr != untagged_addr(mem->userspace_addr)) || - !access_ok((void __user *)(unsigned long)mem->userspace_addr, + (mem->userspace_addr != untagged_addr(mem->userspace_addr))) + return -EINVAL; + /* We can read the guest memory with __xxx_user() later on. */ + if (!(mem->flags & KVM_MEM_PRIVATE) && + !access_ok((void __user *)(unsigned long)mem->userspace_addr, mem->memory_size)) return -EINVAL; if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) @@ -1963,6 +1991,9 @@ int __kvm_set_memory_region(struct kvm *kvm, if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages) return -EINVAL; } else { /* Modify an existing slot. */ + /* Private memslots are immutable, they can only be deleted. 
*/ + if (mem->flags & KVM_MEM_PRIVATE) + return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || ((mem->flags ^ old->flags) & KVM_MEM_READONLY)) @@ -1983,7 +2014,11 @@ int __kvm_set_memory_region(struct kvm *kvm, } if ((change == KVM_MR_CREATE || change == KVM_MR_MOVE) && - kvm_check_memslot_overlap(slots, id, base_gfn, base_gfn + npages)) { + kvm_check_memslot_overlap(slots, id, file, + base_gfn, base_gfn + npages, + region_ext->private_offset, + region_ext->private_offset + + mem->memory_size)) { r = -EEXIST; goto out; }