From patchwork Wed Jul 6 08:20:03 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12907528
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org
Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A .
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 01/14] mm: Add F_SEAL_AUTO_ALLOCATE seal to memfd Date: Wed, 6 Jul 2022 16:20:03 +0800 Message-Id: <20220706082016.2603916-2-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Normally, a write to unallocated space of a file or the hole of a sparse file automatically causes space allocation, for memfd, this equals to memory allocation. This new seal prevents such automatically allocating, either this is from a direct write() or a write on the previously mmap-ed area. The seal does not prevent fallocate() so an explicit fallocate() can still cause allocating and can be used to reserve memory. This is used to prevent unintentional allocation from userspace on a stray or careless write and any intentional allocation should use an explicit fallocate(). One of the main usecases is to avoid memory double allocation for confidential computing usage where we use two memfds to back guest memory and at a single point only one memfd is alive and we want to prevent memory allocation for the other memfd which may have been mmap-ed previously. More discussion can be found at: https://lkml.org/lkml/2022/6/14/1255 Suggested-by: Sean Christopherson Signed-off-by: Chao Peng --- include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..2afd898798e4 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \ - F_SEAL_FUTURE_WRITE) + F_SEAL_FUTURE_WRITE | \ + F_SEAL_AUTO_ALLOCATE) static int memfd_add_seals(struct file *file, unsigned int seals) { diff --git a/mm/shmem.c b/mm/shmem.c index a6f565308133..6c8aef15a17d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct inode *inode = file_inode(vma->vm_file); gfp_t gfp = mapping_gfp_mask(inode->i_mapping); + struct shmem_inode_info *info = SHMEM_I(inode); + enum sgp_type sgp; int err; vm_fault_t ret = VM_FAULT_LOCKED; @@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) spin_unlock(&inode->i_lock); } - err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE, + if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) + sgp = SGP_NOALLOC; + else + sgp = SGP_CACHE; + + err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp, gfp, vma, vmf, &ret); if (err) return vmf_error(err); @@ -2459,6 +2466,7 @@ 
shmem_write_begin(struct file *file, struct address_space *mapping, struct inode *inode = mapping->host; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t index = pos >> PAGE_SHIFT; + enum sgp_type sgp; int ret = 0; /* i_rwsem is held by caller */ @@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping, return -EPERM; } - ret = shmem_getpage(inode, index, pagep, SGP_WRITE); + if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) + sgp = SGP_NOALLOC; + else + sgp = SGP_WRITE; + ret = shmem_getpage(inode, index, pagep, sgp); if (ret) return ret; From patchwork Wed Jul 6 08:20:04 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907529 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 882D6CCA480 for ; Wed, 6 Jul 2022 08:24:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232096AbiGFIY0 (ORCPT ); Wed, 6 Jul 2022 04:24:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34780 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231765AbiGFIYJ (ORCPT ); Wed, 6 Jul 2022 04:24:09 -0400 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 778CD65CE; Wed, 6 Jul 2022 01:24:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095848; x=1688631848; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=kJl1R88jMUFb5Y8ov4QxxubYj5SsjR8PCdbcDnNsdhM=; b=WXFTAL9EeYSlctTOynMN6MTUfDDnRM2dTJfLimkHX7cvI0+qdvkIFEUV QovYwEE7JKQ8iBd1wbiTIvBYGYmFuffDXl1vRQRwx2WzqkeQKTNKh/pNY +5O+iFQ5ge89sZf6K6y3XrhOKJw/KrPHwv9ey0lZzLnO8jUEGvtu7OHBy Ku9Se5pzGUKXi/Rfw/z2NiHXHhBClMB1G1BAuQjQr4HdEkQdvyuzLb6bR gWddgPPN4D4ZfnIkJ0U5N8YmOsYVLbNp5Dn4f7WjpjiNUA//Y91/2anj3 xrcNiihtqSILbvN1WMgKOe4h0pfVmBy9bRjKOrIcDxJIwE1wqYH/t1Y5t A==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="282433194" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="282433194" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:24:08 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567967851" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:23:57 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 02/14] selftests/memfd: Add tests for F_SEAL_AUTO_ALLOCATE Date: Wed, 6 Jul 2022 16:20:04 +0800 Message-Id: <20220706082016.2603916-3-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Add tests to verify sealing memfds with the F_SEAL_AUTO_ALLOCATE works as expected. Signed-off-by: Chao Peng --- tools/testing/selftests/memfd/memfd_test.c | 166 +++++++++++++++++++++ 1 file changed, 166 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 94df2692e6e4..b849ece295fd 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -232,6 +233,31 @@ static void mfd_fail_open(int fd, int flags, mode_t mode) } } +static void mfd_assert_fallocate(int fd) +{ + int r; + + r = fallocate(fd, 0, 0, mfd_def_size); + if (r < 0) { + printf("fallocate(ALLOC) failed: %m\n"); + abort(); + } +} + +static void mfd_assert_punch_hole(int fd) +{ + int r; + + r = fallocate(fd, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + 0, + mfd_def_size); + if (r < 0) { + printf("fallocate(PUNCH_HOLE) failed: %m\n"); + abort(); + } +} + static void mfd_assert_read(int fd) { char buf[16]; @@ -594,6 +620,94 @@ static void mfd_fail_grow_write(int fd) } } +static void mfd_assert_hole_write(int fd) +{ + ssize_t l; + void *p; + char *p1; + + /* + * huegtlbfs does not support write, but we want to + * verify everything else here. 
+ */ + if (!hugetlbfs_test) { + /* verify direct write() succeeds */ + l = write(fd, "\0\0\0\0", 4); + if (l != 4) { + printf("write() failed: %m\n"); + abort(); + } + } + + /* verify mmaped write succeeds */ + p = mmap(NULL, + mfd_def_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + p1 = (char *)p + mfd_def_size - 1; + *p1 = 'H'; + if (*p1 != 'H') { + printf("mmaped write failed: %m\n"); + abort(); + + } + munmap(p, mfd_def_size); +} + +sigjmp_buf jbuf, *sigbuf; +static void sig_handler(int sig, siginfo_t *siginfo, void *ptr) +{ + if (sig == SIGBUS) { + if (sigbuf) + siglongjmp(*sigbuf, 1); + abort(); + } +} + +static void mfd_fail_hole_write(int fd) +{ + ssize_t l; + void *p; + char *p1; + + /* verify direct write() fails */ + l = write(fd, "data", 4); + if (l > 0) { + printf("expected failure on write(), but got %d: %m\n", (int)l); + abort(); + } + + /* verify mmaped write fails */ + p = mmap(NULL, + mfd_def_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + fd, + 0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + + sigbuf = &jbuf; + if (sigsetjmp(*sigbuf, 1)) + goto out; + + /* Below write should trigger SIGBUS signal */ + p1 = (char *)p + mfd_def_size - 1; + *p1 = 'H'; + printf("failed to receive SIGBUS for mmaped write: %m\n"); + abort(); +out: + munmap(p, mfd_def_size); +} + static int idle_thread_fn(void *arg) { sigset_t set; @@ -880,6 +994,57 @@ static void test_seal_resize(void) close(fd); } +/* + * Test F_SEAL_AUTO_ALLOCATE + * Test whether F_SEAL_AUTO_ALLOCATE actually prevents allocation. + */ +static void test_seal_auto_allocate(void) +{ + struct sigaction act; + int fd; + + printf("%s SEAL-AUTO-ALLOCATE\n", memfd_str); + + memset(&act, 0, sizeof(act)); + act.sa_sigaction = sig_handler; + act.sa_flags = SA_SIGINFO; + if (sigaction(SIGBUS, &act, 0)) { + printf("sigaction() failed: %m\n"); + abort(); + } + + fd = mfd_assert_new("kern_memfd_seal_auto_allocate", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + /* read/write should pass if F_SEAL_AUTO_ALLOCATE not set */ + mfd_assert_read(fd); + mfd_assert_hole_write(fd); + + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_AUTO_ALLOCATE); + mfd_assert_has_seals(fd, F_SEAL_AUTO_ALLOCATE); + + /* read/write should pass for pre-allocated area */ + mfd_assert_read(fd); + mfd_assert_hole_write(fd); + + mfd_assert_punch_hole(fd); + + /* read should pass, write should fail in hole */ + mfd_assert_read(fd); + mfd_fail_hole_write(fd); + + mfd_assert_fallocate(fd); + + /* read/write should pass after fallocate */ + mfd_assert_read(fd); + mfd_assert_hole_write(fd); + + close(fd); +} + + /* * Test sharing via dup() * Test that seals are shared between dupped FDs and they're all equal. 
@@ -1059,6 +1224,7 @@ int main(int argc, char **argv) test_seal_shrink(); test_seal_grow(); test_seal_resize(); + test_seal_auto_allocate(); test_share_dup("SHARE-DUP", ""); test_share_mmap("SHARE-MMAP", ""); From patchwork Wed Jul 6 08:20:05 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907530 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A480DCCA482 for ; Wed, 6 Jul 2022 08:24:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232009AbiGFIYs (ORCPT ); Wed, 6 Jul 2022 04:24:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34724 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232054AbiGFIYU (ORCPT ); Wed, 6 Jul 2022 04:24:20 -0400 Received: from mga18.intel.com (mga18.intel.com [134.134.136.126]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 751502317C; Wed, 6 Jul 2022 01:24:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095859; x=1688631859; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=4RqYsqOgMgPs4JDgf0SUKlW6JB5EWiu8UDnwronwhT8=; b=Fp6nN8p46kB/kFuRh3LhdQL+ikFtUaqnd7wgLeyZc4jRRwQIhK0jrhgQ GOk/qXDMDyOqRnTkO8snt/j3AdHD2qSJr3da+AFfLCOLr/71b+ysABvpy EPqaWiyNAl2mfPGufH52Ufnd0/GPf/eSRyibIxmcc5z1OYIvL2hY3cpcg fUwRiVCr+M3ik96XH5WzZJ6lh/iZXiLJ5Fbte1VRlhzmIjTHjc9yMQf0f IZGPlCgZpo0fw2DnoPenSCd+kQefFcP6+puQZIQrh2I1e1Ci57vl7xiC8 6che/8V4HAy5yt6oHI5dF1mskXnS4ezVg9ZYM0kBqiTZa4c/yognJgSJb A==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="266709759" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="266709759" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:24:18 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567967876" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:24:08 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 03/14] mm: Introduce memfile_notifier Date: Wed, 6 Jul 2022 16:20:05 +0800 Message-Id: <20220706082016.2603916-4-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This patch introduces memfile_notifier facility so existing memory file subsystems (e.g. tmpfs/hugetlbfs) can provide memory pages to allow a third kernel component to make use of memory bookmarked in the memory file and gets notified when the pages in the memory file become invalidated. It will be used for KVM to use a file descriptor as the guest memory backing store and KVM will use this memfile_notifier interface to interact with memory file subsystems. In the future there might be other consumers (e.g. VFIO with encrypted device memory). It consists below components: - memfile_backing_store: Each supported memory file subsystem can be implemented as a memory backing store which bookmarks memory and provides callbacks for other kernel systems (memfile_notifier consumers) to interact with. - memfile_notifier: memfile_notifier consumers defines callbacks and associate them to a file using memfile_register_notifier(). - memfile_node: A memfile_node is associated with the file (inode) from the backing store and includes feature flags and a list of registered memfile_notifier for notifying. In KVM usages, userspace is in charge of guest memory lifecycle: it first allocates pages in memory backing store and then passes the fd to KVM and lets KVM register memory slot to memory backing store via memfile_register_notifier. Co-developed-by: Kirill A. Shutemov Signed-off-by: Kirill A. Shutemov Signed-off-by: Chao Peng --- include/linux/memfile_notifier.h | 93 ++++++++++++++++++++++++ mm/Kconfig | 4 + mm/Makefile | 1 + mm/memfile_notifier.c | 121 +++++++++++++++++++++++++++++++ 4 files changed, 219 insertions(+) create mode 100644 include/linux/memfile_notifier.h create mode 100644 mm/memfile_notifier.c diff --git a/include/linux/memfile_notifier.h b/include/linux/memfile_notifier.h new file mode 100644 index 000000000000..c5d66fd8ba53 --- /dev/null +++ b/include/linux/memfile_notifier.h @@ -0,0 +1,93 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_MEMFILE_NOTIFIER_H +#define _LINUX_MEMFILE_NOTIFIER_H + +#include +#include +#include +#include +#include + +/* memory in the file is inaccessible from userspace (e.g. read/write/mmap) */ +#define MEMFILE_F_USER_INACCESSIBLE BIT(0) +/* memory in the file is unmovable (e.g. via pagemigration)*/ +#define MEMFILE_F_UNMOVABLE BIT(1) +/* memory in the file is unreclaimable (e.g. 
via kswapd) */ +#define MEMFILE_F_UNRECLAIMABLE BIT(2) + +#define MEMFILE_F_ALLOWED_MASK (MEMFILE_F_USER_INACCESSIBLE | \ + MEMFILE_F_UNMOVABLE | \ + MEMFILE_F_UNRECLAIMABLE) + +struct memfile_node { + struct list_head notifiers; /* registered notifiers */ + unsigned long flags; /* MEMFILE_F_* flags */ +}; + +struct memfile_backing_store { + struct list_head list; + spinlock_t lock; + struct memfile_node* (*lookup_memfile_node)(struct file *file); + int (*get_pfn)(struct file *file, pgoff_t offset, pfn_t *pfn, + int *order); + void (*put_pfn)(pfn_t pfn); +}; + +struct memfile_notifier; +struct memfile_notifier_ops { + void (*invalidate)(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end); +}; + +struct memfile_notifier { + struct list_head list; + struct memfile_notifier_ops *ops; + struct memfile_backing_store *bs; +}; + +static inline void memfile_node_init(struct memfile_node *node) +{ + INIT_LIST_HEAD(&node->notifiers); + node->flags = 0; +} + +#ifdef CONFIG_MEMFILE_NOTIFIER +/* APIs for backing stores */ +extern void memfile_register_backing_store(struct memfile_backing_store *bs); +extern int memfile_node_set_flags(struct file *file, unsigned long flags); +extern void memfile_notifier_invalidate(struct memfile_node *node, + pgoff_t start, pgoff_t end); +/*APIs for notifier consumers */ +extern int memfile_register_notifier(struct file *file, unsigned long flags, + struct memfile_notifier *notifier); +extern void memfile_unregister_notifier(struct memfile_notifier *notifier); + +#else /* !CONFIG_MEMFILE_NOTIFIER */ +static inline void memfile_register_backing_store(struct memfile_backing_store *bs) +{ +} + +static inline int memfile_node_set_flags(struct file *file, unsigned long flags) +{ + return -EOPNOTSUPP; +} + +static inline void memfile_notifier_invalidate(struct memfile_node *node, + pgoff_t start, pgoff_t end) +{ +} + +static inline int memfile_register_notifier(struct file *file, + unsigned long flags, + struct memfile_notifier *notifier) +{ + return -EOPNOTSUPP; +} + +static inline void memfile_unregister_notifier(struct memfile_notifier *notifier) +{ +} + +#endif /* CONFIG_MEMFILE_NOTIFIER */ + +#endif /* _LINUX_MEMFILE_NOTIFIER_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 169e64192e48..19ab9350f5cb 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1130,6 +1130,10 @@ config PTE_MARKER_UFFD_WP purposes. It is required to enable userfaultfd write protection on file-backed memory types like shmem and hugetlbfs. +config MEMFILE_NOTIFIER + bool + select SRCU + source "mm/damon/Kconfig" endmenu diff --git a/mm/Makefile b/mm/Makefile index 6f9ffa968a1a..b7e3fb5fa85b 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -133,3 +133,4 @@ obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o obj-$(CONFIG_IO_MAPPING) += io-mapping.o obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o +obj-$(CONFIG_MEMFILE_NOTIFIER) += memfile_notifier.o diff --git a/mm/memfile_notifier.c b/mm/memfile_notifier.c new file mode 100644 index 000000000000..799d3197903e --- /dev/null +++ b/mm/memfile_notifier.c @@ -0,0 +1,121 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2022 Intel Corporation. 
+ * Chao Peng + */ + +#include +#include +#include + +DEFINE_STATIC_SRCU(memfile_srcu); +static __ro_after_init LIST_HEAD(backing_store_list); + + +void memfile_notifier_invalidate(struct memfile_node *node, + pgoff_t start, pgoff_t end) +{ + struct memfile_notifier *notifier; + int id; + + id = srcu_read_lock(&memfile_srcu); + list_for_each_entry_srcu(notifier, &node->notifiers, list, + srcu_read_lock_held(&memfile_srcu)) { + if (notifier->ops->invalidate) + notifier->ops->invalidate(notifier, start, end); + } + srcu_read_unlock(&memfile_srcu, id); +} + +void __init memfile_register_backing_store(struct memfile_backing_store *bs) +{ + spin_lock_init(&bs->lock); + list_add_tail(&bs->list, &backing_store_list); +} + +static void memfile_node_update_flags(struct file *file, unsigned long flags) +{ + struct address_space *mapping = file_inode(file)->i_mapping; + gfp_t gfp; + + gfp = mapping_gfp_mask(mapping); + if (flags & MEMFILE_F_UNMOVABLE) + gfp &= ~__GFP_MOVABLE; + else + gfp |= __GFP_MOVABLE; + mapping_set_gfp_mask(mapping, gfp); + + if (flags & MEMFILE_F_UNRECLAIMABLE) + mapping_set_unevictable(mapping); + else + mapping_clear_unevictable(mapping); +} + +int memfile_node_set_flags(struct file *file, unsigned long flags) +{ + struct memfile_backing_store *bs; + struct memfile_node *node; + + if (flags & ~MEMFILE_F_ALLOWED_MASK) + return -EINVAL; + + list_for_each_entry(bs, &backing_store_list, list) { + node = bs->lookup_memfile_node(file); + if (node) { + spin_lock(&bs->lock); + node->flags = flags; + spin_unlock(&bs->lock); + memfile_node_update_flags(file, flags); + return 0; + } + } + + return -EOPNOTSUPP; +} + +int memfile_register_notifier(struct file *file, unsigned long flags, + struct memfile_notifier *notifier) +{ + struct memfile_backing_store *bs; + struct memfile_node *node; + struct list_head *list; + + if (!file || !notifier || !notifier->ops) + return -EINVAL; + if (flags & ~MEMFILE_F_ALLOWED_MASK) + return -EINVAL; + + list_for_each_entry(bs, &backing_store_list, list) { + node = bs->lookup_memfile_node(file); + if (node) { + list = &node->notifiers; + notifier->bs = bs; + + spin_lock(&bs->lock); + if (list_empty(list)) + node->flags = flags; + else if (node->flags ^ flags) { + spin_unlock(&bs->lock); + return -EINVAL; + } + + list_add_rcu(¬ifier->list, list); + spin_unlock(&bs->lock); + memfile_node_update_flags(file, flags); + return 0; + } + } + + return -EOPNOTSUPP; +} +EXPORT_SYMBOL_GPL(memfile_register_notifier); + +void memfile_unregister_notifier(struct memfile_notifier *notifier) +{ + spin_lock(¬ifier->bs->lock); + list_del_rcu(¬ifier->list); + spin_unlock(¬ifier->bs->lock); + + synchronize_srcu(&memfile_srcu); +} +EXPORT_SYMBOL_GPL(memfile_unregister_notifier); From patchwork Wed Jul 6 08:20:06 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907531 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D8900CCA47C for ; Wed, 6 Jul 2022 08:24:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231964AbiGFIYy (ORCPT ); Wed, 6 Jul 2022 04:24:54 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35274 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231962AbiGFIYd (ORCPT ); Wed, 6 Jul 2022 04:24:33 
-0400 Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F0805248C3; Wed, 6 Jul 2022 01:24:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095871; x=1688631871; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=MKQ/rcKgV3drdHlWuFoH8sIMBaUN6eYji0w3R3tli4U=; b=HkSexBI8iotE10lNxXJYeNtQWP+CackIKTyXVfkd9yZPbztQMxwjnJ5D euu6Apo6Fh6FA4R07YCc6Zb4UGITL7DXzsU4SJFogP3kbLnSSWZz52uSa 12IwMn3X1zGzUm1Gt/1/OMUOV9qTImUCtvJ1NErEKsTIOh5e3yOTENjmt iq+lXgvLr39r0MzGseH32qiw7V0mHV/0GcPwhZH75SqgK04mNeildGCsC cBoeVaCHjU9n6tZSiX4M7pmKUImAPFJgbj1xwL4N0guFtfHEryUU/du36 WrPXDy+KyF9YmrTpX2LTWwDGhICHIoD0hjZ2KfIxFTHZGaXY1bka11ntv A==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="264100019" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="264100019" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:24:30 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567967895" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:24:19 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 04/14] mm/shmem: Support memfile_notifier Date: Wed, 6 Jul 2022 16:20:06 +0800 Message-Id: <20220706082016.2603916-5-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: "Kirill A. Shutemov" Implement shmem as a memfile_notifier backing store. Essentially it interacts with the memfile_notifier feature flags for userspace access/page migration/page reclaiming and implements the necessary memfile_backing_store callbacks. Signed-off-by: Kirill A. 
Shutemov Signed-off-by: Chao Peng --- include/linux/shmem_fs.h | 2 + mm/shmem.c | 109 ++++++++++++++++++++++++++++++++++++++- 2 files changed, 110 insertions(+), 1 deletion(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index a68f982f22d1..6031c0b08d26 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -9,6 +9,7 @@ #include #include #include +#include /* inode in-kernel data */ @@ -25,6 +26,7 @@ struct shmem_inode_info { struct simple_xattrs xattrs; /* list of xattrs */ atomic_t stop_eviction; /* hold when working on inode */ struct timespec64 i_crtime; /* file creation time */ + struct memfile_node memfile_node; /* memfile node */ struct inode vfs_inode; }; diff --git a/mm/shmem.c b/mm/shmem.c index 6c8aef15a17d..627e315c3b4d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -905,6 +905,17 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index) return page ? page_folio(page) : NULL; } +static void notify_invalidate(struct inode *inode, struct folio *folio, + pgoff_t start, pgoff_t end) +{ + struct shmem_inode_info *info = SHMEM_I(inode); + + start = max(start, folio->index); + end = min(end, folio->index + folio_nr_pages(folio)); + + memfile_notifier_invalidate(&info->memfile_node, start, end); +} + /* * Remove range of pages and swap entries from page cache, and free them. * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate. @@ -948,6 +959,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, } index += folio_nr_pages(folio) - 1; + notify_invalidate(inode, folio, start, end); + if (!unfalloc || !folio_test_uptodate(folio)) truncate_inode_folio(mapping, folio); folio_unlock(folio); @@ -1021,6 +1034,9 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend, index--; break; } + + notify_invalidate(inode, folio, start, end); + VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio); truncate_inode_folio(mapping, folio); @@ -1092,6 +1108,13 @@ static int shmem_setattr(struct user_namespace *mnt_userns, (newsize > oldsize && (info->seals & F_SEAL_GROW))) return -EPERM; + if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) { + if (oldsize) + return -EPERM; + if (!PAGE_ALIGNED(newsize)) + return -EINVAL; + } + if (newsize != oldsize) { error = shmem_reacct_size(SHMEM_I(inode)->flags, oldsize, newsize); @@ -1336,6 +1359,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) goto redirty; if (!total_swap_pages) goto redirty; + if (info->memfile_node.flags & MEMFILE_F_UNRECLAIMABLE) + goto redirty; /* * Our capabilities prevent regular writeback or sync from ever calling @@ -2271,6 +2296,9 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret; + if (info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) + return -EPERM; + /* arm64 - allow memory tagging on RAM-based files */ vma->vm_flags |= VM_MTE_ALLOWED; @@ -2306,6 +2334,7 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode info->i_crtime = inode->i_mtime; INIT_LIST_HEAD(&info->shrinklist); INIT_LIST_HEAD(&info->swaplist); + memfile_node_init(&info->memfile_node); simple_xattrs_init(&info->xattrs); cache_no_acl(inode); mapping_set_large_folios(inode->i_mapping); @@ -2477,6 +2506,8 @@ shmem_write_begin(struct file *file, struct address_space *mapping, if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; } + if (unlikely(info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE)) + return 
-EPERM; if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) sgp = SGP_NOALLOC; @@ -2556,6 +2587,13 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) end_index = i_size >> PAGE_SHIFT; if (index > end_index) break; + + if (SHMEM_I(inode)->memfile_node.flags & + MEMFILE_F_USER_INACCESSIBLE) { + error = -EPERM; + break; + } + if (index == end_index) { nr = i_size & ~PAGE_MASK; if (nr <= offset) @@ -2697,6 +2735,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, goto out; } + if ((info->memfile_node.flags & MEMFILE_F_USER_INACCESSIBLE) && + (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))) { + error = -EINVAL; + goto out; + } + shmem_falloc.waitq = &shmem_falloc_waitq; shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT; shmem_falloc.next = (unmap_end + 1) >> PAGE_SHIFT; @@ -3806,6 +3850,20 @@ static int shmem_error_remove_page(struct address_space *mapping, return 0; } +#ifdef CONFIG_MIGRATION +static int shmem_migrate_page(struct address_space *mapping, + struct page *newpage, struct page *page, + enum migrate_mode mode) +{ + struct inode *inode = mapping->host; + struct shmem_inode_info *info = SHMEM_I(inode); + + if (info->memfile_node.flags & MEMFILE_F_UNMOVABLE) + return -EOPNOTSUPP; + return migrate_page(mapping, newpage, page, mode); +} +#endif + const struct address_space_operations shmem_aops = { .writepage = shmem_writepage, .dirty_folio = noop_dirty_folio, @@ -3814,7 +3872,7 @@ const struct address_space_operations shmem_aops = { .write_end = shmem_write_end, #endif #ifdef CONFIG_MIGRATION - .migratepage = migrate_page, + .migratepage = shmem_migrate_page, #endif .error_remove_page = shmem_error_remove_page, }; @@ -3931,6 +3989,51 @@ static struct file_system_type shmem_fs_type = { .fs_flags = FS_USERNS_MOUNT, }; +#ifdef CONFIG_MEMFILE_NOTIFIER +static struct memfile_node *shmem_lookup_memfile_node(struct file *file) +{ + struct inode *inode = file_inode(file); + + if (!shmem_mapping(inode->i_mapping)) + return NULL; + + return &SHMEM_I(inode)->memfile_node; +} + + +static int shmem_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn, + int *order) +{ + struct page *page; + int ret; + + ret = shmem_getpage(file_inode(file), offset, &page, SGP_WRITE); + if (ret) + return ret; + + unlock_page(page); + *pfn = page_to_pfn_t(page); + *order = thp_order(compound_head(page)); + return 0; +} + +static void shmem_put_pfn(pfn_t pfn) +{ + struct page *page = pfn_t_to_page(pfn); + + if (!page) + return; + + put_page(page); +} + +static struct memfile_backing_store shmem_backing_store = { + .lookup_memfile_node = shmem_lookup_memfile_node, + .get_pfn = shmem_get_pfn, + .put_pfn = shmem_put_pfn, +}; +#endif /* CONFIG_MEMFILE_NOTIFIER */ + void __init shmem_init(void) { int error; @@ -3956,6 +4059,10 @@ void __init shmem_init(void) else shmem_huge = SHMEM_HUGE_NEVER; /* just in case it was patched */ #endif + +#ifdef CONFIG_MEMFILE_NOTIFIER + memfile_register_backing_store(&shmem_backing_store); +#endif return; out1: From patchwork Wed Jul 6 08:20:07 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907532 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 94AD3CCA483 for ; Wed, 6 Jul 2022 08:25:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via 
listexpand id S232179AbiGFIZO (ORCPT ); Wed, 6 Jul 2022 04:25:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34532 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231970AbiGFIYm (ORCPT ); Wed, 6 Jul 2022 04:24:42 -0400 Received: from mga11.intel.com (mga11.intel.com [192.55.52.93]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BC9932409C; Wed, 6 Jul 2022 01:24:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095881; x=1688631881; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=Po9DZUbErgtxR1/AAtJ9I8hzg80rqfcA4AORQCSRJWU=; b=asFMRB4bOYHEGR1jijlfUec4Z9tfjZDhTaPrLM8WrPL+ycXD5SHKn0VL IvqTB7/U10t7CKapRrUsM0gbLi/ZJ/ESzHdwqoV3kfMVGIUvEJ0ira2lz 73SIx093zU56hh5ICv1mJr+EWEEH6gltywo9b/96zn/wC5yVH8Dfqfq/g LCo33IlHJyF/XvpIxzoy+oMcTF2sPuo6hgcpR1bfmser/4O0GVax2XOyu t8ncZCB4MhKuZNl/A0Ef3FNsocJVpWENxxUQTrgvytCQTvLfI/uQBRoAN BfpBe7dHvPnwrAYFyQDMaRw4zdl5D6BtIhQgx6GCnu1g/uBBh+APUOfuL A==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="281231847" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="281231847" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:24:41 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567967920" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:24:30 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 05/14] mm/memfd: Introduce MFD_INACCESSIBLE flag Date: Wed, 6 Jul 2022 16:20:07 +0800 Message-Id: <20220706082016.2603916-6-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Introduce a new memfd_create() flag indicating the content of the created memfd is inaccessible from userspace through ordinary MMU access (e.g., read/write/mmap). However, the file content can be accessed via a different mechanism (e.g. KVM MMU) indirectly. It provides semantics required for KVM guest private memory support that a file descriptor with this flag set is going to be used as the source of guest memory in confidential computing environments such as Intel TDX/AMD SEV but may not be accessible from host userspace. The flag can not coexist with MFD_ALLOW_SEALING, future sealing is also impossible for a memfd created with this flag. 
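For illustration only (not part of this patch): a minimal userspace sketch of how a VMM might be expected to create such a memfd under this proposal. The MFD_INACCESSIBLE value is taken from the uapi change below and is not in released kernel headers, the "guest-mem" name is arbitrary, and the mmap() refusal relies on the shmem backing-store support added later in this series.

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* From the uapi change in this patch; not in released kernel headers. */
#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE 0x0008U
#endif

int main(void)
{
        /* MFD_ALLOW_SEALING must not be combined with MFD_INACCESSIBLE. */
        int fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_INACCESSIBLE);
        if (fd < 0) {
                perror("memfd_create");
                return 1;
        }

        /* Size the backing file; its contents stay invisible to userspace. */
        if (ftruncate(fd, 2UL * 1024 * 1024) < 0) {
                perror("ftruncate");
                return 1;
        }

        /* Ordinary MMU access is expected to be refused, e.g. mmap() fails. */
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                printf("mmap refused as expected: %s\n", strerror(errno));

        /* The fd itself would then be handed to KVM as the guest memory source. */
        close(fd);
        return 0;
}
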
Signed-off-by: Chao Peng --- include/uapi/linux/memfd.h | 1 + mm/memfd.c | 15 ++++++++++++++- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h index 7a8a26751c23..48750474b904 100644 --- a/include/uapi/linux/memfd.h +++ b/include/uapi/linux/memfd.h @@ -8,6 +8,7 @@ #define MFD_CLOEXEC 0x0001U #define MFD_ALLOW_SEALING 0x0002U #define MFD_HUGETLB 0x0004U +#define MFD_INACCESSIBLE 0x0008U /* * Huge page size encoding when MFD_HUGETLB is specified, and a huge page diff --git a/mm/memfd.c b/mm/memfd.c index 2afd898798e4..72d7139ccced 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -18,6 +18,7 @@ #include #include #include +#include #include /* @@ -262,7 +263,8 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \ + MFD_INACCESSIBLE) SYSCALL_DEFINE2(memfd_create, const char __user *, uname, @@ -284,6 +286,10 @@ SYSCALL_DEFINE2(memfd_create, return -EINVAL; } + /* Disallow sealing when MFD_INACCESSIBLE is set. */ + if (flags & MFD_INACCESSIBLE && flags & MFD_ALLOW_SEALING) + return -EINVAL; + /* length includes terminating zero */ len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); if (len <= 0) @@ -330,12 +336,19 @@ SYSCALL_DEFINE2(memfd_create, if (flags & MFD_ALLOW_SEALING) { file_seals = memfd_file_seals_ptr(file); *file_seals &= ~F_SEAL_SEAL; + } else if (flags & MFD_INACCESSIBLE) { + error = memfile_node_set_flags(file, + MEMFILE_F_USER_INACCESSIBLE); + if (error) + goto err_file; } fd_install(fd, file); kfree(name); return fd; +err_file: + fput(file); err_fd: put_unused_fd(fd); err_name: From patchwork Wed Jul 6 08:20:08 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907533 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2503ECCA473 for ; Wed, 6 Jul 2022 08:25:19 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232078AbiGFIZR (ORCPT ); Wed, 6 Jul 2022 04:25:17 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35326 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232043AbiGFIYz (ORCPT ); Wed, 6 Jul 2022 04:24:55 -0400 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 614581409F; Wed, 6 Jul 2022 01:24:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095893; x=1688631893; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=HzvDsZFYJFjfwBhl27KbOiRZ4k+htychclP9Iyjwp74=; b=ZZ36cov+7fGTA2ia0KwyVTTr91YEKZs0khS8rBZ/6WekADN6vHL/bw/y 9XPgw8V+135P63XiiYaHYrApq6Bbps3LEYF/ROr0HMK21aa6ZD0nf30Lf vryC5lmoO4FJ3MI3Trez7jPHAeq/y2KZPt7PQJfAoYEdfFfC8u5VIIPLv k25YASpUaqzZhCsjh3FpHYR5r4w8feYXYUPNehgBBw1AynstUChhzPdvf AhqBvz3zcifYclbkTlrD4Fj/HYgaCI4r788bjMyIPltGv8dSAWIEVTPdD R2U4efFrQyVzzpwDzCitlWAJJOphBG07SSq7Ab0D7okjhZBCduCbVevgQ A==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="347665821" X-IronPort-AV: 
E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="347665821" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga105.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:24:52 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567967955" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:24:41 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 06/14] KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS Date: Wed, 6 Jul 2022 16:20:08 +0800 Message-Id: <20220706082016.2603916-7-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org KVM_INTERNAL_MEM_SLOTS better reflects the fact those slots are not exposed to userspace and avoids confusion to real private slots that is going to be added. 
Signed-off-by: Chao Peng --- arch/mips/include/asm/kvm_host.h | 2 +- arch/x86/include/asm/kvm_host.h | 2 +- include/linux/kvm_host.h | 6 +++--- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h index 717716cc51c5..45a978c805bc 100644 --- a/arch/mips/include/asm/kvm_host.h +++ b/arch/mips/include/asm/kvm_host.h @@ -85,7 +85,7 @@ #define KVM_MAX_VCPUS 16 /* memory slots that does not exposed to userspace */ -#define KVM_PRIVATE_MEM_SLOTS 0 +#define KVM_INTERNAL_MEM_SLOTS 0 #define KVM_HALT_POLL_NS_DEFAULT 500000 diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index de5a149d0971..dae190e19fce 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -53,7 +53,7 @@ #define KVM_MAX_VCPU_IDS (KVM_MAX_VCPUS * KVM_VCPU_ID_RATIO) /* memory slots that are not exposed to userspace */ -#define KVM_PRIVATE_MEM_SLOTS 3 +#define KVM_INTERNAL_MEM_SLOTS 3 #define KVM_HALT_POLL_NS_DEFAULT 200000 diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 3b40f8d68fbb..0bdb6044e316 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -656,12 +656,12 @@ struct kvm_irq_routing_table { }; #endif -#ifndef KVM_PRIVATE_MEM_SLOTS -#define KVM_PRIVATE_MEM_SLOTS 0 +#ifndef KVM_INTERNAL_MEM_SLOTS +#define KVM_INTERNAL_MEM_SLOTS 0 #endif #define KVM_MEM_SLOTS_NUM SHRT_MAX -#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_PRIVATE_MEM_SLOTS) +#define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS) #ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu) From patchwork Wed Jul 6 08:20:09 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907534 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C96A1CCA482 for ; Wed, 6 Jul 2022 08:25:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232235AbiGFIZV (ORCPT ); Wed, 6 Jul 2022 04:25:21 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35956 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232095AbiGFIZG (ORCPT ); Wed, 6 Jul 2022 04:25:06 -0400 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D0E0D23173; Wed, 6 Jul 2022 01:25:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095903; x=1688631903; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=OGH2abmwTaIHjo/AWx1ogoPQgnuP1dOfum5JiV1obCI=; b=ehGcjz/GgtvyFtGU+FlJzR/ZLNtHSs1Am4ZYMryWoZqvHY/WeZXpdqlF MTD7UVqb6j4Mfr54mjdYnfVZdFj+d5S44oRKLRAM++W/oJConkRfxka2i 7rW4MjfchlF+QvFq1BxBZ00VvKj5NykZ5jPBCSnZ/XWJtYJVEdXePay58 xOX1+He4A0uf7kObhxRDM0kBh4KWcR1gAmFp+LvfLSXvQClJexaU+v/9G kYEIDrjJXw6i+xQirNKGWeybtWSY6/0FXciFALvrEu3WbpOH0eeYg0VEU aJI5cEwIxQLn/RLe47nayvsGHrqJ/iXjA3wAWCrmAlWCG5212jGCPa2iN A==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="284416444" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="284416444" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga102.jf.intel.com with 
ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:25:03 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567968030" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:24:52 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 07/14] KVM: Use gfn instead of hva for mmu_notifier_retry Date: Wed, 6 Jul 2022 16:20:09 +0800 Message-Id: <20220706082016.2603916-8-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Currently in mmu_notifier validate path, hva range is recorded and then checked in the mmu_notifier_retry_hva() from page fault path. However for the to be introduced private memory, a page fault may not have a hva associated, checking gfn(gpa) makes more sense. For existing non private memory case, gfn is expected to continue to work. The patch also fixes a potential bug in kvm_zap_gfn_range() which has already been using gfn when calling kvm_inc/dec_notifier_count() in current code. 
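For illustration only (not part of this patch): a simplified, self-contained sketch of the invalidate/retry handshake described above. struct mmu_sync and need_retry_gfn are made-up names standing in for the kvm fields and the mmu_notifier_retry_gfn() helper touched in the diff below.

#include <stdbool.h>
#include <stdint.h>

/*
 * The notifier side records the gfn range being invalidated and bumps a
 * sequence counter when it finishes.  The page fault path snapshots the
 * sequence before translating the fault and retries when that snapshot is
 * stale or the faulting gfn falls inside an in-progress invalidation.
 */
struct mmu_sync {
        unsigned long seq;      /* bumped when an invalidation completes */
        long count;             /* non-zero while an invalidation is in progress */
        uint64_t start;         /* start gfn of the range being invalidated */
        uint64_t end;           /* end gfn (exclusive) of that range */
};

bool need_retry_gfn(const struct mmu_sync *s, unsigned long snap_seq,
                    uint64_t gfn)
{
        /* An in-flight invalidation covers the faulting gfn. */
        if (s->count && gfn >= s->start && gfn < s->end)
                return true;
        /* An invalidation completed after the fault path took its snapshot. */
        return s->seq != snap_seq;
}
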
Signed-off-by: Chao Peng --- arch/x86/kvm/mmu/mmu.c | 2 +- include/linux/kvm_host.h | 18 ++++++++---------- virt/kvm/kvm_main.c | 6 +++--- 3 files changed, 12 insertions(+), 14 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index f7fa4c31b7c5..0d882fad4bc1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu, return true; return fault->slot && - mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, fault->hva); + mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn); } static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 0bdb6044e316..e9153b54e2a4 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -767,8 +767,8 @@ struct kvm { struct mmu_notifier mmu_notifier; unsigned long mmu_notifier_seq; long mmu_notifier_count; - unsigned long mmu_notifier_range_start; - unsigned long mmu_notifier_range_end; + gfn_t mmu_notifier_range_start; + gfn_t mmu_notifier_range_end; #endif struct list_head devices; u64 manual_dirty_log_protect; @@ -1362,10 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc); void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); #endif -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end); -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end); +void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end); +void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end); long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); @@ -1923,9 +1921,9 @@ static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq) return 0; } -static inline int mmu_notifier_retry_hva(struct kvm *kvm, +static inline int mmu_notifier_retry_gfn(struct kvm *kvm, unsigned long mmu_seq, - unsigned long hva) + gfn_t gfn) { lockdep_assert_held(&kvm->mmu_lock); /* @@ -1935,8 +1933,8 @@ static inline int mmu_notifier_retry_hva(struct kvm *kvm, * positives, due to shortcuts when handing concurrent invalidations. 
*/ if (unlikely(kvm->mmu_notifier_count) && - hva >= kvm->mmu_notifier_range_start && - hva < kvm->mmu_notifier_range_end) + gfn >= kvm->mmu_notifier_range_start && + gfn < kvm->mmu_notifier_range_end) return 1; if (kvm->mmu_notifier_seq != mmu_seq) return 1; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index da263c370d00..4d7f0e72366f 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -536,8 +536,7 @@ static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn, typedef bool (*hva_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range); -typedef void (*on_lock_fn_t)(struct kvm *kvm, unsigned long start, - unsigned long end); +typedef void (*on_lock_fn_t)(struct kvm *kvm, gfn_t start, gfn_t end); typedef void (*on_unlock_fn_t)(struct kvm *kvm); @@ -624,7 +623,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, locked = true; KVM_MMU_LOCK(kvm); if (!IS_KVM_NULL_FN(range->on_lock)) - range->on_lock(kvm, range->start, range->end); + range->on_lock(kvm, gfn_range.start, + gfn_range.end); if (IS_KVM_NULL_FN(range->handler)) break; } From patchwork Wed Jul 6 08:20:10 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907535 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B1F59CCA482 for ; Wed, 6 Jul 2022 08:26:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232222AbiGFIZ7 (ORCPT ); Wed, 6 Jul 2022 04:25:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36214 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231969AbiGFIZl (ORCPT ); Wed, 6 Jul 2022 04:25:41 -0400 Received: from mga17.intel.com (mga17.intel.com [192.55.52.151]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A8F462495A; Wed, 6 Jul 2022 01:25:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095919; x=1688631919; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=sS9EoSSO28MiiP0hU+aCuuctHY7SRSWTE16kACa1rxg=; b=jTmD8Mnpq61xTJAQIfix/D9GmxN3U2BeAz42ZUpnialYbdqtBSx9xOc0 yCpGoNzItPtD0PMJNsT0OPhurp/jamHVtWTWH6FmhBlpH6MRVfsGFf1X8 jn3HemiQR8f6nzFVmKIuBtLQES0FOYBWs9wbxDCE1ytKdxDKflYwYViFZ CT/CKm47tmfdfhk4M4MZwTCz0b3a/pKYGD3A9D1wI4rRYQgPhmw2Kbcts hKcBBrIT8up4HY83tNldOyQ1jjC7ql+nR//xDcM+TSoPWLzN/buPeDma4 kLPTZl0T8jYCTXwT+GKu3kh7H103KUNRTQ9do3DQkmdTMSlaUi/xKtl9D A==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="264100156" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="264100156" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga107.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:25:15 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567968092" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:25:03 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel 
, Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 08/14] KVM: Rename mmu_notifier_* Date: Wed, 6 Jul 2022 16:20:10 +0800 Message-Id: <20220706082016.2603916-9-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org The sync mechanism between mmu_notifier and page fault handler employs fields mmu_notifier_seq/count and mmu_notifier_range_start/end. For the to be added private memory, there is the same mechanism needed but not rely on mmu_notifier (It uses new introduced memfile_notifier). This patch renames the existing fields and related helper functions to a neutral name mmu_updating_* so private memory can reuse. No functional change intended. Signed-off-by: Chao Peng --- arch/arm64/kvm/mmu.c | 8 ++--- arch/mips/kvm/mmu.c | 10 +++--- arch/powerpc/include/asm/kvm_book3s_64.h | 2 +- arch/powerpc/kvm/book3s_64_mmu_host.c | 4 +-- arch/powerpc/kvm/book3s_64_mmu_hv.c | 4 +-- arch/powerpc/kvm/book3s_64_mmu_radix.c | 6 ++-- arch/powerpc/kvm/book3s_hv_nested.c | 2 +- arch/powerpc/kvm/book3s_hv_rm_mmu.c | 8 ++--- arch/powerpc/kvm/e500_mmu_host.c | 4 +-- arch/riscv/kvm/mmu.c | 4 +-- arch/x86/kvm/mmu/mmu.c | 14 ++++---- arch/x86/kvm/mmu/paging_tmpl.h | 4 +-- include/linux/kvm_host.h | 38 ++++++++++----------- virt/kvm/kvm_main.c | 42 +++++++++++------------- virt/kvm/pfncache.c | 14 ++++---- 15 files changed, 81 insertions(+), 83 deletions(-) diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c index 87f1cd0df36e..7ee6fafc24ee 100644 --- a/arch/arm64/kvm/mmu.c +++ b/arch/arm64/kvm/mmu.c @@ -993,7 +993,7 @@ transparent_hugepage_adjust(struct kvm *kvm, struct kvm_memory_slot *memslot, * THP doesn't start to split while we are adjusting the * refcounts. 
* - * We are sure this doesn't happen, because mmu_notifier_retry + * We are sure this doesn't happen, because mmu_updating_retry * was successful and we are holding the mmu_lock, so if this * THP is trying to split, it will be blocked in the mmu * notifier before touching any of the pages, specifically @@ -1188,9 +1188,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, return ret; } - mmu_seq = vcpu->kvm->mmu_notifier_seq; + mmu_seq = vcpu->kvm->mmu_updating_seq; /* - * Ensure the read of mmu_notifier_seq happens before we call + * Ensure the read of mmu_updating_seq happens before we call * gfn_to_pfn_prot (which calls get_user_pages), so that we don't risk * the page we just got a reference to gets unmapped before we have a * chance to grab the mmu_lock, which ensure that if the page gets @@ -1246,7 +1246,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa, else write_lock(&kvm->mmu_lock); pgt = vcpu->arch.hw_mmu->pgt; - if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) goto out_unlock; /* diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c index 1bfd1b501d82..abd468c6a749 100644 --- a/arch/mips/kvm/mmu.c +++ b/arch/mips/kvm/mmu.c @@ -615,17 +615,17 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa, * Used to check for invalidations in progress, of the pfn that is * returned by pfn_to_pfn_prot below. */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; /* - * Ensure the read of mmu_notifier_seq isn't reordered with PTE reads in + * Ensure the read of mmu_updating_seq isn't reordered with PTE reads in * gfn_to_pfn_prot() (which calls get_user_pages()), so that we don't * risk the page we get a reference to getting unmapped before we have a - * chance to grab the mmu_lock without mmu_notifier_retry() noticing. + * chance to grab the mmu_lock without mmu_updating_retry () noticing. * * This smp_rmb() pairs with the effective smp_wmb() of the combination * of the pte_unmap_unlock() after the PTE is zapped, and the * spin_lock() in kvm_mmu_notifier_invalidate_() before - * mmu_notifier_seq is incremented. + * mmu_updating_seq is incremented. 
*/ smp_rmb(); @@ -638,7 +638,7 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa, spin_lock(&kvm->mmu_lock); /* Check if an invalidation has taken place since we got pfn */ - if (mmu_notifier_retry(kvm, mmu_seq)) { + if (mmu_updating_retry(kvm, mmu_seq)) { /* * This can happen when mappings are changed asynchronously, but * also synchronously if a COW is triggered by diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h index 4def2bd17b9b..4d35fb913de5 100644 --- a/arch/powerpc/include/asm/kvm_book3s_64.h +++ b/arch/powerpc/include/asm/kvm_book3s_64.h @@ -666,7 +666,7 @@ static inline pte_t *find_kvm_host_pte(struct kvm *kvm, unsigned long mmu_seq, VM_WARN(!spin_is_locked(&kvm->mmu_lock), "%s called with kvm mmu_lock not held \n", __func__); - if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) return NULL; pte = __find_linux_pte(kvm->mm->pgd, ea, NULL, hshift); diff --git a/arch/powerpc/kvm/book3s_64_mmu_host.c b/arch/powerpc/kvm/book3s_64_mmu_host.c index 1ae09992c9ea..78f1aae8cb60 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_host.c +++ b/arch/powerpc/kvm/book3s_64_mmu_host.c @@ -90,7 +90,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte, unsigned long pfn; /* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); /* Get host physical address for gpa */ @@ -151,7 +151,7 @@ int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct kvmppc_pte *orig_pte, cpte = kvmppc_mmu_hpte_cache_next(vcpu); spin_lock(&kvm->mmu_lock); - if (!cpte || mmu_notifier_retry(kvm, mmu_seq)) { + if (!cpte || mmu_updating_retry(kvm, mmu_seq)) { r = -EAGAIN; goto out_unlock; } diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c index 514fd45c1994..bcdec6a6f2a7 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c @@ -578,7 +578,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu, return -EFAULT; /* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); ret = -EFAULT; @@ -693,7 +693,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu, /* Check if we might have been invalidated; let the guest retry if so */ ret = RESUME_GUEST; - if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) { + if (mmu_updating_retry(vcpu->kvm, mmu_seq)) { unlock_rmap(rmap); goto out_unlock; } diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c index 42851c32ff3b..c8890ccc3f40 100644 --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c @@ -639,7 +639,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte, /* Check if we might have been invalidated; let the guest retry if so */ spin_lock(&kvm->mmu_lock); ret = -EAGAIN; - if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) goto out_unlock; /* Now traverse again under the lock and change the tree */ @@ -829,7 +829,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu, bool large_enable; /* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); /* @@ -1190,7 +1190,7 @@ void kvmppc_radix_flush_memslot(struct kvm *kvm, * Increase the mmu notifier sequence number to prevent any page * fault that read the memslot earlier from writing a PTE. 
*/ - kvm->mmu_notifier_seq++; + kvm->mmu_updating_seq++; spin_unlock(&kvm->mmu_lock); } diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c index 0644732d1a25..09f841f730da 100644 --- a/arch/powerpc/kvm/book3s_hv_nested.c +++ b/arch/powerpc/kvm/book3s_hv_nested.c @@ -1579,7 +1579,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu, /* 2. Find the host pte for this L1 guest real address */ /* Used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); /* See if can find translation in our partition scoped tables for L1 */ diff --git a/arch/powerpc/kvm/book3s_hv_rm_mmu.c b/arch/powerpc/kvm/book3s_hv_rm_mmu.c index 2257fb18cb72..952b504dc98a 100644 --- a/arch/powerpc/kvm/book3s_hv_rm_mmu.c +++ b/arch/powerpc/kvm/book3s_hv_rm_mmu.c @@ -219,7 +219,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, g_ptel = ptel; /* used later to detect if we might have been invalidated */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); /* Find the memslot (if any) for this address */ @@ -366,7 +366,7 @@ long kvmppc_do_h_enter(struct kvm *kvm, unsigned long flags, rmap = real_vmalloc_addr(rmap); lock_rmap(rmap); /* Check for pending invalidations under the rmap chain lock */ - if (mmu_notifier_retry(kvm, mmu_seq)) { + if (mmu_updating_retry(kvm, mmu_seq)) { /* inval in progress, write a non-present HPTE */ pteh |= HPTE_V_ABSENT; pteh &= ~HPTE_V_VALID; @@ -932,7 +932,7 @@ static long kvmppc_do_h_page_init_zero(struct kvm_vcpu *vcpu, int i; /* Used later to detect if we might have been invalidated */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock); @@ -960,7 +960,7 @@ static long kvmppc_do_h_page_init_copy(struct kvm_vcpu *vcpu, long ret = H_SUCCESS; /* Used later to detect if we might have been invalidated */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); arch_spin_lock(&kvm->mmu_lock.rlock.raw_lock); diff --git a/arch/powerpc/kvm/e500_mmu_host.c b/arch/powerpc/kvm/e500_mmu_host.c index 7f16afc331ef..d7636b926f25 100644 --- a/arch/powerpc/kvm/e500_mmu_host.c +++ b/arch/powerpc/kvm/e500_mmu_host.c @@ -339,7 +339,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, unsigned long flags; /* used to check for invalidations in progress */ - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); /* @@ -460,7 +460,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500, } spin_lock(&kvm->mmu_lock); - if (mmu_notifier_retry(kvm, mmu_seq)) { + if (mmu_updating_retry(kvm, mmu_seq)) { ret = -EAGAIN; goto out; } diff --git a/arch/riscv/kvm/mmu.c b/arch/riscv/kvm/mmu.c index 081f8d2b9cf3..a7db374d3861 100644 --- a/arch/riscv/kvm/mmu.c +++ b/arch/riscv/kvm/mmu.c @@ -654,7 +654,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu, return ret; } - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; hfn = gfn_to_pfn_prot(kvm, gfn, is_write, &writeable); if (hfn == KVM_PFN_ERR_HWPOISON) { @@ -674,7 +674,7 @@ int kvm_riscv_gstage_map(struct kvm_vcpu *vcpu, spin_lock(&kvm->mmu_lock); - if (mmu_notifier_retry(kvm, mmu_seq)) + if (mmu_updating_retry(kvm, mmu_seq)) goto out_unlock; if (writeable) { diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 0d882fad4bc1..545eb74305fe 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -2908,7 
+2908,7 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) * If addresses are being invalidated, skip prefetching to avoid * accidentally prefetching those addresses. */ - if (unlikely(vcpu->kvm->mmu_notifier_count)) + if (unlikely(vcpu->kvm->mmu_updating_count)) return; __direct_pte_prefetch(vcpu, sp, sptep); @@ -2950,7 +2950,7 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, /* * Lookup the mapping level in the current mm. The information * may become stale soon, but it is safe to use as long as - * 1) mmu_notifier_retry was checked after taking mmu_lock, and + * 1) mmu_updating_retry was checked after taking mmu_lock, and * 2) mmu_lock is taken now. * * We still need to disable IRQs to prevent concurrent tear down @@ -3035,7 +3035,7 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault return; /* - * mmu_notifier_retry() was successful and mmu_lock is held, so + * mmu_updating_retry was successful and mmu_lock is held, so * the pmd can't be split from under us. */ fault->goal_level = fault->req_level; @@ -4182,7 +4182,7 @@ static bool is_page_fault_stale(struct kvm_vcpu *vcpu, return true; return fault->slot && - mmu_notifier_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn); + mmu_updating_retry_gfn(vcpu->kvm, mmu_seq, fault->gfn); } static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) @@ -4206,7 +4206,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault if (r) return r; - mmu_seq = vcpu->kvm->mmu_notifier_seq; + mmu_seq = vcpu->kvm->mmu_updating_seq; smp_rmb(); r = kvm_faultin_pfn(vcpu, fault); @@ -6023,7 +6023,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) write_lock(&kvm->mmu_lock); - kvm_inc_notifier_count(kvm, gfn_start, gfn_end); + kvm_mmu_updating_begin(kvm, gfn_start, gfn_end); flush = __kvm_zap_rmaps(kvm, gfn_start, gfn_end); @@ -6037,7 +6037,7 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end) kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end - gfn_start); - kvm_dec_notifier_count(kvm, gfn_start, gfn_end); + kvm_mmu_updating_end(kvm, gfn_start, gfn_end); write_unlock(&kvm->mmu_lock); } diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 2448fa8d8438..acf7e41aa02b 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -589,7 +589,7 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw, * If addresses are being invalidated, skip prefetching to avoid * accidentally prefetching those addresses. 
*/ - if (unlikely(vcpu->kvm->mmu_notifier_count)) + if (unlikely(vcpu->kvm->mmu_updating_count)) return; if (sp->role.direct) @@ -838,7 +838,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault else fault->max_level = walker.level; - mmu_seq = vcpu->kvm->mmu_notifier_seq; + mmu_seq = vcpu->kvm->mmu_updating_seq; smp_rmb(); r = kvm_faultin_pfn(vcpu, fault); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index e9153b54e2a4..c262ebb168a7 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -765,10 +765,10 @@ struct kvm { #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) struct mmu_notifier mmu_notifier; - unsigned long mmu_notifier_seq; - long mmu_notifier_count; - gfn_t mmu_notifier_range_start; - gfn_t mmu_notifier_range_end; + unsigned long mmu_updating_seq; + long mmu_updating_count; + gfn_t mmu_updating_range_start; + gfn_t mmu_updating_range_end; #endif struct list_head devices; u64 manual_dirty_log_protect; @@ -1362,8 +1362,8 @@ void kvm_mmu_free_memory_cache(struct kvm_mmu_memory_cache *mc); void *kvm_mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); #endif -void kvm_inc_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end); -void kvm_dec_notifier_count(struct kvm *kvm, gfn_t start, gfn_t end); +void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end); +void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end); long kvm_arch_dev_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg); @@ -1901,42 +1901,42 @@ extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) -static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq) +static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq) { - if (unlikely(kvm->mmu_notifier_count)) + if (unlikely(kvm->mmu_updating_count)) return 1; /* - * Ensure the read of mmu_notifier_count happens before the read - * of mmu_notifier_seq. This interacts with the smp_wmb() in + * Ensure the read of mmu_updating_count happens before the read + * of mmu_updating_seq. This interacts with the smp_wmb() in * mmu_notifier_invalidate_range_end to make sure that the caller - * either sees the old (non-zero) value of mmu_notifier_count or - * the new (incremented) value of mmu_notifier_seq. + * either sees the old (non-zero) value of mmu_updating_count or + * the new (incremented) value of mmu_updating_seq. * PowerPC Book3s HV KVM calls this under a per-page lock * rather than under kvm->mmu_lock, for scalability, so * can't rely on kvm->mmu_lock to keep things ordered. */ smp_rmb(); - if (kvm->mmu_notifier_seq != mmu_seq) + if (kvm->mmu_updating_seq != mmu_seq) return 1; return 0; } -static inline int mmu_notifier_retry_gfn(struct kvm *kvm, +static inline int mmu_updating_retry_gfn(struct kvm *kvm, unsigned long mmu_seq, gfn_t gfn) { lockdep_assert_held(&kvm->mmu_lock); /* - * If mmu_notifier_count is non-zero, then the range maintained by + * If mmu_updating_count is non-zero, then the range maintained by * kvm_mmu_notifier_invalidate_range_start contains all addresses that * might be being invalidated. Note that it may include some false * positives, due to shortcuts when handing concurrent invalidations. 
*/ - if (unlikely(kvm->mmu_notifier_count) && - gfn >= kvm->mmu_notifier_range_start && - gfn < kvm->mmu_notifier_range_end) + if (unlikely(kvm->mmu_updating_count) && + gfn >= kvm->mmu_updating_range_start && + gfn < kvm->mmu_updating_range_end) return 1; - if (kvm->mmu_notifier_seq != mmu_seq) + if (kvm->mmu_updating_seq != mmu_seq) return 1; return 0; } diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 4d7f0e72366f..3ae4944b9f15 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -698,30 +698,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn, /* * .change_pte() must be surrounded by .invalidate_range_{start,end}(). - * If mmu_notifier_count is zero, then no in-progress invalidations, + * If mmu_updating_count is zero, then no in-progress invalidations, * including this one, found a relevant memslot at start(); rechecking * memslots here is unnecessary. Note, a false positive (count elevated * by a different invalidation) is sub-optimal but functionally ok. */ WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); - if (!READ_ONCE(kvm->mmu_notifier_count)) + if (!READ_ONCE(kvm->mmu_updating_count)) return; kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn); } -void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end) +void kvm_mmu_updating_begin(struct kvm *kvm, gfn_t start, gfn_t end) { /* * The count increase must become visible at unlock time as no * spte can be established without taking the mmu_lock and * count is also read inside the mmu_lock critical section. */ - kvm->mmu_notifier_count++; - if (likely(kvm->mmu_notifier_count == 1)) { - kvm->mmu_notifier_range_start = start; - kvm->mmu_notifier_range_end = end; + kvm->mmu_updating_count++; + if (likely(kvm->mmu_updating_count == 1)) { + kvm->mmu_updating_range_start = start; + kvm->mmu_updating_range_end = end; } else { /* * Fully tracking multiple concurrent ranges has diminishing @@ -732,10 +731,10 @@ void kvm_inc_notifier_count(struct kvm *kvm, unsigned long start, * accumulate and persist until all outstanding invalidates * complete. */ - kvm->mmu_notifier_range_start = - min(kvm->mmu_notifier_range_start, start); - kvm->mmu_notifier_range_end = - max(kvm->mmu_notifier_range_end, end); + kvm->mmu_updating_range_start = + min(kvm->mmu_updating_range_start, start); + kvm->mmu_updating_range_end = + max(kvm->mmu_updating_range_end, end); } } @@ -748,7 +747,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, .end = range->end, .pte = __pte(0), .handler = kvm_unmap_gfn_range, - .on_lock = kvm_inc_notifier_count, + .on_lock = kvm_mmu_updating_begin, .on_unlock = kvm_arch_guest_memory_reclaimed, .flush_on_ret = true, .may_block = mmu_notifier_range_blockable(range), @@ -759,7 +758,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, /* * Prevent memslot modification between range_start() and range_end() * so that conditionally locking provides the same result in both - * functions. Without that guarantee, the mmu_notifier_count + * functions. Without that guarantee, the mmu_updating_count * adjustments will be imbalanced. * * Pairs with the decrement in range_end(). @@ -775,7 +774,7 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, * any given time, and the caches themselves can check for hva overlap, * i.e. don't need to rely on memslot overlap checks for performance. 
* Because this runs without holding mmu_lock, the pfn caches must use - * mn_active_invalidate_count (see above) instead of mmu_notifier_count. + * mn_active_invalidate_count (see above) instead of mmu_updating_count. */ gfn_to_pfn_cache_invalidate_start(kvm, range->start, range->end, hva_range.may_block); @@ -785,22 +784,21 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn, return 0; } -void kvm_dec_notifier_count(struct kvm *kvm, unsigned long start, - unsigned long end) +void kvm_mmu_updating_end(struct kvm *kvm, gfn_t start, gfn_t end) { /* * This sequence increase will notify the kvm page fault that * the page that is going to be mapped in the spte could have * been freed. */ - kvm->mmu_notifier_seq++; + kvm->mmu_updating_seq++; smp_wmb(); /* * The above sequence increase must be visible before the * below count decrease, which is ensured by the smp_wmb above - * in conjunction with the smp_rmb in mmu_notifier_retry(). + * in conjunction with the smp_rmb in mmu_updating_retry(). */ - kvm->mmu_notifier_count--; + kvm->mmu_updating_count--; } static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, @@ -812,7 +810,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, .end = range->end, .pte = __pte(0), .handler = (void *)kvm_null_fn, - .on_lock = kvm_dec_notifier_count, + .on_lock = kvm_mmu_updating_end, .on_unlock = (void *)kvm_null_fn, .flush_on_ret = false, .may_block = mmu_notifier_range_blockable(range), @@ -833,7 +831,7 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn, if (wake) rcuwait_wake_up(&kvm->mn_memslots_update_rcuwait); - BUG_ON(kvm->mmu_notifier_count < 0); + BUG_ON(kvm->mmu_updating_count < 0); } static int kvm_mmu_notifier_clear_flush_young(struct mmu_notifier *mn, diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c index ab519f72f2cd..aa6d24966a76 100644 --- a/virt/kvm/pfncache.c +++ b/virt/kvm/pfncache.c @@ -112,27 +112,27 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s { /* * mn_active_invalidate_count acts for all intents and purposes - * like mmu_notifier_count here; but the latter cannot be used + * like mmu_updating_count here; but the latter cannot be used * here because the invalidation of caches in the mmu_notifier - * event occurs _before_ mmu_notifier_count is elevated. + * event occurs _before_ mmu_updating_count is elevated. * * Note, it does not matter that mn_active_invalidate_count * is not protected by gpc->lock. It is guaranteed to * be elevated before the mmu_notifier acquires gpc->lock, and - * isn't dropped until after mmu_notifier_seq is updated. + * isn't dropped until after mmu_updating_seq is updated. */ if (kvm->mn_active_invalidate_count) return true; /* * Ensure mn_active_invalidate_count is read before - * mmu_notifier_seq. This pairs with the smp_wmb() in + * mmu_updating_seq. This pairs with the smp_wmb() in * mmu_notifier_invalidate_range_end() to guarantee either the * old (non-zero) value of mn_active_invalidate_count or the - * new (incremented) value of mmu_notifier_seq is observed. + * new (incremented) value of mmu_updating_seq is observed. 
*/ smp_rmb(); - return kvm->mmu_notifier_seq != mmu_seq; + return kvm->mmu_updating_seq != mmu_seq; } static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) @@ -155,7 +155,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, struct gfn_to_pfn_cache *gpc) gpc->valid = false; do { - mmu_seq = kvm->mmu_notifier_seq; + mmu_seq = kvm->mmu_updating_seq; smp_rmb(); write_unlock_irq(&gpc->lock); From patchwork Wed Jul 6 08:20:11 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907536 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8C7C4C433EF for ; Wed, 6 Jul 2022 08:26:35 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232286AbiGFI0e (ORCPT ); Wed, 6 Jul 2022 04:26:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36106 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232107AbiGFIZx (ORCPT ); Wed, 6 Jul 2022 04:25:53 -0400 Received: from mga04.intel.com (mga04.intel.com [192.55.52.120]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 26B2324BCE; Wed, 6 Jul 2022 01:25:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095930; x=1688631930; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=weaxkp9/xqFnQWhYd4tsfdzie5zGF32jOb8m6XNOVC4=; b=NcebVdNpV0G0J5Saspwt23A+aErfXnNq9LK6ibS0G5hFrhjI9utRymcj +pH4MaPi3HMkFAV9ZZ6hJAJgJYUhh3+9iq5KRQdSVHw2vngGzpjnW0hBr nUEewWsRfL/bi8pzL+ieI2Hw1LsdQlFvhubx2/ReHgkPaMYd/UTsmu3k0 hKZe+p0eULR1Lb2J9Y9ZNftGqfF+CmW1qPxs0aixY/8F6vo/lirg71EPP uQB3A2YcdrZkxwrrurkOWKm56Q3C701t4Amp7Ug4R3k69DeAFqT6bPd+z e2oy9hTH/HLuLEQww+66oHJuVziMHYSquWWsac6l391tVtRIJ0rqq+rq8 g==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="282433454" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="282433454" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga104.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:25:29 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567968152" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:25:15 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 09/14] KVM: Extend the memslot to support fd-based private memory Date: Wed, 6 Jul 2022 16:20:11 +0800 Message-Id: <20220706082016.2603916-10-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Extend the memslot definition to provide guest private memory through a file descriptor(fd) instead of userspace_addr(hva). Such guest private memory(fd) may never be mapped into userspace so no userspace_addr(hva) can be used. Instead add another two new fields (private_fd/private_offset), plus the existing memory_size to represent the private memory range. Such memslot can still have the existing userspace_addr(hva). When use, a single memslot can maintain both private memory through private fd(private_fd/private_offset) and shared memory through hva(userspace_addr). Whether the private or shared part is effective for a guest GPA is maintained by other KVM code. Since there is no userspace mapping for private fd so we cannot rely on get_user_pages() to get the pfn in KVM, instead we add a new memfile_notifier in the memslot and rely on it to get pfn by interacting the callbacks from memory backing store with the fd/offset. This new extension is indicated by a new flag KVM_MEM_PRIVATE. At compile time, a new config HAVE_KVM_PRIVATE_MEM is added and right now it is selected on X86_64 for Intel TDX usage. To make KVM easy, internally we use a binary compatible alias struct kvm_user_mem_region to handle both the normal and the '_ext' variants. Co-developed-by: Yu Zhang Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- Documentation/virt/kvm/api.rst | 38 ++++++++++++++++---- arch/x86/kvm/Kconfig | 2 ++ arch/x86/kvm/x86.c | 2 +- include/linux/kvm_host.h | 13 +++++-- include/uapi/linux/kvm.h | 28 +++++++++++++++ virt/kvm/Kconfig | 3 ++ virt/kvm/kvm_main.c | 64 +++++++++++++++++++++++++++++----- 7 files changed, 132 insertions(+), 18 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index bafaeedd455c..4f27c973a952 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -1319,7 +1319,7 @@ yet and must be cleared on entry. :Capability: KVM_CAP_USER_MEMORY :Architectures: all :Type: vm ioctl -:Parameters: struct kvm_userspace_memory_region (in) +:Parameters: struct kvm_userspace_memory_region(_ext) (in) :Returns: 0 on success, -1 on error :: @@ -1332,9 +1332,18 @@ yet and must be cleared on entry. __u64 userspace_addr; /* start of the userspace allocated memory */ }; + struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; + /* for kvm_memory_region::flags */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) + #define KVM_MEM_PRIVATE (1UL << 2) This ioctl allows the user to create, modify or delete a guest physical memory slot. Bits 0-15 of "slot" specify the slot id and this value @@ -1365,12 +1374,27 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr be identical. 
This allows large pages in the guest to be backed by large pages in the host. -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and -KVM_MEM_READONLY. The former can be set to instruct KVM to keep track of -writes to memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to -use it. The latter can be set, if KVM_CAP_READONLY_MEM capability allows it, -to make a new slot read-only. In this case, writes to this memory will be -posted to userspace as KVM_EXIT_MMIO exits. +kvm_userspace_memory_region_ext includes all the kvm_userspace_memory_region +fields. It also includes additional fields for some specific features. See +below description of flags field for more information. It's recommended to use +kvm_userspace_memory_region_ext in new userspace code. + +The flags field supports below flags: + +- KVM_MEM_LOG_DIRTY_PAGES can be set to instruct KVM to keep track of writes to + memory within the slot. See KVM_GET_DIRTY_LOG ioctl to know how to use it. + +- KVM_MEM_READONLY can be set, if KVM_CAP_READONLY_MEM capability allows it, to + make a new slot read-only. In this case, writes to this memory will be posted + to userspace as KVM_EXIT_MMIO exits. + +- KVM_MEM_PRIVATE can be set to indicate a new slot has private memory backed by + a file descirptor(fd) and the content of the private memory is invisible to + userspace. In this case, userspace should use private_fd/private_offset in + kvm_userspace_memory_region_ext to instruct KVM to provide private memory to + guest. Userspace should guarantee not to map the same pfn indicated by + private_fd/private_offset to different gfns with multiple memslots. Failed to + do this may result undefined behavior. When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of the memory region are automatically reflected into the guest. For example, an diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index e3cbd7706136..1f160801e2a7 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -48,6 +48,8 @@ config KVM select SRCU select INTERVAL_TREE select HAVE_KVM_PM_NOTIFIER if PM + select HAVE_KVM_PRIVATE_MEM if X86_64 + select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM help Support hosting fully virtualized guest machines using hardware virtualization extensions. 
You will need a fairly recent diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 567d13405445..77d16b90045c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12154,7 +12154,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, } for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { - struct kvm_userspace_memory_region m; + struct kvm_user_mem_region m; m.slot = id | (i << 16); m.flags = 0; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index c262ebb168a7..1b203c8aa696 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -44,6 +44,7 @@ #include #include +#include #ifndef KVM_MAX_VCPU_IDS #define KVM_MAX_VCPU_IDS KVM_MAX_VCPUS @@ -576,8 +577,16 @@ struct kvm_memory_slot { u32 flags; short id; u16 as_id; + struct file *private_file; + loff_t private_offset; + struct memfile_notifier notifier; }; +static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) +{ + return slot && (slot->flags & KVM_MEM_PRIVATE); +} + static inline bool kvm_slot_dirty_track_enabled(const struct kvm_memory_slot *slot) { return slot->flags & KVM_MEM_LOG_DIRTY_PAGES; @@ -1109,9 +1118,9 @@ enum kvm_mr_change { }; int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_user_mem_region *mem); int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem); + const struct kvm_user_mem_region *mem); void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); void kvm_arch_memslots_updated(struct kvm *kvm, u64 gen); int kvm_arch_prepare_memory_region(struct kvm *kvm, diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index a36e78710382..c467c69b7ad7 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -103,6 +103,33 @@ struct kvm_userspace_memory_region { __u64 userspace_addr; /* start of the userspace allocated memory */ }; +struct kvm_userspace_memory_region_ext { + struct kvm_userspace_memory_region region; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; + +#ifdef __KERNEL__ +/* + * kvm_user_mem_region is a kernel-only alias of kvm_userspace_memory_region_ext + * that "unpacks" kvm_userspace_memory_region so that KVM can directly access + * all fields from the top-level "extended" region. 
+ */ +struct kvm_user_mem_region { + __u32 slot; + __u32 flags; + __u64 guest_phys_addr; + __u64 memory_size; + __u64 userspace_addr; + __u64 private_offset; + __u32 private_fd; + __u32 pad1; + __u64 pad2[14]; +}; +#endif + /* * The bit 0 ~ bit 15 of kvm_memory_region::flags are visible for userspace, * other bits are reserved for kvm internal use which are defined in @@ -110,6 +137,7 @@ struct kvm_userspace_memory_region { */ #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0) #define KVM_MEM_READONLY (1UL << 1) +#define KVM_MEM_PRIVATE (1UL << 2) /* for KVM_IRQ_LINE */ struct kvm_irq_level { diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index a8c5c9f06b3c..ccaff13cc5b8 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -72,3 +72,6 @@ config KVM_XFER_TO_GUEST_WORK config HAVE_KVM_PM_NOTIFIER bool + +config HAVE_KVM_PRIVATE_MEM + bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 3ae4944b9f15..230c8ff9659c 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1508,7 +1508,7 @@ static void kvm_replace_memslot(struct kvm *kvm, } } -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem) +static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; @@ -1902,7 +1902,7 @@ static bool kvm_check_memslot_overlap(struct kvm_memslots *slots, int id, * Must be called holding kvm->slots_lock for write. */ int __kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_user_mem_region *mem) { struct kvm_memory_slot *old, *new; struct kvm_memslots *slots; @@ -2006,7 +2006,7 @@ int __kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(__kvm_set_memory_region); int kvm_set_memory_region(struct kvm *kvm, - const struct kvm_userspace_memory_region *mem) + const struct kvm_user_mem_region *mem) { int r; @@ -2018,7 +2018,7 @@ int kvm_set_memory_region(struct kvm *kvm, EXPORT_SYMBOL_GPL(kvm_set_memory_region); static int kvm_vm_ioctl_set_memory_region(struct kvm *kvm, - struct kvm_userspace_memory_region *mem) + struct kvm_user_mem_region *mem) { if ((u16)mem->slot >= KVM_USER_MEM_SLOTS) return -EINVAL; @@ -4608,6 +4608,33 @@ static int kvm_vm_ioctl_get_stats_fd(struct kvm *kvm) return fd; } +#define SANITY_CHECK_MEM_REGION_FIELD(field) \ +do { \ + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ + offsetof(struct kvm_userspace_memory_region, field)); \ + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ + sizeof_field(struct kvm_userspace_memory_region, field)); \ +} while (0) + +#define SANITY_CHECK_MEM_REGION_EXT_FIELD(field) \ +do { \ + BUILD_BUG_ON(offsetof(struct kvm_user_mem_region, field) != \ + offsetof(struct kvm_userspace_memory_region_ext, field)); \ + BUILD_BUG_ON(sizeof_field(struct kvm_user_mem_region, field) != \ + sizeof_field(struct kvm_userspace_memory_region_ext, field)); \ +} while (0) + +static void kvm_sanity_check_user_mem_region_alias(void) +{ + SANITY_CHECK_MEM_REGION_FIELD(slot); + SANITY_CHECK_MEM_REGION_FIELD(flags); + SANITY_CHECK_MEM_REGION_FIELD(guest_phys_addr); + SANITY_CHECK_MEM_REGION_FIELD(memory_size); + SANITY_CHECK_MEM_REGION_FIELD(userspace_addr); + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_offset); + SANITY_CHECK_MEM_REGION_EXT_FIELD(private_fd); +} + static long kvm_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) { @@ -4631,14 +4658,35 @@ static long kvm_vm_ioctl(struct file *filp, break; } case KVM_SET_USER_MEMORY_REGION: { - struct 
kvm_userspace_memory_region kvm_userspace_mem; + struct kvm_user_mem_region mem; + unsigned long size; + u32 flags; + + kvm_sanity_check_user_mem_region_alias(); + + memset(&mem, 0, sizeof(mem)); r = -EFAULT; - if (copy_from_user(&kvm_userspace_mem, argp, - sizeof(kvm_userspace_mem))) + + if (get_user(flags, + (u32 __user *)(argp + offsetof(typeof(mem), flags)))) + goto out; + + if (flags & KVM_MEM_PRIVATE) { + r = -EINVAL; + goto out; + } + + size = sizeof(struct kvm_userspace_memory_region); + + if (copy_from_user(&mem, argp, size)) + goto out; + + r = -EINVAL; + if ((flags ^ mem.flags) & KVM_MEM_PRIVATE) goto out; - r = kvm_vm_ioctl_set_memory_region(kvm, &kvm_userspace_mem); + r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } case KVM_GET_DIRTY_LOG: { From patchwork Wed Jul 6 08:20:12 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907537 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0F4F6CCA480 for ; Wed, 6 Jul 2022 08:26:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232170AbiGFI0s (ORCPT ); Wed, 6 Jul 2022 04:26:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34330 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231142AbiGFI0X (ORCPT ); Wed, 6 Jul 2022 04:26:23 -0400 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E7FFE24084; Wed, 6 Jul 2022 01:25:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095942; x=1688631942; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=FGkTJa97HOG1krLu0LkGFpeexUG5AtVix0TiuS9l5H4=; b=JTFReDBqmRN2ZKDrXwK0d5X/q5taymz7KHt11u+CKpTCiGLLXudlEi7L fvt2LN26Df7Ewb7PCoxnC5Po8hzkGdm+sFJOGIsXPabEs/ng51R3tQTW+ P1eqJbNq6VNohlO/bpy01pp8eu2xU42EOV7q+hzSUiHu9jv1HsYuDY2y3 TwQ1+D80Ndn2jL68M8UXyLzQ/aN3HUqHZe4/Om3m8MDtOGmJQokYHZlwS BJn/ByXk9Zb/d0/rNZf2k+9cn7GVUWvDkjNWP1xq5X1l3Dn99/iqbfzP+ gcoN1yJmhIie7PkGZVX23EvOM6XPTyqmn65NUKP/t+oOERlRNQx2LwdAq g==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="272467241" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="272467241" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:25:41 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567968196" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:25:29 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 10/14] KVM: Add KVM_EXIT_MEMORY_FAULT exit Date: Wed, 6 Jul 2022 16:20:12 +0800 Message-Id: <20220706082016.2603916-11-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org This new KVM exit allows userspace to handle memory-related errors. It indicates an error happens in KVM at guest memory range [gpa, gpa+size). The flags includes additional information for userspace to handle the error. Currently bit 0 is defined as 'private memory' where '1' indicates error happens due to private memory access and '0' indicates error happens due to shared memory access. After private memory is enabled, this new exit will be used for KVM to exit to userspace for shared memory <-> private memory conversion in memory encryption usage. In such usage, typically there are two kind of memory conversions: - explicit conversion: happens when guest explicitly calls into KVM to map a range (as private or shared), KVM then exits to userspace to do the map/unmap operations. - implicit conversion: happens in KVM page fault handler. * if the fault is due to a private memory access then causes a userspace exit for a shared->private conversion when the page is recognized as shared by KVM. * If the fault is due to a shared memory access then causes a userspace exit for a private->shared conversion when the page is recognized as private by KVM. Suggested-by: Sean Christopherson Co-developed-by: Yu Zhang Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- Documentation/virt/kvm/api.rst | 22 ++++++++++++++++++++++ include/uapi/linux/kvm.h | 9 +++++++++ 2 files changed, 31 insertions(+) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 4f27c973a952..5ecfc7fbe0ee 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -6583,6 +6583,28 @@ array field represents return values. The userspace should update the return values of SBI call before resuming the VCPU. For more details on RISC-V SBI spec refer, https://github.com/riscv/riscv-sbi-doc. +:: + + /* KVM_EXIT_MEMORY_FAULT */ + struct { + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; +If exit reason is KVM_EXIT_MEMORY_FAULT then it indicates that the VCPU has +encountered a memory error which is not handled by KVM kernel module and +userspace may choose to handle it. The 'flags' field indicates the memory +properties of the exit. + + - KVM_MEMORY_EXIT_FLAG_PRIVATE - indicates the memory error is caused by + private memory access when the bit is set otherwise the memory error is + caused by shared memory access when the bit is clear. + +'gpa' and 'size' indicate the memory range the error occurs at. The userspace +may handle the error and return to KVM to retry the previous memory access. 
+ :: /* KVM_EXIT_NOTIFY */ diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index c467c69b7ad7..83c278f284dd 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -299,6 +299,7 @@ struct kvm_xen_exit { #define KVM_EXIT_XEN 34 #define KVM_EXIT_RISCV_SBI 35 #define KVM_EXIT_NOTIFY 36 +#define KVM_EXIT_MEMORY_FAULT 37 /* For KVM_EXIT_INTERNAL_ERROR */ /* Emulate instruction failed. */ @@ -530,6 +531,14 @@ struct kvm_run { #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) __u32 flags; } notify; + /* KVM_EXIT_MEMORY_FAULT */ + struct { +#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1 << 0) + __u32 flags; + __u32 padding; + __u64 gpa; + __u64 size; + } memory; /* Fix the size of the union. */ char padding[256]; }; From patchwork Wed Jul 6 08:20:13 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907538 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B6D50CCA47C for ; Wed, 6 Jul 2022 08:27:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232351AbiGFI1E (ORCPT ); Wed, 6 Jul 2022 04:27:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34180 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232300AbiGFI0f (ORCPT ); Wed, 6 Jul 2022 04:26:35 -0400 Received: from mga12.intel.com (mga12.intel.com [192.55.52.136]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2FA6A240B8; Wed, 6 Jul 2022 01:25:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095953; x=1688631953; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=E5etUVyGAjbLnFNE6o2Ik8Z3TK8BR2wbeWe9rjh6p1w=; b=cVdIwKEpxzamaRkEqMGaOi9NfpABi0RkWA9lWJUSN4LdZOPZ1Kjoqvsi ZNPp/wiBBoCop4865keh0TdoaEcWVH1izYXQkvMZkFAhtrJkvFKwAgurW X6TGmwYIjIVGZgQBtaIN+47BnirBzdXugK5gsWFNwTcuvTUCkCKHq8o5c wdho5uD+2z/BAbKiBUaVJemDcEL8xvJTmjhl8ANTcVhTg7Wbgzeaz+Pc/ AtcgC2i9bTdDVip5VXoTvgwokTI38iIMPK5fAsvhLG1ATdN+tok3iDp4j WZr9mJ6V6pY4Tj9XMuXcE5JfhKN1oaPaIgl4+wlGJkWiwckmnouSztGbV Q==; X-IronPort-AV: E=McAfee;i="6400,9594,10399"; a="263467365" X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="263467365" Received: from orsmga006.jf.intel.com ([10.7.209.51]) by fmsmga106.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Jul 2022 01:25:52 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.92,249,1650956400"; d="scan'208";a="567968229" Received: from chaop.bj.intel.com ([10.240.192.101]) by orsmga006.jf.intel.com with ESMTP; 06 Jul 2022 01:25:40 -0700 From: Chao Peng To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . 
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 11/14] KVM: Register/unregister the guest private memory regions Date: Wed, 6 Jul 2022 16:20:13 +0800 Message-Id: <20220706082016.2603916-12-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org If CONFIG_HAVE_KVM_PRIVATE_MEM=y, userspace can register/unregister the guest private memory regions through KVM_MEMORY_ENCRYPT_{UN,}REG_REGION ioctls. The patch reuses existing SEV ioctl but differs that the address in the region for private memory is gpa while SEV case it's hva. The private memory region is stored as xarray in KVM for memory efficiency in normal usages and zapping existing memory mappings is also a side effect of these two ioctls. Signed-off-by: Chao Peng --- Documentation/virt/kvm/api.rst | 17 +++++++--- arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/Kconfig | 1 + arch/x86/kvm/mmu.h | 2 -- include/linux/kvm_host.h | 8 +++++ virt/kvm/kvm_main.c | 57 +++++++++++++++++++++++++++++++++ 6 files changed, 80 insertions(+), 6 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 5ecfc7fbe0ee..dfb4caecab73 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -4715,10 +4715,19 @@ Documentation/virt/kvm/amd-memory-encryption.rst. This ioctl can be used to register a guest memory region which may contain encrypted data (e.g. guest RAM, SMRAM etc). -It is used in the SEV-enabled guest. When encryption is enabled, a guest -memory region may contain encrypted data. The SEV memory encryption -engine uses a tweak such that two identical plaintext pages, each at -different locations will have differing ciphertexts. So swapping or +Currently this ioctl supports registering memory regions for two usages: +private memory and SEV-encrypted memory. + +When private memory is enabled, this ioctl is used to register guest private +memory region and the addr/size of kvm_enc_region represents guest physical +address (GPA). In this usage, this ioctl zaps the existing guest memory +mappings in KVM that fallen into the region. + +When SEV-encrypted memory is enabled, this ioctl is used to register guest +memory region which may contain encrypted data for a SEV-enabled guest. The +addr/size of kvm_enc_region represents userspace address (HVA). The SEV +memory encryption engine uses a tweak such that two identical plaintext pages, +each at different locations will have differing ciphertexts. So swapping or moving ciphertext of those pages will not result in plaintext being swapped. So relocating (or migrating) physical backing pages for the SEV guest will require some additional steps. 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dae190e19fce..92120e3a224e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -37,6 +37,7 @@ #include #define __KVM_HAVE_ARCH_VCPU_DEBUGFS +#define __KVM_HAVE_ZAP_GFN_RANGE #define KVM_MAX_VCPUS 1024 diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 1f160801e2a7..05861b9656a4 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -50,6 +50,7 @@ config KVM select HAVE_KVM_PM_NOTIFIER if PM select HAVE_KVM_PRIVATE_MEM if X86_64 select MEMFILE_NOTIFIER if HAVE_KVM_PRIVATE_MEM + select XARRAY_MULTI if HAVE_KVM_PRIVATE_MEM help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index a99acec925eb..428cd2e88cbd 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -209,8 +209,6 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, return -(u32)fault & errcode; } -void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); - int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu); int kvm_mmu_post_init_vm(struct kvm *kvm); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 1b203c8aa696..da33f8828456 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -260,6 +260,10 @@ bool kvm_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range); #endif +#ifdef __KVM_HAVE_ZAP_GFN_RANGE +void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end); +#endif + enum { OUTSIDE_GUEST_MODE, IN_GUEST_MODE, @@ -795,6 +799,9 @@ struct kvm { struct notifier_block pm_notifier; #endif char stats_id[KVM_STATS_NAME_SIZE]; +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + struct xarray mem_attr_array; +#endif }; #define kvm_err(fmt, ...) \ @@ -1459,6 +1466,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu); int kvm_arch_post_init_vm(struct kvm *kvm); void kvm_arch_pre_destroy_vm(struct kvm *kvm); int kvm_arch_create_vm_debugfs(struct kvm *kvm); +bool kvm_arch_private_mem_supported(struct kvm *kvm); #ifndef __KVM_HAVE_ARCH_VM_ALLOC /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 230c8ff9659c..bb714c2a4b06 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -914,6 +914,35 @@ static int kvm_init_mmu_notifier(struct kvm *kvm) #endif /* CONFIG_MMU_NOTIFIER && KVM_ARCH_WANT_MMU_NOTIFIER */ +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM +#define KVM_MEM_ATTR_PRIVATE 0x0001 +static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl, + struct kvm_enc_region *region) +{ + unsigned long start, end; + void *entry; + int r; + + if (region->size == 0 || region->addr + region->size < region->addr) + return -EINVAL; + if (region->addr & (PAGE_SIZE - 1) || region->size & (PAGE_SIZE - 1)) + return -EINVAL; + + start = region->addr >> PAGE_SHIFT; + end = (region->addr + region->size - 1) >> PAGE_SHIFT; + + entry = ioctl == KVM_MEMORY_ENCRYPT_REG_REGION ? 
+ xa_mk_value(KVM_MEM_ATTR_PRIVATE) : NULL; + + r = xa_err(xa_store_range(&kvm->mem_attr_array, start, end, + entry, GFP_KERNEL_ACCOUNT)); + + kvm_zap_gfn_range(kvm, start, end + 1); + + return r; +} +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */ + #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER static int kvm_pm_notifier_call(struct notifier_block *bl, unsigned long state, @@ -1138,6 +1167,9 @@ static struct kvm *kvm_create_vm(unsigned long type) spin_lock_init(&kvm->mn_invalidate_lock); rcuwait_init(&kvm->mn_memslots_update_rcuwait); xa_init(&kvm->vcpu_array); +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + xa_init(&kvm->mem_attr_array); +#endif INIT_LIST_HEAD(&kvm->gpc_list); spin_lock_init(&kvm->gpc_lock); @@ -1305,6 +1337,9 @@ static void kvm_destroy_vm(struct kvm *kvm) kvm_free_memslots(kvm, &kvm->__memslots[i][0]); kvm_free_memslots(kvm, &kvm->__memslots[i][1]); } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + xa_destroy(&kvm->mem_attr_array); +#endif cleanup_srcu_struct(&kvm->irq_srcu); cleanup_srcu_struct(&kvm->srcu); kvm_arch_free_vm(kvm); @@ -1508,6 +1543,11 @@ static void kvm_replace_memslot(struct kvm *kvm, } } +bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) +{ + return false; +} + static int check_memory_region_flags(const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; @@ -4689,6 +4729,22 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_set_memory_region(kvm, &mem); break; } +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + case KVM_MEMORY_ENCRYPT_REG_REGION: + case KVM_MEMORY_ENCRYPT_UNREG_REGION: { + struct kvm_enc_region region; + + if (!kvm_arch_private_mem_supported(kvm)) + goto arch_vm_ioctl; + + r = -EFAULT; + if (copy_from_user(®ion, argp, sizeof(region))) + goto out; + + r = kvm_vm_ioctl_set_encrypted_region(kvm, ioctl, ®ion); + break; + } +#endif case KVM_GET_DIRTY_LOG: { struct kvm_dirty_log log; @@ -4842,6 +4898,7 @@ static long kvm_vm_ioctl(struct file *filp, r = kvm_vm_ioctl_get_stats_fd(kvm); break; default: +arch_vm_ioctl: r = kvm_arch_vm_ioctl(filp, ioctl, arg); } out: From patchwork Wed Jul 6 08:20:14 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Chao Peng X-Patchwork-Id: 12907539 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 48D71C43334 for ; Wed, 6 Jul 2022 08:27:44 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232274AbiGFI1W (ORCPT ); Wed, 6 Jul 2022 04:27:22 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36030 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232195AbiGFI05 (ORCPT ); Wed, 6 Jul 2022 04:26:57 -0400 Received: from mga14.intel.com (mga14.intel.com [192.55.52.115]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2390E24BFE; Wed, 6 Jul 2022 01:26:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1657095964; x=1688631964; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version:content-transfer-encoding; bh=uyXuuswsm4FBC92XS6UZc5YOiOshhz238zDGx3OfAMA=; b=P1hU83cdetEp4onZoBpy781BtKRUVp6JDd/VDMA38NmeFTjIhsUjxjMu O9K1YtqpzyGWYlB0j4y+OIvjsLFohNslS1JZa5c0wAN5Josva5w9RMyzc ucg3Zj0x/4E6Hrsk/kemUaKBshJRbGrlmfdq3E+3aXRsRSatQIWP/VmF4 m5+3+mxWrItOvpi9CqLY4S9Pm4PDgG3Xl/o4Pkz5mzTkGNBS1aWzw4vNm 
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org
Subject: [PATCH v7 12/14] KVM: Handle page fault for private memory
Date: Wed, 6 Jul 2022 16:20:14 +0800
Message-Id: <20220706082016.2603916-13-chao.p.peng@linux.intel.com>
In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com>
References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com>

A page fault can carry the private/shared access information for a KVM_MEM_PRIVATE memslot; architecture code (e.g. TDX code) fills this in. To handle a page fault for such an access, KVM maps the page only when the fault's private attribute matches the host's view of the page. On a successful match, the private pfn is obtained through the memfile_notifier callbacks of the private fd, while the shared pfn is obtained with the existing get_user_pages(). On a mismatch, KVM causes a KVM_EXIT_MEMORY_FAULT exit to userspace; userspace can then convert the memory between private and shared from the host's view and retry the access.
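For illustration only, below is a minimal sketch (not part of this patch) of the userspace side of the exit-and-retry flow described above. It assumes the uAPI as used elsewhere in this series: the 'memory' exit data and KVM_MEMORY_EXIT_FLAG_PRIVATE added to struct kvm_run, and the KVM_MEMORY_ENCRYPT_REG_REGION / KVM_MEMORY_ENCRYPT_UNREG_REGION ioctls that record the private/shared attribute for a GPA range. The run_vcpu_loop/vm_fd/vcpu_fd names are made up for the example and error handling is omitted.

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: convert the faulting range to the guest's view, then retry KVM_RUN. */
static void run_vcpu_loop(int vm_fd, int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                struct kvm_enc_region region;
                unsigned long req;

                if (ioctl(vcpu_fd, KVM_RUN, NULL) < 0)
                        return;
                if (run->exit_reason != KVM_EXIT_MEMORY_FAULT)
                        return;         /* other exit reasons handled elsewhere */

                /* Convert the range to the access type the guest attempted. */
                region.addr = run->memory.gpa;
                region.size = run->memory.size;
                req = (run->memory.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) ?
                        KVM_MEMORY_ENCRYPT_REG_REGION :         /* shared -> private */
                        KVM_MEMORY_ENCRYPT_UNREG_REGION;        /* private -> shared */
                ioctl(vm_fd, req, &region);
                /* Loop around: the next KVM_RUN retries the faulting access. */
        }
}

The conversion ioctl updates mem_attr_array and zaps the old mapping (see the kvm_vm_ioctl_set_encrypted_region() change earlier in the series), so the retried fault then takes the matching private or shared path described above.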
Co-developed-by: Yu Zhang Signed-off-by: Yu Zhang Signed-off-by: Chao Peng --- arch/x86/kvm/mmu/mmu.c | 60 ++++++++++++++++++++++++++++++++- arch/x86/kvm/mmu/mmu_internal.h | 18 ++++++++++ arch/x86/kvm/mmu/mmutrace.h | 1 + include/linux/kvm_host.h | 35 ++++++++++++++++++- 4 files changed, 112 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 545eb74305fe..27dbdd4fe8d1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3004,6 +3004,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, if (max_level == PG_LEVEL_4K) return PG_LEVEL_4K; + if (kvm_mem_is_private(kvm, gfn)) + return max_level; + host_level = host_pfn_mapping_level(kvm, gfn, pfn, slot); return min(host_level, max_level); } @@ -4101,10 +4104,52 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work) kvm_mmu_do_page_fault(vcpu, work->cr2_or_gpa, 0, true); } +static inline u8 order_to_level(int order) +{ + enum pg_level level; + + for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--) + if (order >= page_level_shift(level) - PAGE_SHIFT) + return level; + return level; +} + +static int kvm_faultin_pfn_private(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault) +{ + int order; + struct kvm_memory_slot *slot = fault->slot; + bool private_exist = kvm_mem_is_private(vcpu->kvm, fault->gfn); + + if (fault->is_private != private_exist) { + vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT; + if (fault->is_private) + vcpu->run->memory.flags = KVM_MEMORY_EXIT_FLAG_PRIVATE; + else + vcpu->run->memory.flags = 0; + vcpu->run->memory.padding = 0; + vcpu->run->memory.gpa = fault->gfn << PAGE_SHIFT; + vcpu->run->memory.size = PAGE_SIZE; + return RET_PF_USER; + } + + if (fault->is_private) { + if (kvm_private_mem_get_pfn(slot, fault->gfn, &fault->pfn, &order)) + return RET_PF_RETRY; + fault->max_level = min(order_to_level(order), fault->max_level); + fault->map_writable = !(slot->flags & KVM_MEM_READONLY); + return RET_PF_FIXED; + } + + /* Fault is shared, fallthrough. */ + return RET_PF_CONTINUE; +} + static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { struct kvm_memory_slot *slot = fault->slot; bool async; + int r; /* * Retry the page fault if the gfn hit a memslot that is being deleted @@ -4133,6 +4178,12 @@ static int kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return RET_PF_EMULATE; } + if (kvm_slot_can_be_private(slot)) { + r = kvm_faultin_pfn_private(vcpu, fault); + if (r != RET_PF_CONTINUE) + return r == RET_PF_FIXED ? RET_PF_CONTINUE : r; + } + async = false; fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, &async, fault->write, &fault->map_writable, @@ -4241,7 +4292,11 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault read_unlock(&vcpu->kvm->mmu_lock); else write_unlock(&vcpu->kvm->mmu_lock); - kvm_release_pfn_clean(fault->pfn); + + if (fault->is_private) + kvm_private_mem_put_pfn(fault->slot, fault->pfn); + else + kvm_release_pfn_clean(fault->pfn); return r; } @@ -5518,6 +5573,9 @@ int noinline kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 err return -EIO; } + if (r == RET_PF_USER) + return 0; + if (r < 0) return r; if (r != RET_PF_EMULATE) diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index ae2d660e2dab..fb9c298abcf0 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -188,6 +188,7 @@ struct kvm_page_fault { /* Derived from mmu and global state. 
*/ const bool is_tdp; + const bool is_private; const bool nx_huge_page_workaround_enabled; /* @@ -236,6 +237,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); * RET_PF_RETRY: let CPU fault again on the address. * RET_PF_EMULATE: mmio page fault, emulate the instruction directly. * RET_PF_INVALID: the spte is invalid, let the real page fault path update it. + * RET_PF_USER: need to exit to userspace to handle this fault. * RET_PF_FIXED: The faulting entry has been fixed. * RET_PF_SPURIOUS: The faulting entry was already fixed, e.g. by another vCPU. * @@ -252,6 +254,7 @@ enum { RET_PF_RETRY, RET_PF_EMULATE, RET_PF_INVALID, + RET_PF_USER, RET_PF_FIXED, RET_PF_SPURIOUS, }; @@ -318,4 +321,19 @@ void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc); void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp); +#ifndef CONFIG_HAVE_KVM_PRIVATE_MEM +static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot, + gfn_t gfn, kvm_pfn_t *pfn, int *order) +{ + WARN_ON_ONCE(1); + return -EOPNOTSUPP; +} + +static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ + WARN_ON_ONCE(1); +} +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */ + #endif /* __KVM_X86_MMU_INTERNAL_H */ diff --git a/arch/x86/kvm/mmu/mmutrace.h b/arch/x86/kvm/mmu/mmutrace.h index ae86820cef69..2d7555381955 100644 --- a/arch/x86/kvm/mmu/mmutrace.h +++ b/arch/x86/kvm/mmu/mmutrace.h @@ -58,6 +58,7 @@ TRACE_DEFINE_ENUM(RET_PF_CONTINUE); TRACE_DEFINE_ENUM(RET_PF_RETRY); TRACE_DEFINE_ENUM(RET_PF_EMULATE); TRACE_DEFINE_ENUM(RET_PF_INVALID); +TRACE_DEFINE_ENUM(RET_PF_USER); TRACE_DEFINE_ENUM(RET_PF_FIXED); TRACE_DEFINE_ENUM(RET_PF_SPURIOUS); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index da33f8828456..8f56426aa1e3 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -778,6 +778,10 @@ struct kvm { #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) struct mmu_notifier mmu_notifier; +#endif + +#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) || \ + defined(CONFIG_MEMFILE_NOTIFIER) unsigned long mmu_updating_seq; long mmu_updating_count; gfn_t mmu_updating_range_start; @@ -1917,7 +1921,8 @@ extern const struct _kvm_stats_desc kvm_vm_stats_desc[]; extern const struct kvm_stats_header kvm_vcpu_stats_header; extern const struct _kvm_stats_desc kvm_vcpu_stats_desc[]; -#if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) +#if (defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER)) || \ + defined(CONFIG_MEMFILE_NOTIFIER) static inline int mmu_updating_retry(struct kvm *kvm, unsigned long mmu_seq) { if (unlikely(kvm->mmu_updating_count)) @@ -2266,4 +2271,32 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu) /* Max number of entries allowed for each kvm dirty ring */ #define KVM_DIRTY_RING_MAX_ENTRIES 65536 +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM +static inline int kvm_private_mem_get_pfn(struct kvm_memory_slot *slot, + gfn_t gfn, kvm_pfn_t *pfn, int *order) +{ + int ret; + pfn_t pfnt; + pgoff_t index = gfn - slot->base_gfn + + (slot->private_offset >> PAGE_SHIFT); + + ret = slot->notifier.bs->get_pfn(slot->private_file, index, &pfnt, + order); + *pfn = pfn_t_to_pfn(pfnt); + return ret; +} + +static inline void kvm_private_mem_put_pfn(struct kvm_memory_slot *slot, + kvm_pfn_t pfn) +{ + slot->notifier.bs->put_pfn(pfn_to_pfn_t(pfn)); +} + +static inline bool 
kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) +{ + return !!xa_load(&kvm->mem_attr_array, gfn); +} + +#endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */ + #endif
From patchwork Wed Jul 6 08:20:15 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12907541
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, linux-kselftest@vger.kernel.org
Subject: [PATCH v7 13/14] KVM: Enable and expose KVM_MEM_PRIVATE
Date: Wed, 6 Jul 2022 16:20:15 +0800
Message-Id: <20220706082016.2603916-14-chao.p.peng@linux.intel.com>
In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com>
References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com>

Register the private memslot with the fd-based memory backing store and handle the memfile notifiers to zap existing mappings. Currently the registration happens at memslot creation time, and the initial support does not include page migration or swap. KVM_MEM_PRIVATE is not exposed by default; architecture code can turn it on by implementing kvm_arch_private_mem_supported(). A 'kvm' reference is added to the memslot structure since the memfile_notifier callbacks only provide a memslot reference, while kvm is needed to do the zapping.

Co-developed-by: Yu Zhang
Signed-off-by: Yu Zhang
Signed-off-by: Chao Peng
---
include/linux/kvm_host.h | 1 + virt/kvm/kvm_main.c | 117 ++++++++++++++++++++++++++++++++++++--- 2 files changed, 109 insertions(+), 9 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 8f56426aa1e3..4e5a0db68799 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -584,6 +584,7 @@ struct kvm_memory_slot { struct file *private_file; loff_t private_offset; struct memfile_notifier notifier; + struct kvm *kvm; }; static inline bool kvm_slot_can_be_private(const struct kvm_memory_slot *slot) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index bb714c2a4b06..d6f7e074cab2 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -941,6 +941,63 @@ static int kvm_vm_ioctl_set_encrypted_region(struct kvm *kvm, unsigned int ioctl return r; } + +static void kvm_memfile_notifier_invalidate(struct memfile_notifier *notifier, + pgoff_t start, pgoff_t end) +{ + struct kvm_memory_slot *slot = container_of(notifier, + struct kvm_memory_slot, + notifier); + unsigned long base_pgoff = slot->private_offset >> PAGE_SHIFT; + gfn_t start_gfn = slot->base_gfn; + gfn_t end_gfn = slot->base_gfn + slot->npages; + + + if (start > base_pgoff) + start_gfn = slot->base_gfn + start - base_pgoff; + + if (end < base_pgoff + slot->npages) + end_gfn = slot->base_gfn + end - base_pgoff; + + if (start_gfn >= end_gfn) + return; + + kvm_zap_gfn_range(slot->kvm, start_gfn, end_gfn); +} + +static struct memfile_notifier_ops kvm_memfile_notifier_ops = { + .invalidate = kvm_memfile_notifier_invalidate, +}; + +#define KVM_MEMFILE_FLAGS (MEMFILE_F_USER_INACCESSIBLE | \ + MEMFILE_F_UNMOVABLE | \ + MEMFILE_F_UNRECLAIMABLE) + +static inline int kvm_private_mem_register(struct kvm_memory_slot *slot) +{ + slot->notifier.ops = &kvm_memfile_notifier_ops; + return memfile_register_notifier(slot->private_file, KVM_MEMFILE_FLAGS, + &slot->notifier); +} + +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot) +{ + memfile_unregister_notifier(&slot->notifier); +} + +#else /* !CONFIG_HAVE_KVM_PRIVATE_MEM */ + +static inline int kvm_private_mem_register(struct kvm_memory_slot *slot) +{ + WARN_ON_ONCE(1); + return -EOPNOTSUPP; +} + +static inline void kvm_private_mem_unregister(struct kvm_memory_slot *slot) +{ +
WARN_ON_ONCE(1); +} + #endif /* CONFIG_HAVE_KVM_PRIVATE_MEM */ #ifdef CONFIG_HAVE_KVM_PM_NOTIFIER @@ -987,6 +1044,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot) /* This does not remove the slot from struct kvm_memslots data structures */ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot) { + if (slot->flags & KVM_MEM_PRIVATE) { + kvm_private_mem_unregister(slot); + fput(slot->private_file); + } + kvm_destroy_dirty_bitmap(slot); kvm_arch_free_memslot(kvm, slot); @@ -1548,10 +1610,16 @@ bool __weak kvm_arch_private_mem_supported(struct kvm *kvm) return false; } -static int check_memory_region_flags(const struct kvm_user_mem_region *mem) +static int check_memory_region_flags(struct kvm *kvm, + const struct kvm_user_mem_region *mem) { u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES; +#ifdef CONFIG_HAVE_KVM_PRIVATE_MEM + if (kvm_arch_private_mem_supported(kvm)) + valid_flags |= KVM_MEM_PRIVATE; +#endif + #ifdef __KVM_HAVE_READONLY_MEM valid_flags |= KVM_MEM_READONLY; #endif @@ -1627,6 +1695,12 @@ static int kvm_prepare_memory_region(struct kvm *kvm, { int r; + if (change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) { + r = kvm_private_mem_register(new); + if (r) + return r; + } + /* * If dirty logging is disabled, nullify the bitmap; the old bitmap * will be freed on "commit". If logging is enabled in both old and @@ -1655,6 +1729,9 @@ static int kvm_prepare_memory_region(struct kvm *kvm, if (r && new && new->dirty_bitmap && (!old || !old->dirty_bitmap)) kvm_destroy_dirty_bitmap(new); + if (r && change == KVM_MR_CREATE && new->flags & KVM_MEM_PRIVATE) + kvm_private_mem_unregister(new); + return r; } @@ -1952,7 +2029,7 @@ int __kvm_set_memory_region(struct kvm *kvm, int as_id, id; int r; - r = check_memory_region_flags(mem); + r = check_memory_region_flags(kvm, mem); if (r) return r; @@ -1971,6 +2048,10 @@ int __kvm_set_memory_region(struct kvm *kvm, !access_ok((void __user *)(unsigned long)mem->userspace_addr, mem->memory_size)) return -EINVAL; + if (mem->flags & KVM_MEM_PRIVATE && + (mem->private_offset & (PAGE_SIZE - 1) || + mem->private_offset > U64_MAX - mem->memory_size)) + return -EINVAL; if (as_id >= KVM_ADDRESS_SPACE_NUM || id >= KVM_MEM_SLOTS_NUM) return -EINVAL; if (mem->guest_phys_addr + mem->memory_size < mem->guest_phys_addr) @@ -2009,6 +2090,9 @@ int __kvm_set_memory_region(struct kvm *kvm, if ((kvm->nr_memslot_pages + npages) < kvm->nr_memslot_pages) return -EINVAL; } else { /* Modify an existing slot. */ + /* Private memslots are immutable, they can only be deleted. 
*/ + if (mem->flags & KVM_MEM_PRIVATE) + return -EINVAL; if ((mem->userspace_addr != old->userspace_addr) || (npages != old->npages) || ((mem->flags ^ old->flags) & KVM_MEM_READONLY)) @@ -2037,10 +2121,27 @@ int __kvm_set_memory_region(struct kvm *kvm, new->npages = npages; new->flags = mem->flags; new->userspace_addr = mem->userspace_addr; + if (mem->flags & KVM_MEM_PRIVATE) { + new->private_file = fget(mem->private_fd); + if (!new->private_file) { + r = -EINVAL; + goto out; + } + new->private_offset = mem->private_offset; + } + + new->kvm = kvm; r = kvm_set_memslot(kvm, old, new, change); if (r) - kfree(new); + goto out; + + return 0; + +out: + if (new->private_file) + fput(new->private_file); + kfree(new); return r; } EXPORT_SYMBOL_GPL(__kvm_set_memory_region); @@ -4712,12 +4813,10 @@ static long kvm_vm_ioctl(struct file *filp, (u32 __user *)(argp + offsetof(typeof(mem), flags)))) goto out; - if (flags & KVM_MEM_PRIVATE) { - r = -EINVAL; - goto out; - } - - size = sizeof(struct kvm_userspace_memory_region); + if (flags & KVM_MEM_PRIVATE) + size = sizeof(struct kvm_userspace_memory_region_ext); + else + size = sizeof(struct kvm_userspace_memory_region); if (copy_from_user(&mem, argp, size)) goto out;
From patchwork Wed Jul 6 08:20:16 2022
X-Patchwork-Submitter: Chao Peng
X-Patchwork-Id: 12907540
From: Chao Peng
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org,
linux-kselftest@vger.kernel.org Cc: Paolo Bonzini , Jonathan Corbet , Sean Christopherson , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , Chao Peng , "Kirill A . Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song Subject: [PATCH v7 14/14] memfd_create.2: Describe MFD_INACCESSIBLE flag Date: Wed, 6 Jul 2022 16:20:16 +0800 Message-Id: <20220706082016.2603916-15-chao.p.peng@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> References: <20220706082016.2603916-1-chao.p.peng@linux.intel.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org Signed-off-by: Chao Peng --- man2/memfd_create.2 | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/man2/memfd_create.2 b/man2/memfd_create.2 index 89e9c4136..2698222ae 100644 --- a/man2/memfd_create.2 +++ b/man2/memfd_create.2 @@ -101,6 +101,19 @@ meaning that no other seals can be set on the file. .\" FIXME Why is the MFD_ALLOW_SEALING behavior not simply the default? .\" Is it worth adding some text explaining this? .TP +.BR MFD_INACCESSIBLE +Disallow userspace access through ordinary MMU accesses via +.BR read (2), +.BR write (2) +and +.BR mmap (2). +The file size cannot be changed once initialized. +This flag cannot coexist with +.B MFD_ALLOW_SEALING +and when this flag is set, the initial set of seals will be +.B F_SEAL_SEAL, +meaning that no other seals can be set on the file. +.TP .BR MFD_HUGETLB " (since Linux 4.14)" .\" commit 749df87bd7bee5a79cef073f5d032ddb2b211de8 The anonymous file will be created in the hugetlbfs filesystem using