[v13,16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

Introduce an ioctl(), KVM_CREATE_GUEST_MEMFD, to allow creating file-based
memory that is tied to a specific KVM virtual machine and whose primary
purpose is to serve guest memory.
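
As a rough sketch of how userspace might call the new ioctl(), the helper
below assumes the uAPI added to include/uapi/linux/kvm.h by this patch,
i.e. a struct kvm_create_guest_memfd with "size" and "flags" fields that is
passed to KVM_CREATE_GUEST_MEMFD on a VM file descriptor.  The helper name
create_guest_memfd() is hypothetical, and zeroing the unused fields is an
assumption about what KVM expects, not a statement of the ABI.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Hypothetical helper: ask KVM for a guest_memfd of 'size' bytes that is
 * bound to the VM referenced by 'vm_fd'.  Returns the new file descriptor
 * on success, or -1 with errno set on failure.
 */
int create_guest_memfd(int vm_fd, uint64_t size)
{
#ifdef KVM_CREATE_GUEST_MEMFD
	struct kvm_create_guest_memfd args;

	/* Assumption: flags and any reserved fields should be zero. */
	memset(&args, 0, sizeof(args));
	args.size = size;

	return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
#else
	/* The installed kernel headers predate guest_memfd. */
	(void)vm_fd;
	(void)size;
	errno = ENOTTY;
	return -1;
#endif
}
```

Here vm_fd would be the descriptor returned by KVM_CREATE_VM.  Note that
the resulting fd is meant to back guest memory, not to be mmap()'d by the
caller.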

A guest-first memory subsystem allows for optimizations and enhancements
that are kludgy or outright infeasible to implement/support in a generic
memory subsystem.  With guest_memfd, guest protections and mapping sizes
are fully decoupled from host userspace mappings.   E.g. KVM currently
doesn't support mapping memory as writable in the guest without it also
being writable in host userspace, as KVM's ABI uses VMA protections to
define the allowed guest protection.  Userspace can fudge this by
establishing two mappings, a writable mapping for the guest and a readable
one for itself, but that's suboptimal on multiple fronts.
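
The two-mapping workaround described above can be sketched roughly as
follows.  This is an illustration only: struct dual_map and
dual_map_create() are hypothetical names, and an unlinked temporary file
stands in for whatever shared backing store a real VMM would use (more
likely a memfd).  The writable view is the one whose address would be
handed to KVM as the memslot's userspace address; the read-only view is
what the VMM itself would dereference.

```c
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical container for two views of the same backing file. */
struct dual_map {
	int fd;
	void *guest_view;	/* PROT_READ | PROT_WRITE, given to KVM */
	void *host_view;	/* PROT_READ only, used by the VMM itself */
};

/* Returns 0 on success, -1 on failure. */
int dual_map_create(struct dual_map *m, size_t size)
{
	char path[] = "/tmp/guest-ram-XXXXXX";

	m->fd = mkstemp(path);
	if (m->fd < 0)
		return -1;
	unlink(path);		/* keep the file anonymous */

	if (ftruncate(m->fd, (off_t)size) < 0)
		goto err_close;

	/* Writable mapping: VMA protections allow guest writes. */
	m->guest_view = mmap(NULL, size, PROT_READ | PROT_WRITE,
			     MAP_SHARED, m->fd, 0);
	if (m->guest_view == MAP_FAILED)
		goto err_close;

	/* Read-only mapping of the same pages for the VMM's own use. */
	m->host_view = mmap(NULL, size, PROT_READ, MAP_SHARED, m->fd, 0);
	if (m->host_view == MAP_FAILED)
		goto err_unmap;

	return 0;

err_unmap:
	munmap(m->guest_view, size);
err_close:
	close(m->fd);
	return -1;
}
```

The cost is visible even in this toy: two VMAs per region, and the
writable view still exists in the VMM's address space, so the hardening
is only as good as the VMM's discipline in never touching it.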

Similarly, KVM currently requires the guest mapping size to be a strict
subset of the host userspace mapping size, e.g. KVM doesn't support
creating a 1GiB guest mapping unless userspace also has a 1GiB
mapping.  Decoupling the mapping sizes would allow userspace to precisely
map only what is needed without impacting guest performance, e.g. to
harden against unintentional accesses to guest memory.

Decoupling guest and userspace mappings may also allow for a cleaner
alternative to high-granularity mappings for HugeTLB, which has reached a
bit of an impasse and is unlikely to ever be merged.

A guest-first memory subsystem also provides a clearer line of sight to
things like a dedicated memory pool (for slice-of-hardware VMs) and
elimination of "struct page" (for offload setups where userspace _never_
needs to mmap() guest memory).

More immediately, being able to map memory into KVM guests without mapping
said memory into the host is critical for Confidential VMs (CoCo VMs), the
initial use case for guest_memfd.  While AMD's SEV and Intel's TDX prevent
untrusted software from reading guest private data by encrypting guest
memory with a key that isn't usable by the untrusted host, projects such
as Protected KVM (pKVM) provide confidentiality and integrity *without*
relying on memory encryption.  And with SEV-SNP and TDX, accessing guest
private memory can be fatal to the host, i.e. KVM must prevent host
userspace from accessing guest memory irrespective of hardware behavior.

Attempt #1 to support CoCo VMs was to add a VMA flag to mark memory as
being mappable only by KVM (or a similarly enlightened kernel subsystem).
That approach was abandoned largely due to it needing to play games with
PROT_NONE to prevent userspace from accessing guest memory.

Attempt #2 was to usurp PG_hwpoison to prevent the host from mapping
guest private memory into userspace, but that approach failed to meet
several requirements for software-based CoCo VMs, e.g. pKVM, as the kernel
wouldn't easily be able to enforce a 1:1 page:guest association, let alone
a 1:1 pfn:gfn mapping.  And using PG_hwpoison does not work for memory
that isn't backed by 'struct page', e.g. if devices gain support for
exposing encrypted memory regions to guests.

Attempt #3 was to extend the memfd() syscall and wrap shmem to provide
dedicated file-based guest memory.  That approach made it as far as v10
before feedback from Hugh Dickins and Christian Brauner (and others) led
to its demise.

Hugh's objection was that piggybacking shmem made no sense for KVM's use
case as KVM didn't actually *want* the features provided by shmem.  I.e.
KVM was using memfd() and shmem to avoid having to manage memory directly,
not because memfd() and shmem were the optimal solution, e.g. things like
read/write/mmap in shmem were dead weight.

Christian pointed out flaws with implementing a partial overlay (wrapping
only _some_ of shmem), e.g. poking at inode_operations or super_operations
would show shmem stuff, but address_space_operations and file_operations
would show KVM's overlay.  Paraphrasing heavily, Christian suggested KVM
stop being lazy and create a proper API.

Link: https://lore.kernel.org/all/20201020061859.18385-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210416154106.23721-1-kirill.shutemov@linux.intel.com
Link: https://lore.kernel.org/all/20210824005248.200037-1-seanjc@google.com
Link: https://lore.kernel.org/all/20211111141352.26311-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/20221202061347.1070246-1-chao.p.peng@linux.intel.com
Link: https://lore.kernel.org/all/ff5c5b97-acdf-9745-ebe5-c6609dd6322e@google.com
Link: https://lore.kernel.org/all/20230418-anfallen-irdisch-6993a61be10b@brauner
Link: https://lore.kernel.org/all/ZEM5Zq8oo+xnApW9@google.com
Link: https://lore.kernel.org/linux-mm/20230306191944.GA15773@monkey
Link: https://lore.kernel.org/linux-mm/ZII1p8ZHlHaQ3dDl@casper.infradead.org
Cc: Fuad Tabba <tabba@google.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerley Tng <ackerleytng@google.com>
Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Maciej Szmigiero <mail@maciej.szmigiero.name>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Quentin Perret <qperret@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Wang <wei.w.wang@intel.com>
Cc: Liam Merwick <liam.merwick@oracle.com>
Cc: Isaku Yamahata <isaku.yamahata@gmail.com>
Co-developed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Co-developed-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Co-developed-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Co-developed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 Documentation/virt/kvm/api.rst |  69 ++++-
 include/linux/kvm_host.h       |  48 +++
 include/uapi/linux/kvm.h       |  15 +-
 virt/kvm/Kconfig               |   4 +
 virt/kvm/Makefile.kvm          |   1 +
 virt/kvm/guest_memfd.c         | 548 +++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c            |  68 +++-
 virt/kvm/kvm_mm.h              |  26 ++
 8 files changed, 764 insertions(+), 15 deletions(-)
 create mode 100644 virt/kvm/guest_memfd.c

Message ID	20231027182217.3615211-17-seanjc@google.com (mailing list archive)
From	Sean Christopherson <seanjc@google.com>
Date	Fri, 27 Oct 2023 11:21:58 -0700
Series	KVM: guest_memfd() and per-page attributes
