Message ID: 20240709132041.3625501-1-roypat@amazon.co.uk (mailing list archive)
Series: Unmapping guest_memfd from Direct Map
On 7/9/24 3:20 PM, Patrick Roy wrote:
> Hey all,
>
> This RFC series is a rough draft adding support for running non-confidential compute VMs in guest_memfd, based on prior discussions with Sean [1]. Our specific use case for this is the ability to unmap guest memory from the host kernel's direct map, as a mitigation against a large class of speculative execution issues.
>
> === Implementation ===
>
> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD` ioctl to remove its pages from the direct map when they are allocated. When trying to run a guest from such a VM, we now face the problem that without either userspace or kernelspace mappings of guest_memfd, KVM cannot access guest memory to, for example, do MMIO emulation or access memory used for guest/host communication. We have multiple options for solving this when running non-CoCo VMs: (1) implement a TDX-light solution, where the guest shares memory that KVM needs to access, and relies on paravirtual solutions where this is not possible (e.g. MMIO), (2) have KVM use userspace mappings of guest_memfd (e.g. a memfd_secret-style solution), or (3) dynamically reinsert pages into the direct map whenever KVM wants to access them.
>
> This RFC goes for option (3). Option (1) is a lot of overhead for very little gain, since we are not actually constrained by a physical inability to access guest memory (e.g. we are not in a TDX context where accesses to guest memory cause a #MC). Option (2) has previously been rejected [1].

Do the pages have to have the same address when they are temporarily mapped? Wouldn't it be easier to do something similar to kmap_local_page() used for HIGHMEM? I.e. you get a temporary kernel mapping to do what's needed, but it doesn't have to alter the shared direct map.

Maybe that was already discussed somewhere as unsuitable, but I didn't spot it here.

> In this patch series, we make sufficient parts of KVM gmem-aware to be able to boot a Linux initrd from private memory on x86. These include KVM's MMIO emulation (including guest page table walking) and kvm-clock. For VM types which do not allow accessing gmem, we return -EFAULT and attempt to prepare a KVM_EXIT_MEMORY_FAULT.
>
> Additionally, this patch series adds support for "restricted" userspace mappings of guest_memfd, which work similarly to memfd_secret (e.g. disallow get_user_pages), which allows handling I/O and loading the guest kernel in a simple way. Support for this is completely independent of the rest of the functionality introduced in this patch series. However, it is required to build a minimal hypervisor PoC that actually allows booting a VM from a disk.
>
> === Performance ===
>
> We have run some preliminary performance benchmarks to assess the impact of on-the-fly direct map manipulations. We were mainly interested in the impact of manipulating the direct map for MMIO emulation on virtio-mmio. In particular, we were worried about the impact of the TLB and L1/2/3 cache flushes that set_memory_[n]p entails.
>
> In our setup, we have taken a modified Firecracker VMM, spawned a Linux guest with 1 vCPU, and used fio to stress a virtio_blk device. We found that the cache flushes caused throughput to drop from around 600MB/s to ~50MB/s (~90%) for both reads and writes (on an Intel(R) Xeon(R) Platinum 8375C CPU with 64 cores).
> We then converted our prototype to use set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p and found that without cache flushes the pure impact of the direct map manipulation is indistinguishable from noise. This is why we use set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p in this RFC.
>
> Note that in this comparison, both the baseline and the guest_memfd-supporting version of Firecracker were made to bounce I/O buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs, the virtio stack cannot directly pass guest buffers to read/write syscalls.
>
> === Security ===
>
> We want to use unmapping guest memory from the host kernel as a security mitigation against transient execution attacks. Temporarily restoring direct map entries whenever KVM requires access to guest memory leaves a gap in this mitigation. We believe this to be acceptable for the above cases, since pages used for paravirtual guest/host communication (e.g. kvm-clock) and guest page tables do not contain sensitive data. MMIO emulation will only end up reading pages containing privileged instructions (e.g. guest kernel code).
>
> === Summary ===
>
> Patches 1-4 are about hot-patching various points inside of KVM that access guest memory to correctly handle the case where memory happens to be guest-private. This means either handling the access as a memory error, or simply accessing the memslot's guest_memfd instead of looking at the userspace-provided VMA if the VM type allows these kinds of accesses. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will make it remove its pages from the kernel's direct map. Whenever KVM wants to access guest-private memory, it will temporarily re-insert the relevant pages. Patches 7-8 allow for restricted userspace mappings (e.g. get_user_pages paths are disabled, like for memfd_secret) of guest_memfd, so that userspace has an easy path for loading the guest kernel and handling I/O buffers.
>
> === ToDos / Limitations ===
>
> There are still a few rough edges that need to be addressed before dropping the "RFC" tag, e.g.
>
> * Handle errors of set_direct_map_default_noflush in kvm_gmem_invalidate_folio instead of calling BUG_ON
> * Lift the limitation of "at most one gfn_to_pfn_cache for each gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map entries when needed"). It currently means that guests with more than 1 vcpu fail to boot, because multiple vcpus can put their kvm-clock PV structures into the same page (gfn)
> * Write selftests, particularly around hole punching, direct map removal, and mmap.
>
> Lastly, there's the question of nested virtualization, which Sean brought up in previous discussions and which runs into similar problems as MMIO. I have looked at it very briefly. On Intel, KVM uses various gfn->uhva caches, which run into similar problems as the gfn_to_hva_caches dealt with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory is private"). However, previous attempts at just converting this to gfn_to_pfn_cache (which would make them work with guest_memfd) proved complicated [2]. I suppose initially, we should probably disallow nested virtualization in VMs that have their memory removed from the direct map.
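As a rough illustration of the temporary re-insertion described in the Performance and Summary sections above, a hedged sketch follows. This is not code from the series: kvm_gmem_access_page() is a hypothetical helper invented for illustration, while set_direct_map_default_noflush()/set_direct_map_invalid_noflush() are the existing kernel primitives the cover letter names.

    /*
     * Hypothetical sketch of option (3): temporarily re-insert a
     * guest_memfd page into the direct map, access it, then remove it
     * again.  The _noflush variants avoid the TLB/cache flushes that
     * made set_memory_[n]p so expensive in the fio benchmark above.
     */
    #include <linux/set_memory.h>
    #include <linux/mm.h>
    #include <linux/string.h>

    static int kvm_gmem_access_page(struct page *page, void *buf,
                                    size_t offset, size_t len, bool write)
    {
            void *vaddr;
            int r;

            /* Restore the page's direct map entry so the kernel can touch it. */
            r = set_direct_map_default_noflush(page);
            if (r)
                    return r;

            vaddr = page_address(page) + offset;
            if (write)
                    memcpy(vaddr, buf, len);
            else
                    memcpy(buf, vaddr, len);

            /* Remove the direct map entry again once the access is done. */
            return set_direct_map_invalid_noflush(page);
    }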
>
> Best,
> Patrick
>
> [1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com/
> [2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@google.com/
>
> Patrick Roy (8):
>   kvm: Allow reading/writing gmem using kvm_{read,write}_guest
>   kvm: use slowpath in gfn_to_hva_cache if memory is private
>   kvm: pfncache: enlighten about gmem
>   kvm: x86: support walking guest page tables in gmem
>   kvm: gmem: add option to remove guest private memory from direct map
>   kvm: gmem: Temporarily restore direct map entries when needed
>   mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
>   kvm: gmem: Allow restricted userspace mappings
>
>  arch/x86/kvm/mmu/paging_tmpl.h |  94 +++++++++++++++++++-----
>  include/linux/kvm_host.h       |   5 ++
>  include/linux/kvm_types.h      |   1 +
>  include/linux/secretmem.h      |  13 +++-
>  include/uapi/linux/kvm.h       |   2 +
>  mm/secretmem.c                 |   6 +-
>  virt/kvm/guest_memfd.c         |  83 +++++++++++++++++++--
>  virt/kvm/kvm_main.c            | 112 +++++++++++++++++++++++++++-
>  virt/kvm/pfncache.c            | 130 +++++++++++++++++++++++++++++----
>  9 files changed, 399 insertions(+), 47 deletions(-)
>
> base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84
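For context, a minimal userspace sketch of creating such a guest_memfd is shown below. struct kvm_create_guest_memfd and the KVM_CREATE_GUEST_MEMFD ioctl are existing uAPI; the flag name and value are placeholders standing in for whatever patches 5-6 actually define.

    /* Hedged userspace sketch: create a guest_memfd whose pages are
     * removed from the direct map.  KVM_GMEM_NO_DIRECT_MAP is a
     * placeholder name/value for the flag added by the series. */
    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    #ifndef KVM_GMEM_NO_DIRECT_MAP
    #define KVM_GMEM_NO_DIRECT_MAP (1ULL << 0)   /* placeholder */
    #endif

    static int create_unmapped_gmem(int vm_fd, __u64 size)
    {
            struct kvm_create_guest_memfd args = {
                    .size  = size,
                    .flags = KVM_GMEM_NO_DIRECT_MAP,
            };

            /* Returns a new guest_memfd file descriptor on success. */
            return ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &args);
    }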
On Mon, 2024-07-22 at 13:28 +0100, "Vlastimil Babka (SUSE)" wrote:
>> === Implementation ===
>>
>> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD` ioctl to remove its pages from the direct map when they are allocated. When trying to run a guest from such a VM, we now face the problem that without either userspace or kernelspace mappings of guest_memfd, KVM cannot access guest memory to, for example, do MMIO emulation or access memory used for guest/host communication. We have multiple options for solving this when running non-CoCo VMs: (1) implement a TDX-light solution, where the guest shares memory that KVM needs to access, and relies on paravirtual solutions where this is not possible (e.g. MMIO), (2) have KVM use userspace mappings of guest_memfd (e.g. a memfd_secret-style solution), or (3) dynamically reinsert pages into the direct map whenever KVM wants to access them.
>>
>> This RFC goes for option (3). Option (1) is a lot of overhead for very little gain, since we are not actually constrained by a physical inability to access guest memory (e.g. we are not in a TDX context where accesses to guest memory cause a #MC). Option (2) has previously been rejected [1].
>
> Do the pages have to have the same address when they are temporarily mapped? Wouldn't it be easier to do something similar to kmap_local_page() used for HIGHMEM? I.e. you get a temporary kernel mapping to do what's needed, but it doesn't have to alter the shared direct map.
>
> Maybe that was already discussed somewhere as unsuitable, but I didn't spot it here.

For what I had prototyped here, there's no requirement to have the pages mapped at the same address (I remember briefly looking at memremap to achieve the temporary mappings, but since that doesn't work for normal memory, I gave up on that path). However, I think guest_memfd is moving in a direction where ranges marked as "in-place shared" (e.g. those that are temporarily reinserted into the direct map in this RFC) should be able to be GUP'd [1]. I think for that the direct map entries would need to be present, right?

>> In this patch series, we make sufficient parts of KVM gmem-aware to be able to boot a Linux initrd from private memory on x86. These include KVM's MMIO emulation (including guest page table walking) and kvm-clock. For VM types which do not allow accessing gmem, we return -EFAULT and attempt to prepare a KVM_EXIT_MEMORY_FAULT.
>>
>> Additionally, this patch series adds support for "restricted" userspace mappings of guest_memfd, which work similarly to memfd_secret (e.g. disallow get_user_pages), which allows handling I/O and loading the guest kernel in a simple way. Support for this is completely independent of the rest of the functionality introduced in this patch series. However, it is required to build a minimal hypervisor PoC that actually allows booting a VM from a disk.

[1]: https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
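To make the kmap_local_page() idea discussed above concrete, here is a hedged sketch; read_guest_page_local() is a hypothetical helper. One caveat: without CONFIG_HIGHMEM, kmap_local_page() resolves to the page's direct-map address, so on typical 64-bit hosts it only provides a usable mapping if the direct map entry is still present.

    /* Hedged sketch of the kmap_local_page() alternative: a temporary,
     * CPU-local kernel mapping that does not modify the shared direct
     * map.  Note that on !CONFIG_HIGHMEM kernels this boils down to
     * page_address(), i.e. the direct map address, so extra work would
     * be needed once that entry has been removed. */
    #include <linux/highmem.h>
    #include <linux/string.h>

    static void read_guest_page_local(struct page *page, void *dst,
                                      size_t offset, size_t len)
    {
            void *vaddr = kmap_local_page(page);   /* temporary mapping */

            memcpy(dst, vaddr + offset, len);
            kunmap_local(vaddr);
    }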
On Tue, Jul 9, 2024 at 6:21 AM Patrick Roy <roypat@amazon.co.uk> wrote:
>
> Hey all,
>
> This RFC series is a rough draft adding support for running non-confidential compute VMs in guest_memfd, based on prior discussions with Sean [1]. Our specific use case for this is the ability to unmap guest memory from the host kernel's direct map, as a mitigation against a large class of speculative execution issues.

Not to sound like a salesman, but did you happen to come across the RFC for ASI?
https://lore.kernel.org/lkml/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/

The current implementation considers userspace allocations as sensitive, so when a VM is running with ASI, the memory of other VMs is unmapped from the direct map (i.e. in the restricted address space). It also incorporates a mechanism to map this memory on demand when needed (i.e. switch to the unrestricted address space), and to run mitigations at that point to make sure it isn't exploited.

In theory, it should be a more generic approach, because it should apply to VMs that do not use guest_memfd as well, and it should be extensible to protect other parts of memory (e.g. sensitive kernel allocations).

I understand that unmapping guest_memfd memory from the direct map in general could still be favorable, and for other reasons beyond mitigating speculative execution attacks. Just thought you may be interested in looking at ASI.
On 26.07.24 08:55, Patrick Roy wrote:
> On Mon, 2024-07-22 at 13:28 +0100, "Vlastimil Babka (SUSE)" wrote:
>>> === Implementation ===
>>>
>>> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD` ioctl to remove its pages from the direct map when they are allocated. When trying to run a guest from such a VM, we now face the problem that without either userspace or kernelspace mappings of guest_memfd, KVM cannot access guest memory to, for example, do MMIO emulation or access memory used for guest/host communication. We have multiple options for solving this when running non-CoCo VMs: (1) implement a TDX-light solution, where the guest shares memory that KVM needs to access, and relies on paravirtual solutions where this is not possible (e.g. MMIO), (2) have KVM use userspace mappings of guest_memfd (e.g. a memfd_secret-style solution), or (3) dynamically reinsert pages into the direct map whenever KVM wants to access them.
>>>
>>> This RFC goes for option (3). Option (1) is a lot of overhead for very little gain, since we are not actually constrained by a physical inability to access guest memory (e.g. we are not in a TDX context where accesses to guest memory cause a #MC). Option (2) has previously been rejected [1].
>>
>> Do the pages have to have the same address when they are temporarily mapped? Wouldn't it be easier to do something similar to kmap_local_page() used for HIGHMEM? I.e. you get a temporary kernel mapping to do what's needed, but it doesn't have to alter the shared direct map.
>>
>> Maybe that was already discussed somewhere as unsuitable, but I didn't spot it here.
>
> For what I had prototyped here, there's no requirement to have the pages mapped at the same address (I remember briefly looking at memremap to achieve the temporary mappings, but since that doesn't work for normal memory, I gave up on that path). However, I think guest_memfd is moving in a direction where ranges marked as "in-place shared" (e.g. those that are temporarily reinserted into the direct map in this RFC) should be able to be GUP'd [1]. I think for that the direct map entries would need to be present, right?

Yes, we'd allow GUP. Of course, one could think of a similar extension like secretmem that would allow shared memory to get mapped into user page tables but would disallow any GUP on (shared) guest_memfd memory.
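For illustration, the secretmem-style GUP restriction being discussed could look roughly like the hedged sketch below; mapping_inaccessible() and gup_must_reject() are hypothetical helper names, while AS_INACCESSIBLE is the address-space flag that patch 7 ("mm: secretmem: use AS_INACCESSIBLE to prohibit GUP") builds on.

    /* Hedged sketch: a mapping-level predicate that GUP paths could
     * consult, so that a file can be mmap()ed into userspace while
     * get_user_pages() on it always fails, similar to memfd_secret. */
    #include <linux/pagemap.h>

    static inline bool mapping_inaccessible(struct address_space *mapping)
    {
            return mapping && test_bit(AS_INACCESSIBLE, &mapping->flags);
    }

    /* A GUP path would then bail out for folios of such mappings, e.g.: */
    static inline bool gup_must_reject(struct folio *folio)
    {
            return folio_mapping(folio) &&
                   mapping_inaccessible(folio_mapping(folio));
    }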