[RFC,0/8] Unmapping guest_memfd from Direct Map

Message ID 20240709132041.3625501-1-roypat@amazon.co.uk (mailing list archive)

Message

Patrick Roy July 9, 2024, 1:20 p.m. UTC
Hey all,

This RFC series is a rough draft adding support for running
non-confidential compute VMs in guest_memfd, based on prior discussions
with Sean [1]. Our specific use case for this is the ability to unmap
guest memory from the host kernel's direct map, as a mitigation against
a large class of speculative execution issues.

=== Implementation ===

This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD`
ioctl that removes the guest_memfd's pages from the direct map when they
are allocated. When trying to run a guest from such a VM, we now face the
problem that without either userspace or kernelspace mappings of
guest_memfd, KVM cannot access guest memory to, for example, do MMIO
emulation or access memory used for guest/host communication. We have
multiple options for
solving this when running non-CoCo VMs: (1) implement a TDX-light
solution, where the guest shares memory that KVM needs to access, and
relies on paravirtual solutions where this is not possible (e.g. MMIO),
(2) have KVM use userspace mappings of guest_memfd (e.g. a
memfd_secret-style solution), or (3) dynamically reinsert pages into the
direct map whenever KVM wants to access them.

This RFC goes for option (3). Option (1) is a lot of overhead for very
little gain, since we are not actually constrained by a physical
inability to access guest memory (e.g. we are not in a TDX context where
accesses to guest memory cause a #MC). Option (2) has previously been
rejected [1].

In this patch series, we make sufficient parts of KVM gmem-aware to be
able to boot a Linux initrd from private memory on x86. These include
KVM's MMIO emulation (including guest page table walking) and kvm-clock.
For VM types which do not allow accessing gmem, we return -EFAULT and
attempt to prepare a KVM_EXIT_MEMORY_FAULT.
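
As a rough illustration of the access path described above, here is a
userspace toy model, not kernel code. It assumes a hypothetical `toy_vm`
structure in which a per-page boolean stands in for direct map presence;
all `toy_*` names are invented for this sketch. The real series uses
set_direct_map_{invalid,default}_noflush on actual kernel page tables.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_NR_PAGES  4

/* Toy stand-in for a guest_memfd-backed VM; not a KVM structure. */
struct toy_vm {
    bool allow_gmem_access;             /* does the VM type permit KVM access? */
    bool in_direct_map[TOY_NR_PAGES];   /* models direct map presence per page */
    unsigned char mem[TOY_NR_PAGES][TOY_PAGE_SIZE];
};

/* Models allocation with the new flag: every page starts out unmapped. */
static void toy_vm_init(struct toy_vm *vm, bool allow_gmem_access)
{
    memset(vm, 0, sizeof(*vm));
    vm->allow_gmem_access = allow_gmem_access;
}

/*
 * Models a KVM-internal read of guest-private memory (option 3):
 * restore the mapping, copy, then drop the mapping again.
 * Returns 0 on success or a negative errno.
 */
static int toy_read_guest(struct toy_vm *vm, size_t page, size_t off,
                          void *dst, size_t len)
{
    if (page >= TOY_NR_PAGES || off > TOY_PAGE_SIZE ||
        len > TOY_PAGE_SIZE - off)
        return -EINVAL;

    /* The series would also try to prepare a KVM_EXIT_MEMORY_FAULT here. */
    if (!vm->allow_gmem_access)
        return -EFAULT;

    vm->in_direct_map[page] = true;     /* models set_direct_map_default_noflush */
    memcpy(dst, &vm->mem[page][off], len);
    vm->in_direct_map[page] = false;    /* models set_direct_map_invalid_noflush */
    return 0;
}
```

The point of the sketch is only the ordering: the page is reachable
through the (modelled) direct map for the duration of the copy and is
unmapped again before the function returns.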

Additionally, this patch series adds support for "restricted" userspace
mappings of guest_memfd. These work similarly to memfd_secret (e.g. they
disallow get_user_pages) and allow handling I/O and loading the
guest kernel in a simple way. Support for this is completely independent
of the rest of the functionality introduced in this patch series.
However, it is required to build a minimal hypervisor PoC that actually
allows booting a VM from a disk.

=== Performance ===

We have run some preliminary performance benchmarks to assess the impact
of on-the-fly direct map manipulations. We were mainly interested in the
impact of manipulating the direct map for MMIO emulation on virtio-mmio.
Particularly, we were worried about the impact of the TLB and L1/2/3
cache flushes that set_memory_[n]p entails.

In our setup, we have taken a modified Firecracker VMM, spawned a Linux
guest with 1 vCPU, and used fio to stress a virtio_blk device. We found
that the cache flushes caused throughput to drop from around 600MB/s to
~50MB/s (~90%) for both reads and writes (on an Intel(R) Xeon(R) Platinum
8375C CPU with 64 cores). We then converted our prototype to use
set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p and
found that without cache flushes the pure impact of the direct map
manipulation is indistinguishable from noise. This is why we use
set_direct_map_{invalid,default}_noflush instead of set_memory_[n]p in
this RFC.

Note that in this comparison, both the baseline and the
guest_memfd-supporting version of Firecracker were made to bounce I/O
buffers in VMM userspace. As GUP is disabled for the guest_memfd VMAs,
the virtio stack cannot directly pass guest buffers to read/write
syscalls.
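
A minimal sketch of what bouncing an I/O buffer in VMM userspace can
look like, assuming a hypothetical helper named `bounce_read`. It is not
Firecracker's implementation; in a real VMM, `guest_buf` would point
into a restricted guest_memfd mapping, whereas here any writable buffer
serves as a stand-in.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BOUNCE_SIZE 4096

/*
 * Read up to len bytes from fd into guest_buf via a VMM-owned bounce
 * buffer. Returns the number of bytes copied, or a negative errno.
 */
static ssize_t bounce_read(int fd, void *guest_buf, size_t len)
{
    unsigned char bounce[BOUNCE_SIZE];
    size_t done = 0;

    while (done < len) {
        size_t chunk = len - done;
        if (chunk > sizeof(bounce))
            chunk = sizeof(bounce);

        /* The syscall only ever sees ordinary VMM memory. */
        ssize_t n = read(fd, bounce, chunk);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        if (n == 0)
            break;  /* end of file */

        /* Plain CPU copy through the userspace mapping of guest memory. */
        memcpy((unsigned char *)guest_buf + done, bounce, (size_t)n);
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The extra memcpy per request is the cost both configurations pay in the
comparison above, so it does not skew the measured difference.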

=== Security ===

We want to unmap guest memory from the host kernel's direct map as a
security mitigation against transient execution attacks. Temporarily restoring
direct map entries whenever KVM requires access to guest memory leaves a
gap in this mitigation. We believe this to be acceptable for the above
cases, since pages used for paravirtual guest/host communication (e.g.
kvm-clock) and guest page tables do not contain sensitive data. MMIO
emulation will only end up reading pages containing privileged
instructions (e.g. guest kernel code).

=== Summary ===

Patches 1-4 are about hot-patching various points inside of KVM that
access guest memory to correctly handle the case where memory happens to
be guest-private. This means either handling the access as a memory
error, or simply accessing the memslot's guest_memfd instead of looking
at the userspace-provided VMA if the VM type allows these kinds of
accesses. Patches 5-6 add a flag to KVM_CREATE_GUEST_MEMFD that will
make it remove its pages from the kernel's direct map. Whenever KVM
wants to access guest-private memory, it will temporarily re-insert the
relevant pages. Patches 7-8 allow for restricted userspace mappings
(e.g. get_user_pages paths are disabled like for memfd_secret) of
guest_memfd, so that userspace has an easy path for loading the guest
kernel and handling I/O-buffers.

=== ToDos / Limitations ===

There are still a few rough edges that need to be addressed before
dropping the "RFC" tag, e.g.

* Handle errors of set_direct_map_default_noflush in
  kvm_gmem_invalidate_folio instead of calling BUG_ON
* Lift the limitation of "at most one gfn_to_pfn_cache for each
  gfn/pfn" in e1c61f0a7963 ("kvm: gmem: Temporarily restore direct map
  entries when needed"). It currently means that guests with more than 1
  vcpu fail to boot, because multiple vcpus can put their kvm-clock PV
  structures into the same page (gfn)
* Write selftests, particularly around hole punching, direct map removal,
  and mmap.
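
To make the second limitation concrete: one conceivable direction (an
assumption of this sketch, not something the series implements or
proposes) is to count users per page, so that several caches pointing at
the same gfn can coexist. The `toy_page` structure and its helpers below
are hypothetical and model the direct map entry as a boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-page state; not a kernel structure. */
struct toy_page {
    unsigned int users;     /* number of active caches using this page */
    bool in_direct_map;     /* models direct map presence */
};

/* The first user restores the entry; later users only bump the count. */
static void toy_page_get(struct toy_page *pg)
{
    if (pg->users++ == 0)
        pg->in_direct_map = true;
}

/* The last user removes the entry again. */
static void toy_page_put(struct toy_page *pg)
{
    assert(pg->users > 0);
    if (--pg->users == 0)
        pg->in_direct_map = false;
}
```

With such a count, two vcpus placing their kvm-clock structures in the
same page would each take a reference, and the page would stay mapped
until both have dropped theirs.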

Lastly, there's the question of nested virtualization which Sean brought
up in previous discussions, which runs into similar problems as MMIO. I
have looked at it very briefly. On Intel, KVM uses various gfn->uhva
caches, which run into similar problems as the gfn_to_hva_caches dealt
with in 200834b15dda ("kvm: use slowpath in gfn_to_hva_cache if memory
is private"). However, previous attempts at just converting this to
gfn_to_pfn_cache (which would make them work with guest_memfd) proved
complicated [2]. I suppose initially, we should probably disallow nested
virtualization in VMs that have their memory removed from the direct
map.

Best,
Patrick

[1]: https://lore.kernel.org/linux-mm/cc1bb8e9bc3e1ab637700a4d3defeec95b55060a.camel@amazon.com/
[2]: https://lore.kernel.org/kvm/ZBEEQtmtNPaEqU1i@google.com/

Patrick Roy (8):
  kvm: Allow reading/writing gmem using kvm_{read,write}_guest
  kvm: use slowpath in gfn_to_hva_cache if memory is private
  kvm: pfncache: enlighten about gmem
  kvm: x86: support walking guest page tables in gmem
  kvm: gmem: add option to remove guest private memory from direct map
  kvm: gmem: Temporarily restore direct map entries when needed
  mm: secretmem: use AS_INACCESSIBLE to prohibit GUP
  kvm: gmem: Allow restricted userspace mappings

 arch/x86/kvm/mmu/paging_tmpl.h |  94 +++++++++++++++++++-----
 include/linux/kvm_host.h       |   5 ++
 include/linux/kvm_types.h      |   1 +
 include/linux/secretmem.h      |  13 +++-
 include/uapi/linux/kvm.h       |   2 +
 mm/secretmem.c                 |   6 +-
 virt/kvm/guest_memfd.c         |  83 +++++++++++++++++++--
 virt/kvm/kvm_main.c            | 112 +++++++++++++++++++++++++++-
 virt/kvm/pfncache.c            | 130 +++++++++++++++++++++++++++++----
 9 files changed, 399 insertions(+), 47 deletions(-)


base-commit: 890a64810d59b1a58ed26efc28cfd821fc068e84

Comments

Vlastimil Babka (SUSE) July 22, 2024, 12:28 p.m. UTC | #1
On 7/9/24 3:20 PM, Patrick Roy wrote:
> Hey all,
> 
> This RFC series is a rough draft adding support for running
> non-confidential compute VMs in guest_memfd, based on prior discussions
> with Sean [1]. Our specific use case for this is the ability to unmap
> guest memory from the host kernel's direct map, as a mitigation against
> a large class of speculative execution issues.
> 
> === Implementation ===
> 
> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD`
> ioctl that removes the guest_memfd's pages from the direct map when they
> are allocated. When trying to run a guest from such a VM, we now face the
> problem that without either userspace or kernelspace mappings of
> guest_memfd, KVM cannot access guest memory to, for example, do MMIO
> emulation or access memory used for guest/host communication. We have
> multiple options for
> solving this when running non-CoCo VMs: (1) implement a TDX-light
> solution, where the guest shares memory that KVM needs to access, and
> relies on paravirtual solutions where this is not possible (e.g. MMIO),
> (2) have KVM use userspace mappings of guest_memfd (e.g. a
> memfd_secret-style solution), or (3) dynamically reinsert pages into the
> direct map whenever KVM wants to access them.
> 
> This RFC goes for option (3). Option (1) is a lot of overhead for very
> little gain, since we are not actually constrained by a physical
> inability to access guest memory (e.g. we are not in a TDX context where
> accesses to guest memory cause a #MC). Option (2) has previously been
> rejected [1].

Do the pages have to have the same address when they are temporarily mapped?
Wouldn't it be easier to do something similar to kmap_local_page() used for
HIMEM? I.e. you get a temporary kernel mapping to do what's needed, but it
doesn't have to alter the shared directmap.

Maybe that was already discussed somewhere as unsuitable but didn't spot it
here.

Patrick Roy July 26, 2024, 6:55 a.m. UTC | #2
On Mon, 2024-07-22 at 13:28 +0100, "Vlastimil Babka (SUSE)" wrote:
>> === Implementation ===
>>
>> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD`
>> ioctl that removes the guest_memfd's pages from the direct map when they
>> are allocated. When trying to run a guest from such a VM, we now face the
>> problem that without either userspace or kernelspace mappings of
>> guest_memfd, KVM cannot access guest memory to, for example, do MMIO
>> emulation or access memory used for guest/host communication. We have
>> multiple options for
>> solving this when running non-CoCo VMs: (1) implement a TDX-light
>> solution, where the guest shares memory that KVM needs to access, and
>> relies on paravirtual solutions where this is not possible (e.g. MMIO),
>> (2) have KVM use userspace mappings of guest_memfd (e.g. a
>> memfd_secret-style solution), or (3) dynamically reinsert pages into the
>> direct map whenever KVM wants to access them.
>>
>> This RFC goes for option (3). Option (1) is a lot of overhead for very
>> little gain, since we are not actually constrained by a physical
>> inability to access guest memory (e.g. we are not in a TDX context where
>> accesses to guest memory cause a #MC). Option (2) has previously been
>> rejected [1].
> 
> Do the pages have to have the same address when they are temporarily mapped?
> Wouldn't it be easier to do something similar to kmap_local_page() used for
> HIMEM? I.e. you get a temporary kernel mapping to do what's needed, but it
> doesn't have to alter the shared directmap.
> 
> Maybe that was already discussed somewhere as unsuitable but didn't spot it
> here.

For what I had prototyped here, there's no requirement to have the pages
mapped at the same address (I remember briefly looking at memremap to
achieve the temporary mappings, but since that doesn't work for normal
memory, I gave up on that path). However, I think guest_memfd is moving
in a direction where ranges marked as "in-place shared" (e.g. those
that are temporarily reinserted into the direct map in this RFC) should
be able to be GUP'd [1]. I think for that the direct map entries would
need to be present, right?

 
[1]: https://lore.kernel.org/kvm/489d1494-626c-40d9-89ec-4afc4cd0624b@redhat.com/T/#mc944a6fdcd20a35f654c2be99f9c91a117c1bed4
Yosry Ahmed July 26, 2024, 4:44 p.m. UTC | #3
On Tue, Jul 9, 2024 at 6:21 AM Patrick Roy <roypat@amazon.co.uk> wrote:
>
> Hey all,
>
> This RFC series is a rough draft adding support for running
> non-confidential compute VMs in guest_memfd, based on prior discussions
> with Sean [1]. Our specific use case for this is the ability to unmap
> guest memory from the host kernel's direct map, as a mitigation against
> a large class of speculative execution issues.

Not to sound like a salesman, but did you happen to come across the RFC for ASI?
https://lore.kernel.org/lkml/20240712-asi-rfc-24-v1-0-144b319a40d8@google.com/

The current implementation considers userspace allocations as
sensitive, so when a VM is running with ASI, the memory of other VMs
is unmapped from the direct map (i.e. in the restricted address
space). It also incorporates a mechanism to map this memory on-demand
when needed (i.e. switch to the unrestricted address space), and
running mitigations at this point to make sure it isn't exploited.

In theory, it should be a more generic approach because it should
apply to VMs that do not use guest_memfd as well, and it should be
extensible to protect other parts of memory (e.g. sensitive kernel
allocations).

I understand that unmapping guest_memfd memory from the direct map in
general could still be favorable, and for other reasons beyond
mitigating speculative execution attacks. Just thought you may be
interested in looking at ASI.
David Hildenbrand July 30, 2024, 10:17 a.m. UTC | #4
On 26.07.24 08:55, Patrick Roy wrote:
> 
> 
> On Mon, 2024-07-22 at 13:28 +0100, "Vlastimil Babka (SUSE)" wrote:
>>> === Implementation ===
>>>
>>> This patch series introduces a new flag to the `KVM_CREATE_GUEST_MEMFD`
>>> ioctl that removes the guest_memfd's pages from the direct map when they
>>> are allocated. When trying to run a guest from such a VM, we now face the
>>> problem that without either userspace or kernelspace mappings of
>>> guest_memfd, KVM cannot access guest memory to, for example, do MMIO
>>> emulation or access memory used for guest/host communication. We have
>>> multiple options for
>>> solving this when running non-CoCo VMs: (1) implement a TDX-light
>>> solution, where the guest shares memory that KVM needs to access, and
>>> relies on paravirtual solutions where this is not possible (e.g. MMIO),
>>> (2) have KVM use userspace mappings of guest_memfd (e.g. a
>>> memfd_secret-style solution), or (3) dynamically reinsert pages into the
>>> direct map whenever KVM wants to access them.
>>>
>>> This RFC goes for option (3). Option (1) is a lot of overhead for very
>>> little gain, since we are not actually constrained by a physical
>>> inability to access guest memory (e.g. we are not in a TDX context where
>>> accesses to guest memory cause a #MC). Option (2) has previously been
>>> rejected [1].
>>
>> Do the pages have to have the same address when they are temporarily mapped?
>> Wouldn't it be easier to do something similar to kmap_local_page() used for
>> HIMEM? I.e. you get a temporary kernel mapping to do what's needed, but it
>> doesn't have to alter the shared directmap.
>>
>> Maybe that was already discussed somewhere as unsuitable but didn't spot it
>> here.
> 
> For what I had prototyped here, there's no requirement to have the pages
> mapped at the same address (I remember briefly looking at memremap to
> achieve the temporary mappings, but since that doesn't work for normal
> memory, I gave up on that path). However, I think guest_memfd is moving
> in a direction where ranges marked as "in-place shared" (e.g. those
> that are temporarily reinserted into the direct map in this RFC) should
> be able to be GUP'd [1]. I think for that the direct map entries would
> need to be present, right?

Yes, we'd allow GUP. Of course, one could think of a similar extension 
like secretmem that would allow shared memory to get mapped into user 
page tables but would disallow any GUP on (shared) guest_memfd memory.