mbox series

[RFC,v3,0/6] Direct Map Removal for guest_memfd

Message ID 20241030134912.515725-1-roypat@amazon.co.uk (mailing list archive)
Headers show
Series Direct Map Removal for guest_memfd | expand

Message

Patrick Roy Oct. 30, 2024, 1:49 p.m. UTC
Unmapping virtual machine guest memory from the host kernel's direct map
is a successful mitigation against Spectre-style transient execution
issues: If the kernel page tables do not contain entries pointing to
guest memory, then any attempted speculative read through the direct map
will necessarily be blocked by the MMU before any observable
microarchitectural side-effects happen. This means that Spectre-gadgets
and similar cannot be used to target virtual machine memory. Roughly 60%
of speculative execution issues fall into this category [1, Table 1].

This patch series extends guest_memfd with the ability to remove its
memory from the host kernel's direct map, to be able to attain the above
protection for KVM guests running inside guest_memfd.

=== Changes to v2 ===

- Handle direct map removal for physically contiguous pages in arch code
  (Mike R.)
- Track the direct map state in guest_memfd itself instead of at the
  folio level, to prepare for huge pages support (Sean C.)
- Allow configuring direct map state of not-yet faulted in memory
  (Vishal A.)
- Pay attention to alignment in ftrace structs (Steven R.)

Most significantly, I've reduced the patch series to focus only on
direct map removal for guest_memfd for now, leaving the whole "how to do
non-CoCo VMs in guest_memfd" for later. If this separation is
acceptable, then I think I can drop the RFC tag in the next revision
(I've mainly kept it here because I'm not entirely sure what to do with
patches 3 and 4).

=== Implementation ===

This patch series introduces a new flag to the KVM_CREATE_GUEST_MEMFD
that causes guest_memfd to remove its pages from the host kernel's
direct map immediately after population/preparation.  It also adds
infrastructure for tracking the direct map state of all gmem folios
inside the guest_memfd inode. Storing this information in the inode has
the advantage that the code is ready for future hugepages extensions,
where only removing/reinserting direct map entries for sub-ranges of a
huge folio is a valid usecase, and it allows pre-configuring the direct
map state of not-yet faulted in parts of memory (for example, when the
VMM is receiving a RX virtio buffer from the guest).

=== Summary ===

Patch 1 (from Mike Rapoport) adds arch APIs for manipulating the direct
map for ranges of physically contiguous pages, which are used by
guest_memfd in follow up patches. Patch 2 adds the
KVM_GMEM_NO_DIRECT_MAP flag and the logic for configuring direct map
state of freshly prepared folios. Patches 3 and 4 mainly serve an
illustrative purpose, to show how the framework from patch 2 can be
extended with routines for runtime direct map manipulation. Patches 5
and 6 deal with documentation and self-tests respectively.

[1]: https://download.vusec.net/papers/quarantine_raid23.pdf
[RFC v1]: https://lore.kernel.org/kvm/20240709132041.3625501-1-roypat@amazon.co.uk/
[RFC v2]: https://lore.kernel.org/kvm/20240910163038.1298452-1-roypat@amazon.co.uk/

Mike Rapoport (Microsoft) (1):
  arch: introduce set_direct_map_valid_noflush()

Patrick Roy (5):
  kvm: gmem: add flag to remove memory from kernel direct map
  kvm: gmem: implement direct map manipulation routines
  kvm: gmem: add trace point for direct map state changes
  kvm: document KVM_GMEM_NO_DIRECT_MAP flag
  kvm: selftests: run gmem tests with KVM_GMEM_NO_DIRECT_MAP set

 Documentation/virt/kvm/api.rst                |  14 +
 arch/arm64/include/asm/set_memory.h           |   1 +
 arch/arm64/mm/pageattr.c                      |  10 +
 arch/loongarch/include/asm/set_memory.h       |   1 +
 arch/loongarch/mm/pageattr.c                  |  21 ++
 arch/riscv/include/asm/set_memory.h           |   1 +
 arch/riscv/mm/pageattr.c                      |  15 +
 arch/s390/include/asm/set_memory.h            |   1 +
 arch/s390/mm/pageattr.c                       |  11 +
 arch/x86/include/asm/set_memory.h             |   1 +
 arch/x86/mm/pat/set_memory.c                  |   8 +
 include/linux/set_memory.h                    |   6 +
 include/trace/events/kvm.h                    |  22 ++
 include/uapi/linux/kvm.h                      |   2 +
 .../testing/selftests/kvm/guest_memfd_test.c  |   2 +-
 .../kvm/x86_64/private_mem_conversions_test.c |   7 +-
 virt/kvm/guest_memfd.c                        | 280 +++++++++++++++++-
 17 files changed, 384 insertions(+), 19 deletions(-)


base-commit: 5cb1659f412041e4780f2e8ee49b2e03728a2ba6

Comments

David Hildenbrand Oct. 31, 2024, 9:50 a.m. UTC | #1
On 30.10.24 14:49, Patrick Roy wrote:
> Unmapping virtual machine guest memory from the host kernel's direct map
> is a successful mitigation against Spectre-style transient execution
> issues: If the kernel page tables do not contain entries pointing to
> guest memory, then any attempted speculative read through the direct map
> will necessarily be blocked by the MMU before any observable
> microarchitectural side-effects happen. This means that Spectre-gadgets
> and similar cannot be used to target virtual machine memory. Roughly 60%
> of speculative execution issues fall into this category [1, Table 1].
> 
> This patch series extends guest_memfd with the ability to remove its
> memory from the host kernel's direct map, to be able to attain the above
> protection for KVM guests running inside guest_memfd.
> 
> === Changes to v2 ===
> 
> - Handle direct map removal for physically contiguous pages in arch code
>    (Mike R.)
> - Track the direct map state in guest_memfd itself instead of at the
>    folio level, to prepare for huge pages support (Sean C.)
> - Allow configuring direct map state of not-yet faulted in memory
>    (Vishal A.)
> - Pay attention to alignment in ftrace structs (Steven R.)
> 
> Most significantly, I've reduced the patch series to focus only on
> direct map removal for guest_memfd for now, leaving the whole "how to do
> non-CoCo VMs in guest_memfd" for later. If this separation is
> acceptable, then I think I can drop the RFC tag in the next revision
> (I've mainly kept it here because I'm not entirely sure what to do with
> patches 3 and 4).

Hi,

keeping upcoming "shared and private memory in guest_memfd" in mind, I 
assume the focus would be to only remove the direct map for private memory?

So in the current upstream state, you would only be removing the direct 
map for private memory, currently translating to "encrypted"/"protected" 
memory that is inaccessible either way already.

Correct?
Patrick Roy Oct. 31, 2024, 10:42 a.m. UTC | #2
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
> On 30.10.24 14:49, Patrick Roy wrote:
>> Unmapping virtual machine guest memory from the host kernel's direct map
>> is a successful mitigation against Spectre-style transient execution
>> issues: If the kernel page tables do not contain entries pointing to
>> guest memory, then any attempted speculative read through the direct map
>> will necessarily be blocked by the MMU before any observable
>> microarchitectural side-effects happen. This means that Spectre-gadgets
>> and similar cannot be used to target virtual machine memory. Roughly 60%
>> of speculative execution issues fall into this category [1, Table 1].
>>
>> This patch series extends guest_memfd with the ability to remove its
>> memory from the host kernel's direct map, to be able to attain the above
>> protection for KVM guests running inside guest_memfd.
>>
>> === Changes to v2 ===
>>
>> - Handle direct map removal for physically contiguous pages in arch code
>>    (Mike R.)
>> - Track the direct map state in guest_memfd itself instead of at the
>>    folio level, to prepare for huge pages support (Sean C.)
>> - Allow configuring direct map state of not-yet faulted in memory
>>    (Vishal A.)
>> - Pay attention to alignment in ftrace structs (Steven R.)
>>
>> Most significantly, I've reduced the patch series to focus only on
>> direct map removal for guest_memfd for now, leaving the whole "how to do
>> non-CoCo VMs in guest_memfd" for later. If this separation is
>> acceptable, then I think I can drop the RFC tag in the next revision
>> (I've mainly kept it here because I'm not entirely sure what to do with
>> patches 3 and 4).
> 
> Hi,
> 
> keeping upcoming "shared and private memory in guest_memfd" in mind, I
> assume the focus would be to only remove the direct map for private memory?
> 
> So in the current upstream state, you would only be removing the direct
> map for private memory, currently translating to "encrypted"/"protected"
> memory that is inaccessible either way already.
> 
> Correct?

Yea, with the upcomming "shared and private" stuff, I would expect the
the shared<->private conversions would call the routines from patch 3 to
restore direct map entries on private->shared, and zap them on
shared->private.

But as you said, the current upstream state has no notion of "shared"
memory in guest_memfd, so everything is private and thus everything is
direct map removed (although it is indeed already inaccessible anyway
for TDX and friends. That's what makes this patch series a bit awkward
:( )

> -- 
> Cheers,
> 
> David / dhildenb
> 

Best, 
Patrick
Manwaring, Derek Nov. 1, 2024, 12:10 a.m. UTC | #3
On 2024-10-31 at 10:42+0000 Patrick Roy wrote:
> On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote:
> > On 30.10.24 14:49, Patrick Roy wrote:
> >> Most significantly, I've reduced the patch series to focus only on
> >> direct map removal for guest_memfd for now, leaving the whole "how to do
> >> non-CoCo VMs in guest_memfd" for later. If this separation is
> >> acceptable, then I think I can drop the RFC tag in the next revision
> >> (I've mainly kept it here because I'm not entirely sure what to do with
> >> patches 3 and 4).
> >
> > Hi,
> >
> > keeping upcoming "shared and private memory in guest_memfd" in mind, I
> > assume the focus would be to only remove the direct map for private memory?
> >
> > So in the current upstream state, you would only be removing the direct
> > map for private memory, currently translating to "encrypted"/"protected"
> > memory that is inaccessible either way already.
> >
> > Correct?
>
> Yea, with the upcomming "shared and private" stuff, I would expect the
> the shared<->private conversions would call the routines from patch 3 to
> restore direct map entries on private->shared, and zap them on
> shared->private.
>
> But as you said, the current upstream state has no notion of "shared"
> memory in guest_memfd, so everything is private and thus everything is
> direct map removed (although it is indeed already inaccessible anyway
> for TDX and friends. That's what makes this patch series a bit awkward
> :( )

TDX and SEV encryption happens between the core and main memory, so
cached guest data we're most concerned about for transient execution
attacks isn't necessarily inaccessible.

I'd be interested what Intel, AMD, and other folks think on this, but I
think direct map removal is worthwhile for CoCo cases as well.

Derek