[RFC,00/16] KVM protected memory extension

Message ID 20200522125214.31348-1-kirill.shutemov@linux.intel.com (mailing list archive)
Series: KVM protected memory extension

Message

Kirill A. Shutemov May 22, 2020, 12:51 p.m. UTC
== Background / Problem ==

There are a number of hardware features (MKTME, SEV) which protect guest
memory from some unauthorized host access. The patchset proposes a purely
software feature that mitigates some of the same host-side read-only
attacks.


== What does this set mitigate? ==

 - Host kernel ”accidental” access to guest data (think speculation)

 - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))

 - Host userspace access to guest data (compromised qemu)

== What does this set NOT mitigate? ==

 - Full host kernel compromise.  Kernel will just map the pages again.

 - Hardware attacks


The patchset is RFC-quality: it works but has known issues that must be
addressed before it can be considered for applying.

We are looking for high-level feedback on the concept.  Some open
questions:

 - This protects from some kernel and host userspace read-only attacks,
   but does not place the host kernel outside the trust boundary. Is it
   still valuable?

 - Can this approach be used to avoid cache-coherency problems with
   hardware encryption schemes that repurpose physical bits?

 - The guest kernel must be modified for this to work.  Is that a deal
   breaker, especially for public clouds?

 - Are the costs of removing pages from the direct map too high to be
   feasible?

== Series Overview ==

The hardware features protect guest data by encrypting it and then
ensuring that only the right guest can decrypt it.  This has the
side-effect of making the kernel direct map and userspace mapping
(QEMU et al) useless.  But, this teaches us something very useful:
neither the kernel nor the userspace mappings are really necessary for
normal guest operations.

Instead of using encryption, this series simply unmaps the memory. One
advantage compared to allowing access to ciphertext is that it allows bad
accesses to be caught instead of simply reading garbage.

Protection from physical attacks needs to be provided by some other means.
On Intel platforms, (single-key) Total Memory Encryption (TME) provides
mitigation against physical attacks, such as DIMM interposers sniffing
memory bus traffic.

The patchset modifies both the host and the guest kernel. The guest OS must
enable the feature via a hypercall and mark any memory range that has to be
shared with the host: DMA regions, bounce buffers, etc. SEV does this
marking via a bit in the guest's page table, while this approach uses a
hypercall.
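
For illustration, here is a guest-side sketch of roughly what this could
look like. The hypercall names and numbers below are placeholders rather
than the series' actual ABI; kvm_hypercall0()/kvm_hypercall2() are the
existing x86 guest helpers:

#include <linux/pfn.h>
#include <asm/kvm_para.h>

/* Placeholder hypercall numbers, for illustration only. */
#define KVM_HC_ENABLE_MEM_PROTECTED	70
#define KVM_HC_MEM_SHARE		71

/* Ask the host to stop mapping the guest's memory. */
static long kvm_mem_protect_enable(void)
{
	return kvm_hypercall0(KVM_HC_ENABLE_MEM_PROTECTED);
}

/* Mark a physical range (DMA region, bounce buffer, ...) as shared. */
static long kvm_mem_share(phys_addr_t addr, size_t size)
{
	unsigned long gfn = PHYS_PFN(addr);
	unsigned long npages = PFN_UP(addr + size) - gfn;

	return kvm_hypercall2(KVM_HC_MEM_SHARE, gfn, npages);
}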

To remove the userspace mapping, the series uses a trick similar to what
NUMA balancing does: memory that belongs to KVM memory slots is converted
to PROT_NONE. All existing entries are converted to PROT_NONE with
mprotect() and newly faulted-in pages get PROT_NONE from the updated
vm_page_prot.
The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
VMA must be treated in a special way in the GUP and fault paths. The flag
allows GUP to return the page even though it is mapped with PROT_NONE, but
only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
would result in -EFAULT.
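
Roughly, the intended GUP-path semantics could be sketched like this (a
simplified illustration, not the actual diff; VM_KVM_PROTECTED and
FOLL_KVM are the new flags this series introduces):

#include <linux/mm.h>
#include <linux/err.h>

/* Decide what a follow_page()-style helper should return for a page
 * found in a protected VMA. */
static struct page *check_kvm_protected(struct vm_area_struct *vma,
					struct page *page,
					unsigned int gup_flags)
{
	/* Not a KVM-protected VMA: nothing special to do. */
	if (!(vma->vm_flags & VM_KVM_PROTECTED))
		return page;

	/* Blessed KVM codepaths pass FOLL_KVM and may see the page. */
	if (gup_flags & FOLL_KVM)
		return page;

	/* Everyone else is treated as for an ordinary PROT_NONE mapping. */
	return ERR_PTR(-EFAULT);
}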

Any anonymous page faulted into a VM_KVM_PROTECTED VMA gets removed from
the direct mapping with kernel_map_pages(). Note that kernel_map_pages()
only flushes the local TLB. I think it's a reasonable compromise between
security and performance.
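
A minimal sketch of the direct-map handling, one 4k page at a time
(kernel_map_pages() is otherwise tied to CONFIG_DEBUG_PAGEALLOC, so the
series presumably relaxes that dependency; the helpers below are
illustrative, not the actual patch):

#include <linux/mm.h>

/* Drop a freshly faulted anonymous page from the kernel direct map.
 * Only the local TLB is flushed, as noted above. */
static void guest_page_protect(struct page *page)
{
	kernel_map_pages(page, 1, 0);
}

/* Put the page back into the direct map, e.g. when the PTE is zapped
 * and the page is about to be cleared and reused. */
static void guest_page_unprotect(struct page *page)
{
	kernel_map_pages(page, 1, 1);
}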

Zapping the PTE brings the page back to the direct mapping after clearing.
At least for now, we don't remove file-backed pages from the direct
mapping: file-backed pages could be accessed via read/write syscalls, and
handling that would add complexity.

Occasionally, the host kernel has to access guest memory that was not made
shared by the guest; this happens, for instance, for instruction emulation.
Normally it is done via copy_to/from_user(), which would now fail with
-EFAULT. We introduce a new pair of helpers: copy_to/from_guest(). The new
helpers acquire the page via GUP, map it into the kernel address space with
a kmap_atomic()-style mechanism and only then copy the data.
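
A sketch of how such a helper could be structured, reduced to a single
page for clarity (the helper name and the single-page simplification are
mine; FOLL_KVM comes from this series):

#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/string.h>

static int copy_from_guest_page(void *to, unsigned long guest_va, int len)
{
	struct page *page;
	void *vaddr;
	long pinned;

	/* Keep the sketch single-page. */
	if (offset_in_page(guest_va) + len > PAGE_SIZE)
		return -EINVAL;

	/* FOLL_KVM lets GUP look past the PROT_NONE protection. */
	pinned = get_user_pages_unlocked(guest_va, 1, &page, FOLL_KVM);
	if (pinned != 1)
		return pinned < 0 ? pinned : -EFAULT;

	/* Temporarily map the page: the direct mapping is gone. */
	vaddr = kmap_atomic(page);
	memcpy(to, vaddr + offset_in_page(guest_va), len);
	kunmap_atomic(vaddr);

	put_page(page);
	return 0;
}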

For some instruction emulation copying is not good enough: cmpxchg
emulation has to have direct access to the guest memory. __kvm_map_gfn()
is modified to accommodate the case.
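
For context, here is a hedged sketch of how an emulation path that needs a
real pointer into guest memory could use the kvm_host_map API that sits on
top of __kvm_map_gfn(); the function below is illustrative, not code from
the series:

#include <linux/kvm_host.h>
#include <linux/atomic.h>

static int emulate_cmpxchg_u64(struct kvm_vcpu *vcpu, gpa_t gpa,
			       u64 old, u64 new)
{
	struct kvm_host_map map;
	u64 *ptr;
	int ret = 0;

	/* Maps the gfn; with this series the host may have to bring the
	 * page back temporarily if it is protected. */
	if (kvm_vcpu_map(vcpu, gpa_to_gfn(gpa), &map))
		return -EFAULT;

	ptr = map.hva + offset_in_page(gpa);
	if (cmpxchg64(ptr, old, new) != old)
		ret = -EAGAIN;

	/* 'true': the page may have been written, mark it dirty. */
	kvm_vcpu_unmap(vcpu, &map, true);
	return ret;
}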

The patchset is on top of v5.7-rc6 plus this patch:

https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com

== Open Issues ==

Unmapping the pages from the direct mapping brings a few issues that have
not been rectified yet:

 - Touching the direct mapping leads to fragmentation. We need to be able to
   recover from it. I have a buggy patch that aims at recovering 2M/1G pages.
   It has to be fixed and tested properly.

 - Page migration and KSM are not supported yet.

 - Live migration of a guest would require a new flow. Not sure yet what it
   would look like.

 - The feature interferes with NUMA balancing. Not sure yet if it's
   possible to make them work together.

 - Guests have no mechanism to ensure that even a well-behaving host has
   unmapped its private data.  With SEV, for instance, the guest only has
   to trust the hardware to encrypt a page after the C bit is set in a
   guest PTE.  A mechanism for a guest to query the host mapping state, or
   to constantly assert the intent for a page to be Private would be
   valuable.

Kirill A. Shutemov (16):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvm: Make VirtIO use DMA API in KVM guest
  KVM: Use GUP instead of copy_from/to_user() to access guest memory
  KVM: mm: Introduce VM_KVM_PROTECTED
  KVM: x86: Use GUP for page walk instead of __get_user()
  KVM: Protected memory extension
  KVM: x86: Enabled protected memory extension
  KVM: Rework copy_to/from_guest() to avoid direct mapping
  x86/kvm: Share steal time page with host
  x86/kvmclock: Share hvclock memory with the host
  KVM: Introduce gfn_to_pfn_memslot_protected()
  KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  KVM: Unmap protected pages from direct mapping

 arch/powerpc/kvm/book3s_64_mmu_hv.c    |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
 arch/x86/Kconfig                       |  11 +-
 arch/x86/include/asm/io.h              |   6 +-
 arch/x86/include/asm/kvm_para.h        |   5 +
 arch/x86/include/asm/pgtable_types.h   |   1 +
 arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
 arch/x86/kernel/kvm.c                  |  24 +-
 arch/x86/kernel/kvmclock.c             |   2 +-
 arch/x86/kernel/pci-swiotlb.c          |   3 +-
 arch/x86/kvm/cpuid.c                   |   3 +
 arch/x86/kvm/mmu/mmu.c                 |   6 +-
 arch/x86/kvm/mmu/paging_tmpl.h         |  10 +-
 arch/x86/kvm/x86.c                     |   9 +
 arch/x86/mm/Makefile                   |   2 +
 arch/x86/mm/ioremap.c                  |  16 +-
 arch/x86/mm/mem_encrypt.c              |  50 ----
 arch/x86/mm/mem_encrypt_common.c       |  62 ++++
 arch/x86/mm/pat/set_memory.c           |   8 +
 drivers/virtio/virtio_ring.c           |   4 +
 include/linux/kvm_host.h               |  14 +-
 include/linux/mm.h                     |  12 +
 include/uapi/linux/kvm_para.h          |   5 +-
 mm/gup.c                               |  20 +-
 mm/huge_memory.c                       |  29 +-
 mm/ksm.c                               |   3 +
 mm/memory.c                            |  16 +
 mm/mmap.c                              |   3 +
 mm/mprotect.c                          |   1 +
 mm/rmap.c                              |   4 +
 virt/kvm/async_pf.c                    |   4 +-
 virt/kvm/kvm_main.c                    | 390 +++++++++++++++++++++++--
 32 files changed, 627 insertions(+), 103 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c

Comments

kirill.shutemov@linux.intel.com May 25, 2020, 5:27 a.m. UTC | #1
On Fri, May 22, 2020 at 03:51:58PM +0300, Kirill A. Shutemov wrote:
> == Background / Problem ==
> 
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.

CC people who worked on the related patchsets.
 
> == What does this set mitigate? ==
> 
>  - Host kernel ”accidental” access to guest data (think speculation)
> 
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> 
>  - Host userspace access to guest data (compromised qemu)
> 
> == What does this set NOT mitigate? ==
> 
>  - Full host kernel compromise.  Kernel will just map the pages again.
> 
>  - Hardware attacks
> 
> 
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
> 
> We are looking for high-level feedback on the concept.  Some open
> questions:
> 
>  - This protects from some kernel and host userspace read-only attacks,
>    but does not place the host kernel outside the trust boundary. Is it
>    still valuable?
> 
>  - Can this approach be used to avoid cache-coherency problems with
>    hardware encryption schemes that repurpose physical bits?
> 
>  - The guest kernel must be modified for this to work.  Is that a deal
>    breaker, especially for public clouds?
> 
>  - Are the costs of removing pages from the direct map too high to be
>    feasible?
> 
> == Series Overview ==
> 
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it.  This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless.  But, this teaches us something very useful:
> neither the kernel or userspace mappings are really necessary for normal
> guest operations.
> 
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
> 
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
> 
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest’s page table while this approach uses a hypercall.
> 
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
> 
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> perfromance.
> 
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
> 
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> kmap_atomic()-style mechanism and only then copy the data.
> 
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
> 
> The patchset is on top of v5.7-rc6 plus this patch:
> 
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
> 
> == Open Issues ==
> 
> Unmapping the pages from direct mapping bring a few of issues that have
> not rectified yet:
> 
>  - Touching direct mapping leads to fragmentation. We need to be able to
>    recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>    It has to be fixed and tested properly
> 
>  - Page migration and KSM is not supported yet.
> 
>  - Live migration of a guest would require a new flow. Not sure yet how it
>    would look like.
> 
>  - The feature interfere with NUMA balancing. Not sure yet if it's
>    possible to make them work together.
> 
>  - Guests have no mechanism to ensure that even a well-behaving host has
>    unmapped its private data.  With SEV, for instance, the guest only has
>    to trust the hardware to encrypt a page after the C bit is set in a
>    guest PTE.  A mechanism for a guest to query the host mapping state, or
>    to constantly assert the intent for a page to be Private would be
>    valuable.
Liran Alon May 25, 2020, 1:47 p.m. UTC | #2
On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
>
>
> == What does this set mitigate? ==
>
>   - Host kernel ”accidental” access to guest data (think speculation)

Just to clarify: this is any host kernel memory info-leak vulnerability,
not just speculative-execution memory info-leaks but also architectural
ones.

In addition, note that removing guest data from the host kernel VA space 
also makes guest<->host memory exploits more difficult.
E.g. the guest cannot use an already available memory buffer in the kernel
VA space for ROP, or place valuable guest-controlled code/data in general.

>
>   - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
>   - Host userspace access to guest data (compromised qemu)

I don't quite understand what is the benefit of preventing userspace VMM 
access to guest data while the host kernel can still access it.

QEMU is more easily compromised than the host kernel because its 
guest<->host attack surface is larger (e.g. various device emulation).
But this compromise comes from the guest itself, not other guests. This is
in contrast to the host kernel attack surface, where an info-leak can
be exploited from one guest to leak another guest's data.
>
> == What does this set NOT mitigate? ==
>
>   - Full host kernel compromise.  Kernel will just map the pages again.
>
>   - Hardware attacks
>
>
> The patchset is RFC-quality: it works but has known issues that must be
> addressed before it can be considered for applying.
>
> We are looking for high-level feedback on the concept.  Some open
> questions:
>
>   - This protects from some kernel and host userspace read-only attacks,
>     but does not place the host kernel outside the trust boundary. Is it
>     still valuable?
I don't currently see a good argument for preventing host userspace 
access to guest data while the host kernel can still access it.
But there is definitely a strong benefit in mitigating kernel info-leaks 
exploitable from one guest to leak another guest's data.
>
>   - Can this approach be used to avoid cache-coherency problems with
>     hardware encryption schemes that repurpose physical bits?
>
>   - The guest kernel must be modified for this to work.  Is that a deal
>     breaker, especially for public clouds?
>
>   - Are the costs of removing pages from the direct map too high to be
>     feasible?

If I remember correctly, this perf cost was too high when the XPFO 
(eXclusive Page Frame Ownership) patch-series was considered.
It created two major perf costs:
1) Removing pages from the direct map prevented the direct map from simply 
being mapped entirely as 1GB huge-pages.
2) Frequent allocation/free of userspace pages resulted in frequent TLB 
invalidations.

Having said that, (1) can be mitigated if guest data is completely 
allocated from 1GB hugetlbfs, which guarantees it will not create smaller
holes in the direct map. And (2) is not relevant for the QEMU/KVM use-case.

This makes me wonder:
The XPFO patch-series, applied in the context of QEMU/KVM, seems to provide 
exactly the functionality of this patch-series, with the exception of the
additional "feature" of preventing guest data from also being accessible to
the host userspace VMM.
i.e. XPFO will unmap guest pages from the host kernel direct-map while still 
keeping them mapped in host userspace VMM page-tables.

If I understand correctly, this "feature" is what brings most of the 
extra complexity of this patch-series compared to XPFO.
It requires guest modifications to explicitly tell the host which pages 
can be accessed by the userspace VMM, it requires changes to add the new
VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it creates issues with
Live-Migration support.

So if there is no strong convincing argument for the motivation to 
prevent userspace VMM access to guest data *while host kernel
can still access guest data*, I don't see a good reason for using this 
approach.

Furthermore, I would like to point out that just unmapping guest data 
from the kernel direct-map is not sufficient to prevent all
guest-to-guest info-leaks via a kernel memory info-leak vulnerability. 
This is because the host kernel VA space has other regions which contain
guest-sensitive data. For example, the KVM per-vCPU struct (which holds
vCPU state) is allocated on the slab and is therefore still leakable.

I recommend you have a look at my (and Alexandre Chartre's) KVM Forum 
2019 talk on KVM ASI, which provides extensive background on the various
attempts done by the community to mitigate host kernel memory info-leaks
exploitable by a guest to leak other guests' data:
https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf

>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it.  This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless.  But, this teaches us something very useful:
> neither the kernel or userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest’s page table while this approach uses a hypercall.
>
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> perfromance.
>
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
>
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> kmap_atomic()-style mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.7-rc6 plus this patch:
>
> https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
>
> == Open Issues ==
>
> Unmapping the pages from direct mapping bring a few of issues that have
> not rectified yet:
>
>   - Touching direct mapping leads to fragmentation. We need to be able to
>     recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>     It has to be fixed and tested properly
As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs 
will lead to holes in the kernel direct-map which force it to no longer be 
mapped as a series of 1GB huge-pages.
This has a non-trivial performance cost. Thus, I am not sure addressing 
this use-case is valuable.
>
>   - Page migration and KSM is not supported yet.
>
>   - Live migration of a guest would require a new flow. Not sure yet how it
>     would look like.

Note that the Live-Migration issue is a result of not making guest data 
accessible to the host userspace VMM.

-Liran

>
>   - The feature interfere with NUMA balancing. Not sure yet if it's
>     possible to make them work together.
>
>   - Guests have no mechanism to ensure that even a well-behaving host has
>     unmapped its private data.  With SEV, for instance, the guest only has
>     to trust the hardware to encrypt a page after the C bit is set in a
>     guest PTE.  A mechanism for a guest to query the host mapping state, or
>     to constantly assert the intent for a page to be Private would be
>     valuable.
Kirill A. Shutemov May 25, 2020, 2:46 p.m. UTC | #3
On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> 
> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> > == Background / Problem ==
> > 
> > There are a number of hardware features (MKTME, SEV) which protect guest
> > memory from some unauthorized host access. The patchset proposes a purely
> > software feature that mitigates some of the same host-side read-only
> > attacks.
> > 
> > 
> > == What does this set mitigate? ==
> > 
> >   - Host kernel ”accidental” access to guest data (think speculation)
> 
> Just to clarify: This is any host kernel memory info-leak vulnerability. Not
> just speculative execution memory info-leaks. Also architectural ones.
> 
> In addition, note that removing guest data from host kernel VA space also
> makes guest<->host memory exploits more difficult.
> E.g. Guest cannot use already available memory buffer in kernel VA space for
> ROP or placing valuable guest-controlled code/data in general.
> 
> > 
> >   - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> > 
> >   - Host userspace access to guest data (compromised qemu)
> 
> I don't quite understand what is the benefit of preventing userspace VMM
> access to guest data while the host kernel can still access it.

Let me clarify: the guest memory mapped into host userspace is not
accessible by either the host kernel or userspace. The host still has a way
to access it via a new interface: GUP with FOLL_KVM. GUP will give you a
struct page that the kernel has to map (temporarily) if it needs to access
the data. So only blessed codepaths would know how to deal with the memory.

It can help prevent some host->guest attacks on a compromised host: if a
VM has successfully attacked the host, it cannot attack other VMs as
easily.

It would also help to protect against guest->host attacks by removing one
more place where the guest's data is mapped on the host.

> QEMU is more easily compromised than the host kernel because it's
> guest<->host attack surface is larger (E.g. Various device emulation).
> But this compromise comes from the guest itself. Not other guests. In
> contrast to host kernel attack surface, which an info-leak there can
> be exploited from one guest to leak another guest data.

Consider the case when an unprivileged guest user exploits a bug in QEMU
device emulation to gain access to data it cannot normally access within
the guest. With the feature it would be able to see only the shared regions
of guest memory, such as DMA and IO buffers, but not the rest.

> > 
> > == What does this set NOT mitigate? ==
> > 
> >   - Full host kernel compromise.  Kernel will just map the pages again.
> > 
> >   - Hardware attacks
> > 
> > 
> > The patchset is RFC-quality: it works but has known issues that must be
> > addressed before it can be considered for applying.
> > 
> > We are looking for high-level feedback on the concept.  Some open
> > questions:
> > 
> >   - This protects from some kernel and host userspace read-only attacks,
> >     but does not place the host kernel outside the trust boundary. Is it
> >     still valuable?
> I don't currently see a good argument for preventing host userspace access
> to guest data while host kernel can still access it.
> But there is definitely strong benefit of mitigating kernel info-leaks
> exploitable from one guest to leak another guest data.
> > 
> >   - Can this approach be used to avoid cache-coherency problems with
> >     hardware encryption schemes that repurpose physical bits?
> > 
> >   - The guest kernel must be modified for this to work.  Is that a deal
> >     breaker, especially for public clouds?
> > 
> >   - Are the costs of removing pages from the direct map too high to be
> >     feasible?
> 
> If I remember correctly, this perf cost was too high when considering XPFO
> (eXclusive Page Frame Ownership) patch-series.
> This created two major perf costs:
> 1) Removing pages from direct-map prevented direct-map from simply be
> entirely mapped as 1GB huge-pages.
> 2) Frequent allocation/free of userspace pages resulted in frequent TLB
> invalidations.
> 
> Having said that, (1) can be mitigated in case guest data is completely
> allocated from 1GB hugetlbfs to guarantee it will not
> create smaller holes in direct-map. And (2) is not relevant for QEMU/KVM
> use-case.

I'm too invested in THP to give it up for the ugly hugetlbfs. I think we
can do better :)

> This makes me wonder:
> XPFO patch-series, applied to the context of QEMU/KVM, seems to provide
> exactly the functionality of this patch-series,
> with the exception of the additional "feature" of preventing guest data from
> also being accessible to host userspace VMM.
> i.e. XPFO will unmap guest pages from host kernel direct-map while still
> keeping them mapped in host userspace VMM page-tables.
> 
> If I understand correctly, this "feature" is what brings most of the extra
> complexity of this patch-series compared to XPFO.
> It requires guest modification to explicitly specify to host which pages can
> be accessed by userspace VMM, it requires
> changes to add new VM_KVM_PROTECTED VMA flag & FOLL_KVM for GUP, and it
> creates issues with Live-Migration support.
> 
> So if there is no strong convincing argument for the motivation to prevent
> userspace VMM access to guest data *while host kernel
> can still access guest data*, I don't see a good reason for using this
> approach.

Well, I disagree with you here. See few points above.

> Furthermore, I would like to point out that just unmapping guest data from
> kernel direct-map is not sufficient to prevent all
> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
> is because host kernel VA space have other regions
> which contains guest sensitive data. For example, KVM per-vCPU struct (which
> holds vCPU state) is allocated on slab and therefore
> still leakable.
> 
> I recommend you will have a look at my (and Alexandre Charte) KVM Forum 2019
> talk on KVM ASI which provides extensive background
> on the various attempts done by the community for mitigating host kernel
> memory info-leaks exploitable by guest to leak other guests data:
> https://static.sched.com/hosted_files/kvmforum2019/34/KVM%20Forum%202019%20KVM%20ASI.pdf

Thanks, I'll read it up.

> > == Series Overview ==
> > 
> > The hardware features protect guest data by encrypting it and then
> > ensuring that only the right guest can decrypt it.  This has the
> > side-effect of making the kernel direct map and userspace mapping
> > (QEMU et al) useless.  But, this teaches us something very useful:
> > neither the kernel or userspace mappings are really necessary for normal
> > guest operations.
> > 
> > Instead of using encryption, this series simply unmaps the memory. One
> > advantage compared to allowing access to ciphertext is that it allows bad
> > accesses to be caught instead of simply reading garbage.
> > 
> > Protection from physical attacks needs to be provided by some other means.
> > On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> > mitigation against physical attacks, such as DIMM interposers sniffing
> > memory bus traffic.
> > 
> > The patchset modifies both host and guest kernel. The guest OS must enable
> > the feature via hypercall and mark any memory range that has to be shared
> > with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> > bit in the guest’s page table while this approach uses a hypercall.
> > 
> > For removing the userspace mapping, use a trick similar to what NUMA
> > balancing does: convert memory that belongs to KVM memory slots to
> > PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> > the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> > The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> > VMA must be treated in a special way in the GUP and fault paths. The flag
> > allows GUP to return the page even though it is mapped with PROT_NONE, but
> > only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> > to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> > would result in -EFAULT.
> > 
> > Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> > the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> > flushes local TLB. I think it's a reasonable compromise between security and
> > perfromance.
> > 
> > Zapping the PTE would bring the page back to the direct mapping after clearing.
> > At least for now, we don't remove file-backed pages from the direct mapping.
> > File-backed pages could be accessed via read/write syscalls. It adds
> > complexity.
> > 
> > Occasionally, host kernel has to access guest memory that was not made
> > shared by the guest. For instance, it happens for instruction emulation.
> > Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> > now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> > helpers acquire the page via GUP, map it into kernel address space with
> > kmap_atomic()-style mechanism and only then copy the data.
> > 
> > For some instruction emulation copying is not good enough: cmpxchg
> > emulation has to have direct access to the guest memory. __kvm_map_gfn()
> > is modified to accommodate the case.
> > 
> > The patchset is on top of v5.7-rc6 plus this patch:
> > 
> > https://lkml.kernel.org/r/20200402172507.2786-1-jimmyassarsson@gmail.com
> > 
> > == Open Issues ==
> > 
> > Unmapping the pages from direct mapping bring a few of issues that have
> > not rectified yet:
> > 
> >   - Touching direct mapping leads to fragmentation. We need to be able to
> >     recover from it. I have a buggy patch that aims at recovering 2M/1G page.
> >     It has to be fixed and tested properly
> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
> will lead to holes in kernel direct-map which force it to not be mapped
> anymore as a series of 1GB huge-pages.
> This have non-trivial performance cost. Thus, I am not sure addressing this
> use-case is valuable.

Here's the buggy patch I've referred to:

http://lore.kernel.org/r/20200416213229.19174-1-kirill.shutemov@linux.intel.com

I plan to get this work right.

> > 
> >   - Page migration and KSM is not supported yet.
> > 
> >   - Live migration of a guest would require a new flow. Not sure yet how it
> >     would look like.
> 
> Note that Live-Migration issue is a result of not making guest data
> accessible to host userspace VMM.

Yes, I understand.
Liran Alon May 25, 2020, 3:56 p.m. UTC | #4
On 25/05/2020 17:46, Kirill A. Shutemov wrote:
> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>> == Background / Problem ==
>>>
>>> There are a number of hardware features (MKTME, SEV) which protect guest
>>> memory from some unauthorized host access. The patchset proposes a purely
>>> software feature that mitigates some of the same host-side read-only
>>> attacks.
>>>
>>>
>>> == What does this set mitigate? ==
>>>
>>>    - Host kernel ”accidental” access to guest data (think speculation)
>> Just to clarify: This is any host kernel memory info-leak vulnerability. Not
>> just speculative execution memory info-leaks. Also architectural ones.
>>
>> In addition, note that removing guest data from host kernel VA space also
>> makes guest<->host memory exploits more difficult.
>> E.g. Guest cannot use already available memory buffer in kernel VA space for
>> ROP or placing valuable guest-controlled code/data in general.
>>
>>>    - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>>>
>>>    - Host userspace access to guest data (compromised qemu)
>> I don't quite understand what is the benefit of preventing userspace VMM
>> access to guest data while the host kernel can still access it.
> Let me clarify: the guest memory mapped into host userspace is not
> accessible by both host kernel and userspace. Host still has way to access
> it via a new interface: GUP(FOLL_KVM). The GUP will give you struct page
> that kernel has to map (temporarily) if need to access the data. So only
> blessed codepaths would know how to deal with the memory.
Yes, I understood that. I meant explicit host kernel access.
>
> It can help preventing some host->guest attack on the compromised host.
> Like if an VM has successfully attacked the host it cannot attack other
> VMs as easy.

We have mechanisms to sandbox the userspace VMM process for that.

You need to be more specific about what attack scenario you are attempting
to address here that is not covered by existing mechanisms, i.e. be crystal
clear on the extra value of the feature of not exposing guest data to the
userspace VMM.

>
> It would also help to protect against guest->host attack by removing one
> more places where the guest's data is mapped on the host.
Because the guest has an explicit interface to request which guest pages 
can be mapped in the userspace VMM, the value of this is very small.

The guest already has the ability to map guest-controlled code/data in the 
userspace VMM, either via this interface or by forcing the userspace VMM
to create various objects during device-emulation handling. The only 
extra property this patch-series provides is that only a small portion of
guest pages will be mapped into host userspace instead of all of it,
resulting in smaller regions for exploits that require guessing a virtual
address. But: (a) userspace VMM device emulation may still allow the guest
to spray the userspace heap with objects containing guest-controlled data.
(b) How is the userspace VMM supposed to limit which guest pages should not
be mapped into the userspace VMM even though the guest has explicitly
requested them to be mapped? (E.g. because they are valid DMA 
sources/targets for virtual devices or because it's a vGPU frame-buffer.)
>> QEMU is more easily compromised than the host kernel because it's
>> guest<->host attack surface is larger (E.g. Various device emulation).
>> But this compromise comes from the guest itself. Not other guests. In
>> contrast to host kernel attack surface, which an info-leak there can
>> be exploited from one guest to leak another guest data.
> Consider the case when unprivileged guest user exploits bug in a QEMU
> device emulation to gain access to data it cannot normally have access
> within the guest. With the feature it would able to see only other shared
> regions of guest memory such as DMA and IO buffers, but not the rest.
This is a scenario where an unprivileged guest userspace process has direct 
access to a virtual device and is able to exploit a bug in device-emulation
handling such that it can compromise the security *inside* the guest, i.e.
leak guest kernel data or the data of other guest userspace processes.

That's true. Good point. This is a very important argument that is missing 
from the cover-letter.

Now the trade-off considered here is crystal clear:
Is the extra complexity and perf cost of the mechanism in this patch-series
worth it to protect against the scenario of a userspace VMM vulnerability, 
reachable by an unprivileged guest userspace process, being used to leak
other *in-guest* data that is not otherwise accessible to that process?

-Liran
Mike Rapoport May 26, 2020, 6:17 a.m. UTC | #5
On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> 
> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> 
> Furthermore, I would like to point out that just unmapping guest data from
> kernel direct-map is not sufficient to prevent all
> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
> is because host kernel VA space have other regions
> which contains guest sensitive data. For example, KVM per-vCPU struct (which
> holds vCPU state) is allocated on slab and therefore
> still leakable.

Objects allocated from slab use the direct map, vmalloc() is another story.

> >   - Touching direct mapping leads to fragmentation. We need to be able to
> >     recover from it. I have a buggy patch that aims at recovering 2M/1G page.
> >     It has to be fixed and tested properly
>
> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
> will lead to holes in kernel direct-map which force it to not be mapped
> anymore as a series of 1GB huge-pages.
> This have non-trivial performance cost. Thus, I am not sure addressing this
> use-case is valuable.

Out of curiosity, do we actually have some numbers for the "non-trivial
performance cost"? For instance for KVM usecase?
Liran Alon May 26, 2020, 10:16 a.m. UTC | #6
On 26/05/2020 9:17, Mike Rapoport wrote:
> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>
>> Furthermore, I would like to point out that just unmapping guest data from
>> kernel direct-map is not sufficient to prevent all
>> guest-to-guest info-leaks via a kernel memory info-leak vulnerability. This
>> is because host kernel VA space have other regions
>> which contains guest sensitive data. For example, KVM per-vCPU struct (which
>> holds vCPU state) is allocated on slab and therefore
>> still leakable.
> Objects allocated from slab use the direct map, vmalloc() is another story.
It doesn't matter. This patch series, like XPFO, only removes guest 
memory pages from the direct-map, not things such as the KVM per-vCPU
struct. That's why Julian & Marius (AWS) created the "Process local 
kernel VA region" patch-series, which declares a single PGD entry,
mapping a kernelspace region, to have different PFNs in different tasks.
For more information, see my KVM Forum talk slides linked in a previous 
reply and the related AWS patch-series:
https://patchwork.kernel.org/cover/10990403/
>
>>>    - Touching direct mapping leads to fragmentation. We need to be able to
>>>      recover from it. I have a buggy patch that aims at recovering 2M/1G page.
>>>      It has to be fixed and tested properly
>> As I've mentioned above, not mapping all guest memory from 1GB hugetlbfs
>> will lead to holes in kernel direct-map which force it to not be mapped
>> anymore as a series of 1GB huge-pages.
>> This have non-trivial performance cost. Thus, I am not sure addressing this
>> use-case is valuable.
> Out of curiosity, do we actually have some numbers for the "non-trivial
> performance cost"? For instance for KVM usecase?
>
Dig into XPFO mailing-list discussions to find out...
I just remember that this was one of the main concerns regarding XPFO.

-Liran
Mike Rapoport May 26, 2020, 11:38 a.m. UTC | #7
On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote:
> 
> On 26/05/2020 9:17, Mike Rapoport wrote:
> > On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> > > On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> > > 
> > Out of curiosity, do we actually have some numbers for the "non-trivial
> > performance cost"? For instance for KVM usecase?
> > 
> Dig into XPFO mailing-list discussions to find out...
> I just remember that this was one of the main concerns regarding XPFO.

The XPFO benchmarks measure the total XPFO cost, and a huge share of it
comes from TLB shootdowns.

It's not exactly a measurement of the impact of direct map fragmentation
on a workload running inside a virtual machine.

> -Liran
Dave Hansen May 27, 2020, 3:45 p.m. UTC | #8
On 5/26/20 4:38 AM, Mike Rapoport wrote:
> On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote:
>> On 26/05/2020 9:17, Mike Rapoport wrote:
>>> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
>>>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
>>>>
>>> Out of curiosity, do we actually have some numbers for the "non-trivial
>>> performance cost"? For instance for KVM usecase?
>>>
>> Dig into XPFO mailing-list discussions to find out...
>> I just remember that this was one of the main concerns regarding XPFO.
> The XPFO benchmarks measure total XPFO cost, and huge share of it comes
> from TLB shootdowns.

Yes, TLB shootdown when pages transition between owners is huge.  The
XPFO folks did a lot of work to try to optimize some of this overhead
away.  But, it's still a concern.

The concern with XPFO was that it could affect *all* application page
allocation.  This approach cheats a bit and only goes after guest VM
pages.  It's significantly more work to allocate a page and map it into
a guest than it is to, for instance, allocate an anonymous user page.
That means that the *additional* overhead of things like this for guest
memory matters a lot less.

> It's not exactly measurement of the imapct of the direct map
> fragmentation to workload running inside a vitrual machine.

While the VM *itself* is running, there is zero overhead.  The host
direct map is not used at *all*.  The guest and host TLB entries share
the same space in the TLB so there could be some increased pressure on
the TLB, but that's a really secondary effect.  It would also only occur
if the guest exits and the host runs and starts evicting TLB entries.

The other effects I could think of would be when the guest exits and the
host is doing some work for the guest, like emulation or something.  The
host would see worse TLB behavior because the host is using the
(fragmented) direct map.

But, both of those things require VMEXITs.  The more exits, the more
overhead you _might_ observe.  What I've been hearing from KVM folks is
that exits are getting more and more rare and the hardware designers are
working hard to minimize them.

That's especially good news because it means that even if the situation
isn't perfect, it's only bound to get *better* over time, not worse.
Mike Rapoport May 27, 2020, 9:22 p.m. UTC | #9
On Wed, May 27, 2020 at 08:45:33AM -0700, Dave Hansen wrote:
> On 5/26/20 4:38 AM, Mike Rapoport wrote:
> > On Tue, May 26, 2020 at 01:16:14PM +0300, Liran Alon wrote:
> >> On 26/05/2020 9:17, Mike Rapoport wrote:
> >>> On Mon, May 25, 2020 at 04:47:18PM +0300, Liran Alon wrote:
> >>>> On 22/05/2020 15:51, Kirill A. Shutemov wrote:
> >>>>
> >>> Out of curiosity, do we actually have some numbers for the "non-trivial
> >>> performance cost"? For instance for KVM usecase?
> >>>
> >> Dig into XPFO mailing-list discussions to find out...
> >> I just remember that this was one of the main concerns regarding XPFO.
> >
> > The XPFO benchmarks measure total XPFO cost, and huge share of it comes
> > from TLB shootdowns.
> 
> Yes, TLB shootdown when pages transition between owners is huge.  The
> XPFO folks did a lot of work to try to optimize some of this overhead
> away.  But, it's still a concern.
> 
> The concern with XPFO was that it could affect *all* application page
> allocation.  This approach cheats a bit and only goes after guest VM
> pages.  It's significantly more work to allocate a page and map it into
> a guest than it is to, for instance, allocate an anonymous user page.
> That means that the *additional* overhead of things like this for guest
> memory matter a lot less.
> 
> > It's not exactly measurement of the imapct of the direct map
> > fragmentation to workload running inside a vitrual machine.
> 
> While the VM *itself* is running, there is zero overhead.  The host
> direct map is not used at *all*.  The guest and host TLB entries share
> the same space in the TLB so there could be some increased pressure on
> the TLB, but that's a really secondary effect.  It would also only occur
> if the guest exits and the host runs and starts evicting TLB entries.
> 
> The other effects I could think of would be when the guest exits and the
> host is doing some work for the guest, like emulation or something.  The
> host would see worse TLB behavior because the host is using the
> (fragmented) direct map.
> 
> But, both of those things require VMEXITs.  The more exits, the more
> overhead you _might_ observe.  What I've been hearing from KVM folks is
> that exits are getting more and more rare and the hardware designers are
> working hard to minimize them.

Right, when the guest stays in guest mode, there is no overhead. But
guests still exit sometimes and I was wondering if anybody had measured
the difference in overhead with different page sizes used for the host's
direct map.

My guesstimate is that the overhead will not differ much for most
workloads. But still, it would be interesting to *know* what it is.

> That's especially good news because it means that even if the
> situation
> isn't perfect, it's only bound to get *better* over time, not worse.

The processors have been aggressively improving performance for decades
and see where we are now because of it ;-)
Marc Zyngier June 4, 2020, 3:15 p.m. UTC | #10
Hi Kirill,

Thanks for this.

On Fri, 22 May 2020 15:51:58 +0300
"Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> == Background / Problem ==
> 
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
> 
> 
> == What does this set mitigate? ==
> 
>  - Host kernel ”accidental” access to guest data (think speculation)
> 
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> 
>  - Host userspace access to guest data (compromised qemu)
> 
> == What does this set NOT mitigate? ==
> 
>  - Full host kernel compromise.  Kernel will just map the pages again.
> 
>  - Hardware attacks

Just as a heads up, we (the Android kernel team) are currently
involved in something pretty similar for KVM/arm64 in order to bring
some level of confidentiality to guests.

The main idea is to de-privilege the host kernel by wrapping it in its
own nested set of page tables which allows us to remove memory
allocated to guests on a per-page basis. The core hypervisor runs more
or less independently at its own privilege level. It still is KVM
though, as we don't intend to reinvent the wheel.

Will has written a much more lingo-heavy description here:
https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/

This works for one of the virtualization modes that arm64 can use (what
we call non-VHE, or nVHE for short). The other mode (VHE), is much more
similar to what happens on other architectures, where the kernel and
the hypervisor are one single entity. In this case, we cannot use the
same trick with nested page tables, and have to rely on something that
would very much look like what you're proposing.

Note that the two modes of the architecture would benefit from this
work anyway, as I'd like the host to know that we've pulled memory
from under its feet. Since you have done most of the initial work, I
intend to give it a go on arm64 shortly and see what sticks.

Thanks,

	M.
Sean Christopherson June 4, 2020, 3:48 p.m. UTC | #11
+Jun

On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
> Hi Kirill,
> 
> Thanks for this.
> 
> On Fri, 22 May 2020 15:51:58 +0300
> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > == Background / Problem ==
> > 
> > There are a number of hardware features (MKTME, SEV) which protect guest
> > memory from some unauthorized host access. The patchset proposes a purely
> > software feature that mitigates some of the same host-side read-only
> > attacks.
> > 
> > 
> > == What does this set mitigate? ==
> > 
> >  - Host kernel ”accidental” access to guest data (think speculation)
> > 
> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> > 
> >  - Host userspace access to guest data (compromised qemu)
> > 
> > == What does this set NOT mitigate? ==
> > 
> >  - Full host kernel compromise.  Kernel will just map the pages again.
> > 
> >  - Hardware attacks
> 
> Just as a heads up, we (the Android kernel team) are currently
> involved in something pretty similar for KVM/arm64 in order to bring
> some level of confidentiality to guests.
> 
> The main idea is to de-privilege the host kernel by wrapping it in its
> own nested set of page tables which allows us to remove memory
> allocated to guests on a per-page basis. The core hypervisor runs more
> or less independently at its own privilege level. It still is KVM
> though, as we don't intend to reinvent the wheel.
> 
> Will has written a much more lingo-heavy description here:
> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/

Pardon my arm64 ignorance...

IIUC, in this mode, the host kernel runs at EL1?  And to switch to a guest
it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
I assume the EL1->EL2->EL1 switch is done by trapping an exception of some
form?

If all of the above are "yes", does KVM already have the necessary logic to
perform the EL1->EL2->EL1 switches, or is that being added as part of the
de-privileging effort?
 
> This works for one of the virtualization modes that arm64 can use (what
> we call non-VHE, or nVHE for short). The other mode (VHE), is much more
> similar to what happens on other architectures, where the kernel and
> the hypervisor are one single entity. In this case, we cannot use the
> same trick with nested page tables, and have to rely on something that
> would very much look like what you're proposing.
> 
> Note that the two modes of the architecture would benefit from this
> work anyway, as I'd like the host to know that we've pulled memory
> from under its feet. Since you have done most of the initial work, I
> intend to give it a go on arm64 shortly and see what sticks.
Marc Zyngier June 4, 2020, 4:27 p.m. UTC | #12
Hi Sean,

On 2020-06-04 16:48, Sean Christopherson wrote:
> +Jun
> 
> On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
>> Hi Kirill,
>> 
>> Thanks for this.
>> 
>> On Fri, 22 May 2020 15:51:58 +0300
>> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>> 
>> > == Background / Problem ==
>> >
>> > There are a number of hardware features (MKTME, SEV) which protect guest
>> > memory from some unauthorized host access. The patchset proposes a purely
>> > software feature that mitigates some of the same host-side read-only
>> > attacks.
>> >
>> >
>> > == What does this set mitigate? ==
>> >
>> >  - Host kernel ”accidental” access to guest data (think speculation)
>> >
>> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>> >
>> >  - Host userspace access to guest data (compromised qemu)
>> >
>> > == What does this set NOT mitigate? ==
>> >
>> >  - Full host kernel compromise.  Kernel will just map the pages again.
>> >
>> >  - Hardware attacks
>> 
>> Just as a heads up, we (the Android kernel team) are currently
>> involved in something pretty similar for KVM/arm64 in order to bring
>> some level of confidentiality to guests.
>> 
>> The main idea is to de-privilege the host kernel by wrapping it in its
>> own nested set of page tables which allows us to remove memory
>> allocated to guests on a per-page basis. The core hypervisor runs more
>> or less independently at its own privilege level. It still is KVM
>> though, as we don't intend to reinvent the wheel.
>> 
>> Will has written a much more lingo-heavy description here:
>> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/
> 
> Pardon my arm64 ignorance...
> 
> IIUC, in this mode, the host kernel runs at EL1?  And to switch to a 
> guest
> it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
> I assume the EL1->EL2->EL1 switch is done by trapping an exception of 
> some
> form?
> 
> If all of the above are "yes", does KVM already have the necessary 
> logic to
> perform the EL1->EL2->EL1 switches, or is that being added as part of 
> the
> de-privileging effort?

KVM already handles the EL1->EL2->EL1 madness, meaning that from
an exception level perspective, the host kernel is already a guest.
It's just that this guest can directly change the hypervisor's text,
its page tables, and muck with about everything else.

De-privileging access to non-host EL1 memory is where the ongoing effort
is.

          M.
Will Deacon June 4, 2020, 4:35 p.m. UTC | #13
Hi Sean,

On Thu, Jun 04, 2020 at 08:48:35AM -0700, Sean Christopherson wrote:
> On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
> > On Fri, 22 May 2020 15:51:58 +0300
> > "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> > 
> > > == Background / Problem ==
> > > 
> > > There are a number of hardware features (MKTME, SEV) which protect guest
> > > memory from some unauthorized host access. The patchset proposes a purely
> > > software feature that mitigates some of the same host-side read-only
> > > attacks.
> > > 
> > > 
> > > == What does this set mitigate? ==
> > > 
> > >  - Host kernel ”accidental” access to guest data (think speculation)
> > > 
> > >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> > > 
> > >  - Host userspace access to guest data (compromised qemu)
> > > 
> > > == What does this set NOT mitigate? ==
> > > 
> > >  - Full host kernel compromise.  Kernel will just map the pages again.
> > > 
> > >  - Hardware attacks
> > 
> > Just as a heads up, we (the Android kernel team) are currently
> > involved in something pretty similar for KVM/arm64 in order to bring
> > some level of confidentiality to guests.
> > 
> > The main idea is to de-privilege the host kernel by wrapping it in its
> > own nested set of page tables which allows us to remove memory
> > allocated to guests on a per-page basis. The core hypervisor runs more
> > or less independently at its own privilege level. It still is KVM
> > though, as we don't intend to reinvent the wheel.
> > 
> > Will has written a much more lingo-heavy description here:
> > https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/
> 
> Pardon my arm64 ignorance...

No, not at all!

> IIUC, in this mode, the host kernel runs at EL1?  And to switch to a guest
> it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
> I assume the EL1->EL2->EL1 switch is done by trapping an exception of some
> form?

Yes, and this is actually the way that KVM works on some Arm CPUs today,
as the original virtualisation extensions in the Armv8 architecture do
not make it possible to run the kernel directly at EL2 (for example, there
is only one page-table base register). This was later addressed in the
architecture by the "Virtualisation Host Extensions (VHE)", and so KVM
supports both options.

With non-VHE today, there is a small amount of "world switch" code at
EL2 which is installed by the host kernel and provides a way to transition
between the host and the guest. If the host needs to do something at EL2
(e.g. privileged TLB invalidation), then it makes a hypercall (HVC instruction)
via the kvm_call_hyp() macro (and this ends up just being a function call
for VHE).
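
A minimal, compilable sketch of the VHE vs. non-VHE dispatch described above;
host_has_vhe(), hvc_call() and HYP_FLUSH_VMID are illustrative stand-ins, not
the real arm64 KVM symbols or hypercall ABI:

	#include <stdbool.h>
	#include <stdio.h>

	enum hyp_call { HYP_FLUSH_VMID };

	static bool host_has_vhe(void)
	{
		return false;		/* pretend we are on a non-VHE CPU */
	}

	/* EL2-resident handler: with VHE the host can call this directly. */
	static void el2_flush_vmid(unsigned long vmid)
	{
		printf("EL2: invalidating TLB entries for VMID %lu\n", vmid);
	}

	/* Stand-in for the HVC trap into the world-switch shim at EL2. */
	static void hvc_call(enum hyp_call nr, unsigned long arg)
	{
		switch (nr) {
		case HYP_FLUSH_VMID:
			el2_flush_vmid(arg);
			break;
		}
	}

	/* Rough analogue of the kvm_call_hyp() pattern for one service. */
	static void kvm_call_hyp_flush(unsigned long vmid)
	{
		if (host_has_vhe())
			el2_flush_vmid(vmid);		/* VHE: plain function call */
		else
			hvc_call(HYP_FLUSH_VMID, vmid);	/* non-VHE: HVC to the shim */
	}

	int main(void)
	{
		kvm_call_hyp_flush(42);
		return 0;
	}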

> If all of the above are "yes", does KVM already have the necessary logic to
> perform the EL1->EL2->EL1 switches, or is that being added as part of the
> de-privileging effort?

The logic is there as part of the non-VHE support code, but it's not great
from a security angle. For example, the guest stage-2 page-tables are still
allocated by the host, the host has complete access to guest and hypervisor
memory (including hypervisor text) and things like kvm_call_hyp() are a bit
of an open door. We're working on making the EL2 code more self-contained,
so that after the host has initialised KVM, it can shut the door and the
hypervisor can install a stage-2 translation over the host, which limits its
access to hypervisor and guest memory. There will clearly be IOMMU work as
well to prevent DMA attacks.
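
As a rough illustration of the "stage-2 over the host" idea, the toy model
below tracks page ownership in a flat array standing in for the host's
stage-2 tables; all names and structures here are hypothetical, not the
actual EL2 code:

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdio.h>

	#define HOST_S2_PAGES	(1UL << 20)	/* toy model: 4 GiB of 4 KiB pages */

	enum s2_owner { OWNER_HOST, OWNER_GUEST };

	/* Flat array stands in for the host stage-2 tables; zero => OWNER_HOST. */
	static unsigned char host_stage2[HOST_S2_PAGES];

	/* EL2 side: the host donates a page to a protected guest and loses access. */
	static int host_stage2_donate_to_guest(size_t pfn)
	{
		if (pfn >= HOST_S2_PAGES || host_stage2[pfn] != OWNER_HOST)
			return -1;		/* not the host's page to give away */
		host_stage2[pfn] = OWNER_GUEST;	/* host access now faults at stage-2 */
		return 0;
	}

	/* Permission check the hypervisor would apply on a host stage-2 fault. */
	static bool host_can_access(size_t pfn)
	{
		return pfn < HOST_S2_PAGES && host_stage2[pfn] == OWNER_HOST;
	}

	int main(void)
	{
		host_stage2_donate_to_guest(0x1234);
		printf("host access to pfn 0x1234 allowed: %d\n", host_can_access(0x1234));
		return 0;
	}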

Will
Nakajima, Jun June 4, 2020, 7:09 p.m. UTC | #14
> 
> On Jun 4, 2020, at 9:35 AM, Will Deacon <will@kernel.org> wrote:
> 
> Hi Sean,
> 
> On Thu, Jun 04, 2020 at 08:48:35AM -0700, Sean Christopherson wrote:
>> On Thu, Jun 04, 2020 at 04:15:23PM +0100, Marc Zyngier wrote:
>>> On Fri, 22 May 2020 15:51:58 +0300
>>> "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
>>> 
>>>> == Background / Problem ==
>>>> 
>>>> There are a number of hardware features (MKTME, SEV) which protect guest
>>>> memory from some unauthorized host access. The patchset proposes a purely
>>>> software feature that mitigates some of the same host-side read-only
>>>> attacks.
>>>> 
>>>> 
>>>> == What does this set mitigate? ==
>>>> 
>>>> - Host kernel ”accidental” access to guest data (think speculation)
>>>> 
>>>> - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>>>> 
>>>> - Host userspace access to guest data (compromised qemu)
>>>> 
>>>> == What does this set NOT mitigate? ==
>>>> 
>>>> - Full host kernel compromise.  Kernel will just map the pages again.
>>>> 
>>>> - Hardware attacks
>>> 
>>> Just as a heads up, we (the Android kernel team) are currently
>>> involved in something pretty similar for KVM/arm64 in order to bring
>>> some level of confidentiality to guests.
>>> 
>>> The main idea is to de-privilege the host kernel by wrapping it in its
>>> own nested set of page tables which allows us to remove memory
>>> allocated to guests on a per-page basis. The core hypervisor runs more
>>> or less independently at its own privilege level. It still is KVM
>>> though, as we don't intend to reinvent the wheel.
>>> 
>>> Will has written a much more lingo-heavy description here:
>>> https://lore.kernel.org/kvmarm/20200327165935.GA8048@willie-the-truck/
>> 

We (the Intel virtualization team) are also working on something similar, prototyping to meet the same requirement, i.e. “some level of confidentiality to guests”. Linux/KVM is the host, and Kirill’s patches are helpful for removing the mappings from the host to achieve memory isolation of a guest. But it’s not easy to prove there are no other mappings.

To raise the level of security, our idea is to de-privilege the host kernel just enough to enforce memory isolation, using EPT (Extended Page Tables) to virtualize the host kernel’s physical memory as if it were a guest; almost everything is passthrough. The EPT for the host kernel excludes the memory of any guest(s) holding confidential info, so the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though).

When control enters KVM, we go back to privileged (hypervisor, or root) mode, and it works as it does today. Once a VM exit happens, we stay in root mode as long as the exit can be handled within KVM. If we need to depend on the host kernel, we de-privilege it again (i.e. VM enter). Yes, it sounds ugly.
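
A toy sketch of the EPT layout described above, with the confidential guest’s frames simply left unmapped in the host’s identity map; the structures and helper names are hypothetical, not KVM code:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define EPT_READ	(1u << 0)
	#define EPT_WRITE	(1u << 1)
	#define EPT_EXEC	(1u << 2)

	#define NR_FRAMES	(1u << 20)	/* toy model: 4 GiB of 4 KiB frames */

	/* One permission byte per host frame stands in for the EPT structures. */
	static uint8_t host_ept[NR_FRAMES];

	/* Stub: in reality this would consult the protected guest's memory ranges. */
	static bool frame_belongs_to_protected_guest(uint32_t pfn)
	{
		return pfn >= 0x80000 && pfn < 0x80800;	/* hypothetical 8 MiB region */
	}

	/* Build the passthrough EPT under which the host kernel will run. */
	static void host_ept_build(void)
	{
		for (uint32_t pfn = 0; pfn < NR_FRAMES; pfn++) {
			if (frame_belongs_to_protected_guest(pfn))
				host_ept[pfn] = 0;	/* unmapped: host access => EPT violation */
			else
				host_ept[pfn] = EPT_READ | EPT_WRITE | EPT_EXEC;
		}
	}

	int main(void)
	{
		host_ept_build();
		printf("host perms at guest frame 0x80010: %#x\n", (unsigned)host_ept[0x80010]);
		printf("host perms at ordinary frame 0x100: %#x\n", (unsigned)host_ept[0x100]);
		return 0;
	}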

There are cleaner (but more expensive) approaches, and we are collecting data at this point. For example, we could run the host kernel (like Xen dom0) on top of a thin hypervisor consisting of KVM and a minimally configured Linux.

> 
>> IIUC, in this mode, the host kernel runs at EL1?  And to switch to a guest
>> it has to bounce through EL2, which is KVM, or at least a chunk of KVM?
>> I assume the EL1->EL2->EL1 switch is done by trapping an exception of some
>> form?
> 
> Yes, and this is actually the way that KVM works on some Arm CPUs today,
> as the original virtualisation extensions in the Armv8 architecture do
> not make it possible to run the kernel directly at EL2 (for example, there
> is only one page-table base register). This was later addressed in the
> architecture by the "Virtualisation Host Extensions (VHE)", and so KVM
> supports both options.
> 
> With non-VHE today, there is a small amount of "world switch" code at
> EL2 which is installed by the host kernel and provides a way to transition
> between the host and the guest. If the host needs to do something at EL2
> (e.g. privileged TLB invalidation), then it makes a hypercall (HVC instruction)
> via the kvm_call_hyp() macro (and this ends up just being a function call
> for VHE).
> 
>> If all of the above are "yes", does KVM already have the necessary logic to
>> perform the EL1->EL2->EL1 switches, or is that being added as part of the
>> de-privileging effort?
> 
> The logic is there as part of the non-VHE support code, but it's not great
> from a security angle. For example, the guest stage-2 page-tables are still
> allocated by the host, the host has complete access to guest and hypervisor
> memory (including hypervisor text) and things like kvm_call_hyp() are a bit
> of an open door. We're working on making the EL2 code more self contained,
> so that after the host has initialised KVM, it can shut the door and the
> hypervisor can install a stage-2 translation over the host, which limits its
> access to hypervisor and guest memory. There will clearly be IOMMU work as
> well to prevent DMA attacks.

Sounds interesting. 

--- 
Jun
Intel Open Source Technology Center
Jim Mattson June 4, 2020, 9:03 p.m. UTC | #15
On Thu, Jun 4, 2020 at 12:09 PM Nakajima, Jun <jun.nakajima@intel.com> wrote:

> We (the Intel virtualization team) are also working on something similar, prototyping to meet the same requirement, i.e. “some level of confidentiality to guests”. Linux/KVM is the host, and Kirill’s patches are helpful for removing the mappings from the host to achieve memory isolation of a guest. But it’s not easy to prove there are no other mappings.
>
> To raise the level of security, our idea is to de-privilege the host kernel just enough to enforce memory isolation, using EPT (Extended Page Tables) to virtualize the host kernel’s physical memory as if it were a guest; almost everything is passthrough. The EPT for the host kernel excludes the memory of any guest(s) holding confidential info, so the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though).

You're Intel. Can't you just change the CPUID intercept from required
to optional? It seems like this should be in the realm of a small
microcode patch.
Nakajima, Jun June 4, 2020, 11:29 p.m. UTC | #16
> 
> On Jun 4, 2020, at 2:03 PM, Jim Mattson <jmattson@google.com> wrote:
> 
> On Thu, Jun 4, 2020 at 12:09 PM Nakajima, Jun <jun.nakajima@intel.com> wrote:
> 
>> We (the Intel virtualization team) are also working on something similar, prototyping to meet the same requirement, i.e. “some level of confidentiality to guests”. Linux/KVM is the host, and Kirill’s patches are helpful for removing the mappings from the host to achieve memory isolation of a guest. But it’s not easy to prove there are no other mappings.
>> 
>> To raise the level of security, our idea is to de-privilege the host kernel just enough to enforce memory isolation, using EPT (Extended Page Tables) to virtualize the host kernel’s physical memory as if it were a guest; almost everything is passthrough. The EPT for the host kernel excludes the memory of any guest(s) holding confidential info, so the host kernel shouldn’t cause VM exits as long as it’s behaving well (CPUID still causes a VM exit, though).
> 
> You're Intel. Can't you just change the CPUID intercept from required
> to optional? It seems like this should be in the realm of a small
> microcode patch.

We’ll take a look. It would probably be helpful even for the bare-metal kernel (e.g. for debugging).
Thanks for the suggestion.

--- 
Jun
Intel Open Source Technology Center