mbox series

[RFCv2,00/16] KVM protected memory extension

Message ID 20201020061859.18385-1-kirill.shutemov@linux.intel.com (mailing list archive)
Headers show
Series KVM protected memory extension | expand

Message

Kirill A. Shutemov Oct. 20, 2020, 6:18 a.m. UTC
== Background / Problem ==

There are a number of hardware features (MKTME, SEV) which protect guest
memory from some unauthorized host access. The patchset proposes a purely
software feature that mitigates some of the same host-side read-only
attacks.


== What does this set mitigate? ==

 - Host kernel ”accidental” access to guest data (think speculation)

 - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))

 - Host userspace access to guest data (compromised qemu)

 - Guest privilege escalation via compromised QEMU device emulation

== What does this set NOT mitigate? ==

 - Full host kernel compromise.  Kernel will just map the pages again.

 - Hardware attacks


The second RFC revision addresses /most/ of the feedback.

I still didn't found a good solution to reboot and kexec. Unprotect all
the memory on such operations defeat the goal of the feature. Clearing up
most of the memory before unprotecting what is required for reboot (or
kexec) is tedious and error-prone.
Maybe we should just declare them unsupported?

== Series Overview ==

The hardware features protect guest data by encrypting it and then
ensuring that only the right guest can decrypt it.  This has the
side-effect of making the kernel direct map and userspace mapping
(QEMU et al) useless.  But, this teaches us something very useful:
neither the kernel or userspace mappings are really necessary for normal
guest operations.

Instead of using encryption, this series simply unmaps the memory. One
advantage compared to allowing access to ciphertext is that it allows bad
accesses to be caught instead of simply reading garbage.

Protection from physical attacks needs to be provided by some other means.
On Intel platforms, (single-key) Total Memory Encryption (TME) provides
mitigation against physical attacks, such as DIMM interposers sniffing
memory bus traffic.

The patchset modifies both host and guest kernel. The guest OS must enable
the feature via hypercall and mark any memory range that has to be shared
with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
bit in the guest’s page table while this approach uses a hypercall.

For removing the userspace mapping, use a trick similar to what NUMA
balancing does: convert memory that belongs to KVM memory slots to
PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
VMA must be treated in a special way in the GUP and fault paths. The flag
allows GUP to return the page even though it is mapped with PROT_NONE, but
only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
would result in -EFAULT.

Removing userspace mapping of the guest memory from QEMU process can help
to address some guest privilege escalation attacks. Consider the case when
unprivileged guest user exploits bug in a QEMU device emulation to gain
access to data it cannot normally have access within the guest.

Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
flushes local TLB. I think it's a reasonable compromise between security and
perfromance.

Zapping the PTE would bring the page back to the direct mapping after clearing.
At least for now, we don't remove file-backed pages from the direct mapping.
File-backed pages could be accessed via read/write syscalls. It adds
complexity.

Occasionally, host kernel has to access guest memory that was not made
shared by the guest. For instance, it happens for instruction emulation.
Normally, it's done via copy_to/from_user() which would fail with -EFAULT
now. We introduced a new pair of helpers: copy_to/from_guest(). The new
helpers acquire the page via GUP, map it into kernel address space with
kmap_atomic()-style mechanism and only then copy the data.

For some instruction emulation copying is not good enough: cmpxchg
emulation has to have direct access to the guest memory. __kvm_map_gfn()
is modified to accommodate the case.

The patchset is on top of v5.9

Kirill A. Shutemov (16):
  x86/mm: Move force_dma_unencrypted() to common code
  x86/kvm: Introduce KVM memory protection feature
  x86/kvm: Make DMA pages shared
  x86/kvm: Use bounce buffers for KVM memory protection
  x86/kvm: Make VirtIO use DMA API in KVM guest
  x86/kvmclock: Share hvclock memory with the host
  x86/realmode: Share trampoline area if KVM memory protection enabled
  KVM: Use GUP instead of copy_from/to_user() to access guest memory
  KVM: mm: Introduce VM_KVM_PROTECTED
  KVM: x86: Use GUP for page walk instead of __get_user()
  KVM: Protected memory extension
  KVM: x86: Enabled protected memory extension
  KVM: Rework copy_to/from_guest() to avoid direct mapping
  KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
  KVM: Unmap protected pages from direct mapping
  mm: Do not use zero page for VM_KVM_PROTECTED VMAs

 arch/powerpc/kvm/book3s_64_mmu_hv.c    |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
 arch/s390/include/asm/pgtable.h        |   2 +-
 arch/x86/Kconfig                       |  11 +-
 arch/x86/include/asm/cpufeatures.h     |   1 +
 arch/x86/include/asm/io.h              |   6 +-
 arch/x86/include/asm/kvm_para.h        |   5 +
 arch/x86/include/asm/pgtable_types.h   |   1 +
 arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
 arch/x86/kernel/kvm.c                  |  20 +++
 arch/x86/kernel/kvmclock.c             |   2 +-
 arch/x86/kernel/pci-swiotlb.c          |   3 +-
 arch/x86/kvm/Kconfig                   |   1 +
 arch/x86/kvm/cpuid.c                   |   3 +-
 arch/x86/kvm/mmu/mmu.c                 |   6 +-
 arch/x86/kvm/mmu/paging_tmpl.h         |  10 +-
 arch/x86/kvm/x86.c                     |   9 +
 arch/x86/mm/Makefile                   |   2 +
 arch/x86/mm/ioremap.c                  |  16 +-
 arch/x86/mm/mem_encrypt.c              |  51 ------
 arch/x86/mm/mem_encrypt_common.c       |  62 +++++++
 arch/x86/mm/pat/set_memory.c           |   7 +
 arch/x86/realmode/init.c               |   7 +-
 drivers/virtio/virtio_ring.c           |   4 +
 include/linux/kvm_host.h               |  11 +-
 include/linux/kvm_types.h              |   1 +
 include/linux/mm.h                     |  21 ++-
 include/uapi/linux/kvm_para.h          |   5 +-
 mm/gup.c                               |  20 ++-
 mm/huge_memory.c                       |  31 +++-
 mm/ksm.c                               |   2 +
 mm/memory.c                            |  18 +-
 mm/mmap.c                              |   3 +
 mm/rmap.c                              |   4 +
 virt/kvm/Kconfig                       |   3 +
 virt/kvm/async_pf.c                    |   2 +-
 virt/kvm/kvm_main.c                    | 238 +++++++++++++++++++++----
 virt/lib/Makefile                      |   1 +
 virt/lib/mem_protected.c               | 193 ++++++++++++++++++++
 39 files changed, 666 insertions(+), 123 deletions(-)
 create mode 100644 arch/x86/mm/mem_encrypt_common.c
 create mode 100644 virt/lib/mem_protected.c

Comments

Vitaly Kuznetsov Oct. 20, 2020, 7:46 a.m. UTC | #1
"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> == Background / Problem ==
>
> There are a number of hardware features (MKTME, SEV) which protect guest
> memory from some unauthorized host access. The patchset proposes a purely
> software feature that mitigates some of the same host-side read-only
> attacks.
>
>
> == What does this set mitigate? ==
>
>  - Host kernel ”accidental” access to guest data (think speculation)
>
>  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>
>  - Host userspace access to guest data (compromised qemu)
>
>  - Guest privilege escalation via compromised QEMU device emulation
>
> == What does this set NOT mitigate? ==
>
>  - Full host kernel compromise.  Kernel will just map the pages again.
>
>  - Hardware attacks
>
>
> The second RFC revision addresses /most/ of the feedback.
>
> I still didn't found a good solution to reboot and kexec. Unprotect all
> the memory on such operations defeat the goal of the feature. Clearing up
> most of the memory before unprotecting what is required for reboot (or
> kexec) is tedious and error-prone.
> Maybe we should just declare them unsupported?

Making reboot unsupported is a hard sell. Could you please elaborate on
why you think that "unprotect all" hypercall (or rather a single
hypercall supporting both protecting/unprotecting) defeats the purpose
of the feature?

(Leaving kexec aside for a while) Yes, it is not easy for a guest to
clean up *all* its memory upon reboot, however:
- It may only clean up the most sensitive parts. This should probably be
done even without this new feature and even on bare metal (think about
next boot target being malicious).
- The attack window shrinks significantly. "Speculative" bugs require
time to exploit and it will only remain open until it boots up again
(few seconds).

>
> == Series Overview ==
>
> The hardware features protect guest data by encrypting it and then
> ensuring that only the right guest can decrypt it.  This has the
> side-effect of making the kernel direct map and userspace mapping
> (QEMU et al) useless.  But, this teaches us something very useful:
> neither the kernel or userspace mappings are really necessary for normal
> guest operations.
>
> Instead of using encryption, this series simply unmaps the memory. One
> advantage compared to allowing access to ciphertext is that it allows bad
> accesses to be caught instead of simply reading garbage.
>
> Protection from physical attacks needs to be provided by some other means.
> On Intel platforms, (single-key) Total Memory Encryption (TME) provides
> mitigation against physical attacks, such as DIMM interposers sniffing
> memory bus traffic.
>
> The patchset modifies both host and guest kernel. The guest OS must enable
> the feature via hypercall and mark any memory range that has to be shared
> with the host: DMA regions, bounce buffers, etc. SEV does this marking via a
> bit in the guest’s page table while this approach uses a hypercall.
>
> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>
> Removing userspace mapping of the guest memory from QEMU process can help
> to address some guest privilege escalation attacks. Consider the case when
> unprivileged guest user exploits bug in a QEMU device emulation to gain
> access to data it cannot normally have access within the guest.
>
> Any anonymous page faulted into the VM_KVM_PROTECTED VMA gets removed from
> the direct mapping with kernel_map_pages(). Note that kernel_map_pages() only
> flushes local TLB. I think it's a reasonable compromise between security and
> perfromance.
>
> Zapping the PTE would bring the page back to the direct mapping after clearing.
> At least for now, we don't remove file-backed pages from the direct mapping.
> File-backed pages could be accessed via read/write syscalls. It adds
> complexity.
>
> Occasionally, host kernel has to access guest memory that was not made
> shared by the guest. For instance, it happens for instruction emulation.
> Normally, it's done via copy_to/from_user() which would fail with -EFAULT
> now. We introduced a new pair of helpers: copy_to/from_guest(). The new
> helpers acquire the page via GUP, map it into kernel address space with
> kmap_atomic()-style mechanism and only then copy the data.
>
> For some instruction emulation copying is not good enough: cmpxchg
> emulation has to have direct access to the guest memory. __kvm_map_gfn()
> is modified to accommodate the case.
>
> The patchset is on top of v5.9
>
> Kirill A. Shutemov (16):
>   x86/mm: Move force_dma_unencrypted() to common code
>   x86/kvm: Introduce KVM memory protection feature
>   x86/kvm: Make DMA pages shared
>   x86/kvm: Use bounce buffers for KVM memory protection
>   x86/kvm: Make VirtIO use DMA API in KVM guest
>   x86/kvmclock: Share hvclock memory with the host
>   x86/realmode: Share trampoline area if KVM memory protection enabled
>   KVM: Use GUP instead of copy_from/to_user() to access guest memory
>   KVM: mm: Introduce VM_KVM_PROTECTED
>   KVM: x86: Use GUP for page walk instead of __get_user()
>   KVM: Protected memory extension
>   KVM: x86: Enabled protected memory extension
>   KVM: Rework copy_to/from_guest() to avoid direct mapping
>   KVM: Handle protected memory in __kvm_map_gfn()/__kvm_unmap_gfn()
>   KVM: Unmap protected pages from direct mapping
>   mm: Do not use zero page for VM_KVM_PROTECTED VMAs
>
>  arch/powerpc/kvm/book3s_64_mmu_hv.c    |   2 +-
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |   2 +-
>  arch/s390/include/asm/pgtable.h        |   2 +-
>  arch/x86/Kconfig                       |  11 +-
>  arch/x86/include/asm/cpufeatures.h     |   1 +
>  arch/x86/include/asm/io.h              |   6 +-
>  arch/x86/include/asm/kvm_para.h        |   5 +
>  arch/x86/include/asm/pgtable_types.h   |   1 +
>  arch/x86/include/uapi/asm/kvm_para.h   |   3 +-
>  arch/x86/kernel/kvm.c                  |  20 +++
>  arch/x86/kernel/kvmclock.c             |   2 +-
>  arch/x86/kernel/pci-swiotlb.c          |   3 +-
>  arch/x86/kvm/Kconfig                   |   1 +
>  arch/x86/kvm/cpuid.c                   |   3 +-
>  arch/x86/kvm/mmu/mmu.c                 |   6 +-
>  arch/x86/kvm/mmu/paging_tmpl.h         |  10 +-
>  arch/x86/kvm/x86.c                     |   9 +
>  arch/x86/mm/Makefile                   |   2 +
>  arch/x86/mm/ioremap.c                  |  16 +-
>  arch/x86/mm/mem_encrypt.c              |  51 ------
>  arch/x86/mm/mem_encrypt_common.c       |  62 +++++++
>  arch/x86/mm/pat/set_memory.c           |   7 +
>  arch/x86/realmode/init.c               |   7 +-
>  drivers/virtio/virtio_ring.c           |   4 +
>  include/linux/kvm_host.h               |  11 +-
>  include/linux/kvm_types.h              |   1 +
>  include/linux/mm.h                     |  21 ++-
>  include/uapi/linux/kvm_para.h          |   5 +-
>  mm/gup.c                               |  20 ++-
>  mm/huge_memory.c                       |  31 +++-
>  mm/ksm.c                               |   2 +
>  mm/memory.c                            |  18 +-
>  mm/mmap.c                              |   3 +
>  mm/rmap.c                              |   4 +
>  virt/kvm/Kconfig                       |   3 +
>  virt/kvm/async_pf.c                    |   2 +-
>  virt/kvm/kvm_main.c                    | 238 +++++++++++++++++++++----
>  virt/lib/Makefile                      |   1 +
>  virt/lib/mem_protected.c               | 193 ++++++++++++++++++++
>  39 files changed, 666 insertions(+), 123 deletions(-)
>  create mode 100644 arch/x86/mm/mem_encrypt_common.c
>  create mode 100644 virt/lib/mem_protected.c
Kirill A. Shutemov Oct. 20, 2020, 1:49 p.m. UTC | #2
On Tue, Oct 20, 2020 at 09:46:11AM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > == Background / Problem ==
> >
> > There are a number of hardware features (MKTME, SEV) which protect guest
> > memory from some unauthorized host access. The patchset proposes a purely
> > software feature that mitigates some of the same host-side read-only
> > attacks.
> >
> >
> > == What does this set mitigate? ==
> >
> >  - Host kernel ”accidental” access to guest data (think speculation)
> >
> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> >
> >  - Host userspace access to guest data (compromised qemu)
> >
> >  - Guest privilege escalation via compromised QEMU device emulation
> >
> > == What does this set NOT mitigate? ==
> >
> >  - Full host kernel compromise.  Kernel will just map the pages again.
> >
> >  - Hardware attacks
> >
> >
> > The second RFC revision addresses /most/ of the feedback.
> >
> > I still didn't found a good solution to reboot and kexec. Unprotect all
> > the memory on such operations defeat the goal of the feature. Clearing up
> > most of the memory before unprotecting what is required for reboot (or
> > kexec) is tedious and error-prone.
> > Maybe we should just declare them unsupported?
> 
> Making reboot unsupported is a hard sell. Could you please elaborate on
> why you think that "unprotect all" hypercall (or rather a single
> hypercall supporting both protecting/unprotecting) defeats the purpose
> of the feature?

If guest has some data that it prefers not to leak to the host and use the
feature for the purpose, share all the memory to get through reboot is a
very weak point.

> 
> clean up *all* its memory upon reboot, however:
> - It may only clean up the most sensitive parts. This should probably be
> done even without this new feature and even on bare metal (think about
> next boot target being malicious).
> - The attack window shrinks significantly. "Speculative" bugs require
> time to exploit and it will only remain open until it boots up again
> (few seconds).

Maybe it would be cleaner to handle reboot in userspace? If we got the VM
rebooted, just reconstruct it from scratch as if it would be new boot.
Vitaly Kuznetsov Oct. 21, 2020, 2:46 p.m. UTC | #3
"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Tue, Oct 20, 2020 at 09:46:11AM +0200, Vitaly Kuznetsov wrote:
>> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> 
>> > == Background / Problem ==
>> >
>> > There are a number of hardware features (MKTME, SEV) which protect guest
>> > memory from some unauthorized host access. The patchset proposes a purely
>> > software feature that mitigates some of the same host-side read-only
>> > attacks.
>> >
>> >
>> > == What does this set mitigate? ==
>> >
>> >  - Host kernel ”accidental” access to guest data (think speculation)
>> >
>> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
>> >
>> >  - Host userspace access to guest data (compromised qemu)
>> >
>> >  - Guest privilege escalation via compromised QEMU device emulation
>> >
>> > == What does this set NOT mitigate? ==
>> >
>> >  - Full host kernel compromise.  Kernel will just map the pages again.
>> >
>> >  - Hardware attacks
>> >
>> >
>> > The second RFC revision addresses /most/ of the feedback.
>> >
>> > I still didn't found a good solution to reboot and kexec. Unprotect all
>> > the memory on such operations defeat the goal of the feature. Clearing up
>> > most of the memory before unprotecting what is required for reboot (or
>> > kexec) is tedious and error-prone.
>> > Maybe we should just declare them unsupported?
>> 
>> Making reboot unsupported is a hard sell. Could you please elaborate on
>> why you think that "unprotect all" hypercall (or rather a single
>> hypercall supporting both protecting/unprotecting) defeats the purpose
>> of the feature?
>
> If guest has some data that it prefers not to leak to the host and use the
> feature for the purpose, share all the memory to get through reboot is a
> very weak point.
>

My point that if it knows that there's something sensitive in its
memory it should clean it up even today without your feature before
rebooting to an unknown target.

>> 
>> clean up *all* its memory upon reboot, however:
>> - It may only clean up the most sensitive parts. This should probably be
>> done even without this new feature and even on bare metal (think about
>> next boot target being malicious).
>> - The attack window shrinks significantly. "Speculative" bugs require
>> time to exploit and it will only remain open until it boots up again
>> (few seconds).
>
> Maybe it would be cleaner to handle reboot in userspace? If we got the VM
> rebooted, just reconstruct it from scratch as if it would be new boot.

We are definitely not trying to protect against malicious KVM so maybe
we can do the cleanup there (when protection was enabled) so we can
unprotect everything without risk of a leak?
Andy Lutomirski Oct. 21, 2020, 6:20 p.m. UTC | #4
> On Oct 19, 2020, at 11:19 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:

> For removing the userspace mapping, use a trick similar to what NUMA
> balancing does: convert memory that belongs to KVM memory slots to
> PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> VMA must be treated in a special way in the GUP and fault paths. The flag
> allows GUP to return the page even though it is mapped with PROT_NONE, but
> only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> would result in -EFAULT.
>

I definitely like the direction this patchset is going in, and I think
that allowing KVM guests to have memory that is inaccessible to QEMU
is a great idea.

I do wonder, though: do we really want to do this with these PROT_NONE
tricks, or should we actually come up with a way to have KVM guest map
memory that isn't mapped into QEMU's mm_struct at all?  As an example
of the latter, I mean something a bit like this:

https://lkml.kernel.org/r/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com

I don't mean to say that this is a requirement of any kind of
protected memory like this, but I do think we should understand the
tradeoffs, in terms of what a full implementation looks like, the
effort and time frames involved, and the maintenance burden of
supporting whatever gets merged going forward.
Kirill A. Shutemov Oct. 23, 2020, 11:35 a.m. UTC | #5
On Wed, Oct 21, 2020 at 04:46:48PM +0200, Vitaly Kuznetsov wrote:
> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> 
> > On Tue, Oct 20, 2020 at 09:46:11AM +0200, Vitaly Kuznetsov wrote:
> >> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
> >> 
> >> > == Background / Problem ==
> >> >
> >> > There are a number of hardware features (MKTME, SEV) which protect guest
> >> > memory from some unauthorized host access. The patchset proposes a purely
> >> > software feature that mitigates some of the same host-side read-only
> >> > attacks.
> >> >
> >> >
> >> > == What does this set mitigate? ==
> >> >
> >> >  - Host kernel ”accidental” access to guest data (think speculation)
> >> >
> >> >  - Host kernel induced access to guest data (write(fd, &guest_data_ptr, len))
> >> >
> >> >  - Host userspace access to guest data (compromised qemu)
> >> >
> >> >  - Guest privilege escalation via compromised QEMU device emulation
> >> >
> >> > == What does this set NOT mitigate? ==
> >> >
> >> >  - Full host kernel compromise.  Kernel will just map the pages again.
> >> >
> >> >  - Hardware attacks
> >> >
> >> >
> >> > The second RFC revision addresses /most/ of the feedback.
> >> >
> >> > I still didn't found a good solution to reboot and kexec. Unprotect all
> >> > the memory on such operations defeat the goal of the feature. Clearing up
> >> > most of the memory before unprotecting what is required for reboot (or
> >> > kexec) is tedious and error-prone.
> >> > Maybe we should just declare them unsupported?
> >> 
> >> Making reboot unsupported is a hard sell. Could you please elaborate on
> >> why you think that "unprotect all" hypercall (or rather a single
> >> hypercall supporting both protecting/unprotecting) defeats the purpose
> >> of the feature?
> >
> > If guest has some data that it prefers not to leak to the host and use the
> > feature for the purpose, share all the memory to get through reboot is a
> > very weak point.
> >
> 
> My point that if it knows that there's something sensitive in its
> memory it should clean it up even today without your feature before
> rebooting to an unknown target.

It's unrealistic to expect everybody to do the right thing.

> >> clean up *all* its memory upon reboot, however:
> >> - It may only clean up the most sensitive parts. This should probably be
> >> done even without this new feature and even on bare metal (think about
> >> next boot target being malicious).
> >> - The attack window shrinks significantly. "Speculative" bugs require
> >> time to exploit and it will only remain open until it boots up again
> >> (few seconds).
> >
> > Maybe it would be cleaner to handle reboot in userspace? If we got the VM
> > rebooted, just reconstruct it from scratch as if it would be new boot.
> 
> We are definitely not trying to protect against malicious KVM so maybe
> we can do the cleanup there (when protection was enabled) so we can
> unprotect everything without risk of a leak?

Do you have any particular codepath in mind? I didn't find anything
suitable so far.
Vitaly Kuznetsov Oct. 23, 2020, 12:01 p.m. UTC | #6
"Kirill A. Shutemov" <kirill@shutemov.name> writes:

> On Wed, Oct 21, 2020 at 04:46:48PM +0200, Vitaly Kuznetsov wrote:
>> "Kirill A. Shutemov" <kirill@shutemov.name> writes:
>> 
>> > Maybe it would be cleaner to handle reboot in userspace? If we got the VM
>> > rebooted, just reconstruct it from scratch as if it would be new boot.
>> 
>> We are definitely not trying to protect against malicious KVM so maybe
>> we can do the cleanup there (when protection was enabled) so we can
>> unprotect everything without risk of a leak?
>
> Do you have any particular codepath in mind? I didn't find anything
> suitable so far.

I didn't put much thought in it but e.g. on x86, what if we put this to
kvm_vcpu_reset() under 'if (kvm_vcpu_is_bsp())' condition? 

The main problem I see is that we can't clean up *all* memory,
e.g. firmware related stuff should stay intact and this contraducts your
KVM_HC_ENABLE_MEM_PROTECTED which protects everything. We can, probably,
get rid of it leaving KVM_HC_MEM_SHARE/KVM_HC_MEM_UNSHARE only shifting
responsibility to define what can be cleaned up on the guest kernel
(stating in the doc that all protected memory will get whiped out on
reboot).
Kirill A. Shutemov Oct. 26, 2020, 3:29 p.m. UTC | #7
On Wed, Oct 21, 2020 at 11:20:56AM -0700, Andy Lutomirski wrote:
> > On Oct 19, 2020, at 11:19 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> 
> > For removing the userspace mapping, use a trick similar to what NUMA
> > balancing does: convert memory that belongs to KVM memory slots to
> > PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> > the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> > The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> > VMA must be treated in a special way in the GUP and fault paths. The flag
> > allows GUP to return the page even though it is mapped with PROT_NONE, but
> > only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> > to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> > would result in -EFAULT.
> >
> 
> I definitely like the direction this patchset is going in, and I think
> that allowing KVM guests to have memory that is inaccessible to QEMU
> is a great idea.
> 
> I do wonder, though: do we really want to do this with these PROT_NONE
> tricks, or should we actually come up with a way to have KVM guest map
> memory that isn't mapped into QEMU's mm_struct at all?  As an example
> of the latter, I mean something a bit like this:
> 
> https://lkml.kernel.org/r/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com
> 
> I don't mean to say that this is a requirement of any kind of
> protected memory like this, but I do think we should understand the
> tradeoffs, in terms of what a full implementation looks like, the
> effort and time frames involved, and the maintenance burden of
> supporting whatever gets merged going forward.

I considered the PROT_NONE trick neat. Complete removing of the mapping
from QEMU would require more changes into KVM and I'm not really familiar
with it.

About tradeoffs: the trick interferes with AutoNUMA. I didn't put much
thought into how we can get it work together. Need to look into it.

Do you see other tradeoffs?
Andy Lutomirski Oct. 26, 2020, 11:58 p.m. UTC | #8
On Mon, Oct 26, 2020 at 8:29 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Wed, Oct 21, 2020 at 11:20:56AM -0700, Andy Lutomirski wrote:
> > > On Oct 19, 2020, at 11:19 PM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > > For removing the userspace mapping, use a trick similar to what NUMA
> > > balancing does: convert memory that belongs to KVM memory slots to
> > > PROT_NONE: all existing entries converted to PROT_NONE with mprotect() and
> > > the newly faulted in pages get PROT_NONE from the updated vm_page_prot.
> > > The new VMA flag -- VM_KVM_PROTECTED -- indicates that the pages in the
> > > VMA must be treated in a special way in the GUP and fault paths. The flag
> > > allows GUP to return the page even though it is mapped with PROT_NONE, but
> > > only if the new GUP flag -- FOLL_KVM -- is specified. Any userspace access
> > > to the memory would result in SIGBUS. Any GUP access without FOLL_KVM
> > > would result in -EFAULT.
> > >
> >
> > I definitely like the direction this patchset is going in, and I think
> > that allowing KVM guests to have memory that is inaccessible to QEMU
> > is a great idea.
> >
> > I do wonder, though: do we really want to do this with these PROT_NONE
> > tricks, or should we actually come up with a way to have KVM guest map
> > memory that isn't mapped into QEMU's mm_struct at all?  As an example
> > of the latter, I mean something a bit like this:
> >
> > https://lkml.kernel.org/r/CALCETrUSUp_7svg8EHNTk3nQ0x9sdzMCU=h8G-Sy6=SODq5GHg@mail.gmail.com
> >
> > I don't mean to say that this is a requirement of any kind of
> > protected memory like this, but I do think we should understand the
> > tradeoffs, in terms of what a full implementation looks like, the
> > effort and time frames involved, and the maintenance burden of
> > supporting whatever gets merged going forward.
>
> I considered the PROT_NONE trick neat. Complete removing of the mapping
> from QEMU would require more changes into KVM and I'm not really familiar
> with it.

I think it's neat.  The big tradeoff I'm concerned about is that it
will likely become ABI once it lands.  That is, if this series lands,
then we will always have to support the case in which QEMU has a
special non-present mapping that is nonetheless reflected as present
in a guest.  This is a bizarre state of affairs, it may become
obsolete if a better API ever shows up, and it might end up placing
constraints on the Linux VM that we don't love going forward.

I don't think my proposal in the referenced thread above is that crazy
or that difficult to implement.  The basic idea is to have a way to
create an mm_struct that is not loaded in CR3 anywhere.  Instead, KVM
will reference it, much as it currently references QEMU's mm_struct,
to mirror mappings into the guest.  This means it would be safe to
have "protected" memory mapped into the special mm_struct because
nothing other than KVM will ever reference the PTEs.  But I think that
someone who really understands the KVM memory mapping code should
chime in.

>
> About tradeoffs: the trick interferes with AutoNUMA. I didn't put much
> thought into how we can get it work together. Need to look into it.
>
> Do you see other tradeoffs?
>
> --
>  Kirill A. Shutemov