[00/16] KVM: arm64: MMIO guard PV services

Message ID 20210715163159.1480168-1-maz@kernel.org (mailing list archive)

Message

Marc Zyngier July 15, 2021, 4:31 p.m. UTC
KVM/arm64 currently considers that any memory access outside of a
memslot is a MMIO access. This so far has served us very well, but
obviously relies on the guest trusting the host, and especially
userspace to do the right thing.

As we keep on hacking away at pKVM, it becomes obvious that this trust
model is not really fit for a confidential computing environment, and
that the guest would require some guarantees that emulation only
occurs on portions of the address space that have clearly been
identified for this purpose.

This series aims at providing the two sides of the above coin:

- a set of PV services (collectively called 'MMIO guard' -- better
  name required!) where the guest can flag portions of its address
  space that it considers as MMIO, with map/unmap semantics. Any
  attempt to access a MMIO range outside of these regions will result
  in an external abort being injected.

- a set of hooks into the ioremap code allowing a Linux guest to tell
  KVM about things it wants to consider as MMIO. I definitely hate this
  part of the series, as it feels clumsy and brittle.
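
As a rough illustration of the map/unmap semantics described above, here
is a minimal, self-contained model in plain C. It is not the code from
the series: the registry structure, the function names and the
fixed-size table are all assumptions made for the sketch, and the real
implementation tracks this state in the stage-2 page tables rather than
in an array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT        12
#define PAGE_SIZE         (1UL << PAGE_SHIFT)
#define MAX_GUARDED_PAGES 64	/* arbitrary limit for this sketch */

/* Hypothetical registry of the IPA pages the guest has flagged as MMIO. */
struct mmio_guard {
	uint64_t pfn[MAX_GUARDED_PAGES];
	size_t nr;
};

static bool guard_find(const struct mmio_guard *g, uint64_t pfn, size_t *idx)
{
	for (size_t i = 0; i < g->nr; i++) {
		if (g->pfn[i] == pfn) {
			if (idx)
				*idx = i;
			return true;
		}
	}
	return false;
}

/* "map": flag the page containing @ipa as MMIO (idempotent). */
static int guard_map(struct mmio_guard *g, uint64_t ipa)
{
	uint64_t pfn = ipa >> PAGE_SHIFT;

	if (guard_find(g, pfn, NULL))
		return 0;
	if (g->nr == MAX_GUARDED_PAGES)
		return -1;
	g->pfn[g->nr++] = pfn;
	return 0;
}

/* "unmap": the page containing @ipa is no longer considered MMIO. */
static int guard_unmap(struct mmio_guard *g, uint64_t ipa)
{
	size_t idx;

	if (!guard_find(g, ipa >> PAGE_SHIFT, &idx))
		return -1;
	g->pfn[idx] = g->pfn[--g->nr];	/* swap-remove */
	return 0;
}

/* What an ioremap hook could do: register every page of the range. */
static int guard_ioremap_hook(struct mmio_guard *g, uint64_t ipa, size_t size)
{
	uint64_t start = ipa & ~(uint64_t)(PAGE_SIZE - 1);
	uint64_t end = ipa + size;

	assert(size > 0);
	for (uint64_t p = start; p < end; p += PAGE_SIZE) {
		if (guard_map(g, p))
			return -1;
	}
	return 0;
}

/* false means the access would get an external abort injected. */
static bool guard_access_allowed(const struct mmio_guard *g, uint64_t ipa)
{
	return guard_find(g, ipa >> PAGE_SHIFT, NULL);
}
```

In this model, an ioremap() of two pages makes both accessible, an
access to the following page is still refused, and an unmap revokes a
single page without touching its neighbour.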

For now, the enrolment in this scheme is controlled by a guest kernel
command-line parameter, but it is expected that KVM will enforce this
for protected VMs.

Note that this crucially misses a save/restore interface for
non-protected VMs, and I currently don't have a good solution for
that. Ideas welcome.

I also plan to use this series as a base for some other purposes,
namely to trick the guest into telling us how it maps things like
prefetchable BARs (see the discussion at [1]). That part is not
implemented yet, but there is already some provision to pass the MAIR
index across.

Patches on top of 5.14-rc1, branch pushed at the usual location.

[1] 20210429162906.32742-1-sdonthineni@nvidia.com

Marc Zyngier (16):
  KVM: arm64: Generalise VM features into a set of flags
  KVM: arm64: Don't issue CMOs when the physical address is invalid
  KVM: arm64: Turn kvm_pgtable_stage2_set_owner into
    kvm_pgtable_stage2_annotate
  KVM: arm64: Add MMIO checking infrastructure
  KVM: arm64: Plumb MMIO checking into the fault handling
  KVM: arm64: Force a full unmap on vpcu reinit
  KVM: arm64: Wire MMIO guard hypercalls
  KVM: arm64: Add tracepoint for failed MMIO guard check
  KVM: arm64: Advertise a capability for MMIO guard
  KVM: arm64: Add some documentation for the MMIO guard feature
  firmware/smccc: Call arch-specific hook on discovering KVM services
  mm/ioremap: Add arch-specific callbacks on ioremap/iounmap calls
  arm64: Implement ioremap/iounmap hooks calling into KVM's MMIO guard
  arm64: Enroll into KVM's MMIO guard if required
  arm64: Add a helper to retrieve the PTE of a fixmap
  arm64: Register earlycon fixmap with the MMIO guard

 .../admin-guide/kernel-parameters.txt         |   3 +
 Documentation/virt/kvm/arm/index.rst          |   1 +
 Documentation/virt/kvm/arm/mmio-guard.rst     |  73 +++++++++++
 arch/arm/include/asm/hypervisor.h             |   1 +
 arch/arm64/include/asm/fixmap.h               |   2 +
 arch/arm64/include/asm/hypervisor.h           |   2 +
 arch/arm64/include/asm/kvm_host.h             |  14 ++-
 arch/arm64/include/asm/kvm_mmu.h              |   5 +
 arch/arm64/include/asm/kvm_pgtable.h          |  12 +-
 arch/arm64/kernel/setup.c                     |   6 +
 arch/arm64/kvm/arm.c                          |  14 ++-
 arch/arm64/kvm/hyp/nvhe/mem_protect.c         |  14 ++-
 arch/arm64/kvm/hyp/pgtable.c                  |  36 +++---
 arch/arm64/kvm/hypercalls.c                   |  20 +++
 arch/arm64/kvm/mmio.c                         |  13 +-
 arch/arm64/kvm/mmu.c                          | 117 ++++++++++++++++++
 arch/arm64/kvm/trace_arm.h                    |  17 +++
 arch/arm64/mm/ioremap.c                       | 107 ++++++++++++++++
 arch/arm64/mm/mmu.c                           |  15 +++
 drivers/firmware/smccc/kvm_guest.c            |   4 +
 include/linux/arm-smccc.h                     |  28 +++++
 include/linux/io.h                            |   3 +
 include/uapi/linux/kvm.h                      |   1 +
 mm/ioremap.c                                  |  13 +-
 mm/vmalloc.c                                  |   8 ++
 25 files changed, 492 insertions(+), 37 deletions(-)
 create mode 100644 Documentation/virt/kvm/arm/mmio-guard.rst

Comments

Andrew Jones July 21, 2021, 9:42 p.m. UTC | #1
On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> KVM/arm64 currently considers that any memory access outside of a
> memslot is a MMIO access. This so far has served us very well, but
> obviously relies on the guest trusting the host, and especially
> userspace to do the right thing.
> 
> As we keep on hacking away at pKVM, it becomes obvious that this trust
> model is not really fit for a confidential computing environment, and
> that the guest would require some guarantees that emulation only
> occurs on portions of the address space that have clearly been
> identified for this purpose.

This trust model is hard for me to reason about. userspace is trusted to
control the life cycle of the VM, to prepare the memslots for the VM,
and [presumably] identify what MMIO ranges are valid, yet it's not
trusted to handle invalid MMIO accesses. I'd like to learn more about
this model and the userspace involved.

> 
> This series aims at providing the two sides of the above coin:
> 
> - a set of PV services (collectively called 'MMIO guard' -- better
>   name required!) where the guest can flag portions of its address
>   space that it considers as MMIO, with map/unmap semantics. Any
>   attempt to access a MMIO range outside of these regions will result
>   in an external abort being injected.
> 
> - a set of hooks into the ioremap code allowing a Linux guest to tell
>   KVM about things it wants to consider as MMIO. I definitely hate this
>   part of the series, as it feels clumsy and brittle.
> 
> For now, the enrolment in this scheme is controlled by a guest kernel
> command-line parameter, but it is expected that KVM will enforce this
> for protected VMs.
> 
> Note that this crucially misses a save/restore interface for
> non-protected VMs, and I currently don't have a good solution for
> that. Ideas welcome.
> 
> I also plan to use this series as a base for some other purposes,
> namely to trick the guest into telling us how it maps things like
> prefetchable BARs (see the discussion at [1]). That part is not
> implemented yet, but there is already some provision to pass the MAIR
> index across.
> 
> Patches on top of 5.14-rc1, branch pushed at the usual location.
> 
> [1] 20210429162906.32742-1-sdonthineni@nvidia.com

The fun never stops.

Thanks,
drew

Marc Zyngier July 22, 2021, 10 a.m. UTC | #2
On Wed, 21 Jul 2021 22:42:43 +0100,
Andrew Jones <drjones@redhat.com> wrote:
> 
> On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> > KVM/arm64 currently considers that any memory access outside of a
> > memslot is a MMIO access. This so far has served us very well, but
> > obviously relies on the guest trusting the host, and especially
> > userspace to do the right thing.
> > 
> > As we keep on hacking away at pKVM, it becomes obvious that this trust
> > model is not really fit for a confidential computing environment, and
> > that the guest would require some guarantees that emulation only
> > occurs on portions of the address space that have clearly been
> > identified for this purpose.
> 
> This trust model is hard for me to reason about. userspace is trusted to
> control the life cycle of the VM, to prepare the memslots for the VM,
> and [presumably] identify what MMIO ranges are valid, yet it's not
> trusted to handle invalid MMIO accesses. I'd like to learn more about
> this model and the userspace involved.

Imagine the following scenario:

On top of the normal memory described as memslots (which pKVM will
ensure that userspace cannot access), a malicious userspace describes
to the guest another memory region in a firmware table and does not
back it with a memslot.

The hypervisor cannot validate this firmware description (imagine
doing ACPI and DT parsing at EL2...), so the guest starts using this
"memory" for something, and data slowly trickles all the way to EL0.
Not what you wanted.

To ensure that this doesn't happen, we reverse the problem: userspace
(and ultimately the EL1 kernel) doesn't get involved on a translation
fault outside of a memslot *unless* the guest has explicitly asked for
that page to be handled as MMIO. With that, we have a full
description of the IPA space contained in the S2 page tables:

- memory described via a memslot,
- directly mapped devices (GICv2, for example),
- MMIO exposed for emulation

and anything else is an invalid access that results in an abort.
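
To make the resulting decision concrete, here is a small standalone
sketch of how a faulting IPA could be classified under this scheme. It
is only a model written for this explanation: the types, the function
names and the flat range tables are hypothetical, and the actual series
derives this information from the S2 page tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The three legitimate kinds of IPA, plus "everything else". */
enum ipa_kind {
	IPA_MEMSLOT,		/* memory described via a memslot */
	IPA_DEVICE,		/* directly mapped device */
	IPA_MMIO_EMULATED,	/* MMIO the guest asked to have emulated */
	IPA_INVALID,		/* anything else: inject an abort */
};

struct ipa_range {
	uint64_t base;
	uint64_t size;
};

/* Hypothetical flat description of a VM's IPA space. */
struct vm_layout {
	const struct ipa_range *memslots;
	size_t nr_memslots;
	const struct ipa_range *devices;
	size_t nr_devices;
	const struct ipa_range *guarded;	/* registered by the guest */
	size_t nr_guarded;
};

static bool in_ranges(const struct ipa_range *r, size_t n, uint64_t ipa)
{
	for (size_t i = 0; i < n; i++) {
		if (ipa >= r[i].base && ipa - r[i].base < r[i].size)
			return true;
	}
	return false;
}

static enum ipa_kind classify_fault(const struct vm_layout *vm, uint64_t ipa)
{
	assert(vm != NULL);

	if (in_ranges(vm->memslots, vm->nr_memslots, ipa))
		return IPA_MEMSLOT;
	if (in_ranges(vm->devices, vm->nr_devices, ipa))
		return IPA_DEVICE;
	if (in_ranges(vm->guarded, vm->nr_guarded, ipa))
		return IPA_MMIO_EMULATED;
	return IPA_INVALID;
}

/* Userspace only gets involved for MMIO the guest explicitly asked for. */
static bool forward_to_userspace(const struct vm_layout *vm, uint64_t ipa)
{
	return classify_fault(vm, ipa) == IPA_MMIO_EMULATED;
}
```

The point of the model is the last function: an address that a
malicious firmware table advertises as "memory" but that has no memslot
and was never registered by the guest ends up as IPA_INVALID, and is
never handed to the VMM for emulation.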

Does this make sense to you?

Thanks,

	M.

Andrew Jones July 22, 2021, 1:25 p.m. UTC | #3
On Thu, Jul 22, 2021 at 11:00:26AM +0100, Marc Zyngier wrote:
> On Wed, 21 Jul 2021 22:42:43 +0100,
> Andrew Jones <drjones@redhat.com> wrote:
> > 
> > On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> > > KVM/arm64 currently considers that any memory access outside of a
> > > memslot is a MMIO access. This so far has served us very well, but
> > > obviously relies on the guest trusting the host, and especially
> > > userspace to do the right thing.
> > > 
> > > As we keep on hacking away at pKVM, it becomes obvious that this trust
> > > model is not really fit for a confidential computing environment, and
> > > that the guest would require some guarantees that emulation only
> > > occurs on portions of the address space that have clearly been
> > > identified for this purpose.
> > 
> > This trust model is hard for me to reason about. userspace is trusted to
> > control the life cycle of the VM, to prepare the memslots for the VM,
> > and [presumably] identify what MMIO ranges are valid, yet it's not
> > trusted to handle invalid MMIO accesses. I'd like to learn more about
> > this model and the userspace involved.
> 
> Imagine the following scenario:
> 
> On top of the normal memory described as memslots (which pKVM will
> ensure that userspace cannot access),

Ah, I didn't know that part.

> a malicious userspace describes
> to the guest another memory region in a firmware table and does not
> back it with a memslot.
> 
> The hypervisor cannot validate this firmware description (imagine
> doing ACPI and DT parsing at EL2...), so the guest starts using this
> "memory" for something, and data slowly trickles all the way to EL0.
> Not what you wanted.

Yes, I see that now, in light of the above.

> 
> To ensure that this doesn't happen, we reverse the problem: userspace
> (and ultimately the EL1 kernel) doesn't get involved on a translation
> fault outside of a memslot *unless* the guest has explicitly asked for
> that page to be handled as MMIO. With that, we have a full
> description of the IPA space contained in the S2 page tables:
> 
> - memory described via a memslot,
> - directly mapped devices (GICv2, for example),
> - MMIO exposed for emulation
> 
> and anything else is an invalid access that results in an abort.
> 
> Does this make sense to you?

Now I understand better, but if we're worried about malicious userspaces,
then how do we protect the guest from "bad" MMIO devices that have been
described to it? The guest can request access to those using this new
mechanism.

Thanks,
drew

Marc Zyngier July 22, 2021, 3:30 p.m. UTC | #4
On Thu, 22 Jul 2021 14:25:15 +0100,
Andrew Jones <drjones@redhat.com> wrote:
> 
> On Thu, Jul 22, 2021 at 11:00:26AM +0100, Marc Zyngier wrote:
> > On Wed, 21 Jul 2021 22:42:43 +0100,
> > Andrew Jones <drjones@redhat.com> wrote:
> > > 
> > > On Thu, Jul 15, 2021 at 05:31:43PM +0100, Marc Zyngier wrote:
> > > > KVM/arm64 currently considers that any memory access outside of a
> > > > memslot is a MMIO access. This so far has served us very well, but
> > > > obviously relies on the guest trusting the host, and especially
> > > > userspace to do the right thing.
> > > > 
> > > > As we keep on hacking away at pKVM, it becomes obvious that this trust
> > > > model is not really fit for a confidential computing environment, and
> > > > that the guest would require some guarantees that emulation only
> > > > occurs on portions of the address space that have clearly been
> > > > identified for this purpose.
> > > 
> > > This trust model is hard for me to reason about. userspace is trusted to
> > > control the life cycle of the VM, to prepare the memslots for the VM,
> > > and [presumably] identify what MMIO ranges are valid, yet it's not
> > > trusted to handle invalid MMIO accesses. I'd like to learn more about
> > > this model and the userspace involved.
> > 
> > Imagine the following scenario:
> > 
> > On top of the normal memory described as memslots (which pKVM will
> > ensure that userspace cannot access),
> 
> Ah, I didn't know that part.

Yeah, that's the crucial bit. By default, pKVM guests do not share any
memory with anyone, so the memslots are made inaccessible from both
the VMM and the host kernel. The guest has to explicitly change the
state of the memory it wants to share back with the host for things
like IO.

> 
> > a malicious userspace describes
> > to the guest another memory region in a firmware table and does not
> > back it with a memslot.
> > 
> > The hypervisor cannot validate this firmware description (imagine
> > doing ACPI and DT parsing at EL2...), so the guest starts using this
> > "memory" for something, and data slowly trickles all the way to EL0.
> > Not what you wanted.
> 
> Yes, I see that now, in light of the above.
> 
> > 
> > To ensure that this doesn't happen, we reverse the problem: userspace
> > (and ultimately the EL1 kernel) doesn't get involved on a translation
> > fault outside of a memslot *unless* the guest has explicitly asked for
> > that page to be handled as MMIO. With that, we have a full
> > description of the IPA space contained in the S2 page tables:
> > 
> > - memory described via a memslot,
> > - directly mapped devices (GICv2, for example),
> > - MMIO exposed for emulation
> > 
> > and anything else is an invalid access that results in an abort.
> > 
> > Does this make sense to you?
> 
> Now I understand better, but if we're worried about malicious userspaces,
> then how do we protect the guest from "bad" MMIO devices that have been
> described to it? The guest can request access to those using this new
> mechanism.

We don't try to do anything about a malicious IO device. Any IO should
be considered malicious, and you don't want to give it anything in
clear text if it is supposed to be secret.

Eventually, you'd probably want directly assigned devices that can
attest to the guest that they are what they pretend to be, but that's
a long way away. For now, we only want to enable virtio with a reduced
level of trust (bounce buffering via shared pages for DMA, and reduced
MMIO exposure).

Thanks,

	M.