[RFC,00/13] XOM for KVM guest userspace

Message ID 20191003212400.31130-1-rick.p.edgecombe@intel.com

Message

Edgecombe, Rick P Oct. 3, 2019, 9:23 p.m. UTC
This patchset enables the ability for KVM guests to create execute-only (XO)
memory by utilizing EPT based XO permissions. XO memory is currently supported
on Intel hardware natively for CPUs with PKU, but this enables it on older
platforms, and can support XO for kernel memory as well.

In the guest, this patchset enables XO memory for userspace, using the existing
interface (mprotect PROT_EXEC && !PROT_READ) used for arm64 and x86 PKU HW. A
larger follow on to this enables setting the kernel text as XO, but this is just
the KVM pieces and guest userspace. The yet un-posted QEMU patches to work with
these changes are here:
https://github.com/redgecombe/qemu/
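
As an illustration of the existing userspace interface mentioned above, here is
a minimal sketch of how a guest process could ask for execute-only memory. The
helper name map_exec_only() is made up for this example and is not part of the
series. Whether the resulting mapping is really unreadable depends on the
platform (PKU, arm64, or a guest running with this series); elsewhere on x86
the kernel simply leaves the memory readable.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: copy `len` bytes of code into a fresh anonymous
 * mapping, then drop read and write access so only PROT_EXEC remains.
 * Returns the mapping, or NULL on failure.
 */
static void *map_exec_only(const void *code, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;

	memcpy(p, code, len);

	/* PROT_EXEC without PROT_READ is the execute-only request. */
	if (mprotect(p, len, PROT_EXEC) != 0) {
		munmap(p, len);
		return NULL;
	}
	return p;
}
```

No new system call or flag is involved, so existing callers that already pass
PROT_EXEC alone would pick up the stricter behavior without changes.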

Guest Interface
===============
The way XO is exposed to the guest is by creating a virtual XO permission bit in
the guest page tables.

There are normally four kinds of page table bits:
1. Bits ignored by the hardware
2. Bits that must be 0 or else the hardware throws a RSVD page fault
3. Bits used by the hardware for addresses
4. Bits used by the hardware for permissions and other features

We want to find a bit in the guest page tables to use to mean execute-only
memory so that guest can map the same physical memory with different permissions
simultaneously like other permission bits. We also want the translations to be
done by the hardware, which means we can't use ignored or reserved bits. We also
can't easily re-purpose a feature bit. This leaves address bits. The idea here
is we will take an address bit and re-purpose it as a feature bit.

The first thing we have to do is tell the guest that it can't use the address
bit we are stealing. Luckily there is an existing CPUID leaf that conveys the
number of physical address bits which is already intercepted by KVM, and so we
can reduce it as needed. This puts what was previously the top physical address
bit into what is defined as the "reserved area" of the PTE.
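
To make the CPUID adjustment concrete, here is a small sketch, assuming the
leaf in question is 0x80000008, whose EAX[7:0] holds the physical address
width. The function name hide_top_phys_bit() is invented for this example and
does not come from the patches.

```c
#include <stdint.h>

/* CPUID.80000008H:EAX[7:0] holds the number of physical address bits. */
#define PHYS_BITS_MASK 0xffu

/*
 * Hypothetical helper: return the EAX value to show the guest, with the
 * reported physical address width reduced by one so the top bit can be
 * repurposed. All other fields of EAX are passed through unchanged.
 */
static uint32_t hide_top_phys_bit(uint32_t eax)
{
	uint32_t phys_bits = eax & PHYS_BITS_MASK;

	if (phys_bits == 0)
		return eax;	/* nothing sensible to reduce */

	return (eax & ~PHYS_BITS_MASK) | (phys_bits - 1);
}
```

For example, a host value of 0x3027 (48 linear bits, 39 physical bits) would be
shown to the guest as 0x3026.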

Here is how the PTE would be transformed, where M is the number of physical bits
exposed by the CPUID leaf.

Normal:
|--------------------------------------------------------|
| .. |     RSVD (51 to M)     |   PFN (M-1 to 12)   | .. |
|--------------------------------------------------------|

KVM XO (with M reduced by 1):
|--------------------------------------------------------|
| .. |  RSVD (51 to M+1) | XO |   PFN (M-1 to 12)   | .. |
|--------------------------------------------------------|
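
The layout above can be sketched with a few bit helpers. This is only an
illustration; the names are made up, and m stands for the (already reduced)
number of physical address bits the guest sees, so the XO bit sits at bit m.

```c
#include <stdbool.h>
#include <stdint.h>

/* m is the guest-visible physical address width; expected range 13..51. */

/* The repurposed address bit, directly above the guest-visible PFN. */
static uint64_t pte_xo_bit(unsigned int m)
{
	return 1ULL << m;
}

/* Bits m-1 down to 12: the PFN field as the guest understands it. */
static uint64_t pte_pfn_mask(unsigned int m)
{
	return ((1ULL << m) - 1) & ~0xfffULL;
}

static bool pte_is_xo(uint64_t pte, unsigned int m)
{
	return (pte & pte_xo_bit(m)) != 0;
}

/* Guest view of the PFN, with the XO bit masked off. */
static uint64_t pte_guest_pfn(uint64_t pte, unsigned int m)
{
	return (pte & pte_pfn_mask(m)) >> 12;
}
```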

So the way XOM is exposed to the guest is by having the VMM provide two aliases
in the guest physical address space for the same memory. The first half has
normal EPT permissions, and the second half has XO permissions. This way the
high PFN bit in the guest page tables acts like an XO permission bit. The VMM
reports to the guest a number of physical address bits that exclude the XO bit,
so from the guest perspective the XO bit is in the region that would be
"reserved", and from the CPU's perspective the bit is still a normal PFN bit.

Backwards Compatibility
-----------------------
Since software would have previously received a #PF with the RSVD error code
set, when the HW encountered any set bits in the region 51 to M, there was some
internal discussion on whether this should have a virtual MSR for the OS to turn
it on only if the OS knows it isn't relying on this behavior for bit M. The
argument against needing an MSR is this blurb from the Intel SDM about reserved
bits:
"Bits reserved in the paging-structure entries are reserved for future
functionality. Software developers should be aware that such bits may be used in
the future and that a paging-structure entry that causes a page-fault exception
on one processor might not do so in the future."

So in the current patchset there is no MSR write required for the guest to turn
on this feature. It will have this behavior whenever qemu is run with
"-cpu +xo".

KVM XO CPUID Feature Bit
------------------------
Although this patchset targets KVM, the idea is that this interface might be
implemented by other hypervisors. Since it looks much like a normal CPU
feature, it would be nice if there was a single CPUID bit to check across
different implementations, as there often is for real CPU features. In the
past there was a proposal for "generic leaves" [1], where regions are assigned
for VMMs to define, but where the behavior will not change across VMMs. This
patchset follows this proposal and defines a bit in a new leaf to expose the
presence of the above described behavior. I'm hoping to get some suggestions on
the right way to expose it by this RFC.

Injecting Page Faults
---------------------
When there is an attempt to read memory from an XO address range, a #PF is
injected into the guest with P=1, W/R=0, RSVD=0, I/D=0. When there is an attempt
to write, it is P=1, W/R=1, RSVD=0, I/D=0.
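
For reference, the two error codes described above can be composed as in the
sketch below. The bit positions are the architectural x86 page fault error
code bits. The from_user parameter (U/S bit) is an addition for this example,
since the text above does not say how that bit is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 #PF error code bits. */
enum {
	PF_P    = 1u << 0,	/* fault on a present page */
	PF_WR   = 1u << 1,	/* access was a write */
	PF_US   = 1u << 2,	/* access came from user mode */
	PF_RSVD = 1u << 3,	/* reserved bit set in a paging entry */
	PF_ID   = 1u << 4,	/* access was an instruction fetch */
};

/*
 * Hypothetical helper: error code for a read or write that hit XO memory.
 * RSVD and I/D stay clear in both cases, as described above.
 */
static uint32_t xo_violation_error_code(bool is_write, bool from_user)
{
	uint32_t code = PF_P;

	if (is_write)
		code |= PF_WR;
	if (from_user)
		code |= PF_US;	/* assumption: follows the usual CPL rule */
	return code;
}
```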

Implementation
==============
In KVM this patchset adds a new memslot type, KVM_MEM_EXECONLY, which maps
memory as execute-only via EPT permissions, and will inject a #PF to the guest
if there is a violation. The x86 emulator is also made aware of XO memory
permissions, and virtualized features that act on PFNs are made aware that VT's
view of the GFN includes the permission bit (and so needs to be masked to get
the guest's view of the PFN).

QEMU manipulates the physical address bits exposed to the guest and adds an
extra KVM_MEM_EXECONLY memslot that points to the same userspace memory in the
XO range for every memslot added in the normal range.
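
The memslot duplication can be pictured roughly as below. This is a sketch
only: struct slot_desc mirrors the fields of a KVM memory region for
illustration and is not the real uapi structure, EXAMPLE_MEM_EXECONLY is a
placeholder for the KVM_MEM_EXECONLY value defined by the series, and
make_xo_alias() is a made-up name.

```c
#include <stdint.h>

/* Placeholder value; the real KVM_MEM_EXECONLY comes from the series. */
#define EXAMPLE_MEM_EXECONLY (1u << 31)

/* Illustrative stand-in for a memslot description. */
struct slot_desc {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
};

/*
 * Build the XO alias of a normal slot: same size and same backing userspace
 * memory, but placed in the upper half of the guest physical address space
 * (bit m set, where m is the guest-visible address width) and flagged as
 * execute-only.
 */
static struct slot_desc make_xo_alias(struct slot_desc normal, unsigned int m,
				      uint32_t alias_slot)
{
	struct slot_desc xo = normal;

	xo.slot = alias_slot;
	xo.flags |= EXAMPLE_MEM_EXECONLY;
	xo.guest_phys_addr |= 1ULL << m;
	return xo;
}
```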

The violating linear address is determined from the EPT feature that provides
the linear address of the violation, if available. If it is not available, KVM
emulates the violating instruction to determine which linear address to use in
the injected fault.

Performance
===========
The performance impact is not fully characterized yet. In the larger patchset
that sets kernel text to be XO, there wasn't any measurable impact compiling
the kernel. The hope is that there will not be a large impact, but more testing
is needed.

Status
======
Regression testing is still needed including the nested virtualization case and
impact of XO in the other memslot address spaces. This is based on 5.3.

[1] https://lwn.net/Articles/301888/


Rick Edgecombe (13):
  kvm: Enable MTRR to work with GFNs with perm bits
  kvm: Add support for X86_FEATURE_KVM_XO
  kvm: Add XO memslot type
  kvm, vmx: Add support for gva exit qualification
  kvm: Add #PF injection for KVM XO
  kvm: Add KVM_CAP_EXECONLY_MEM
  kvm: Add docs for KVM_CAP_EXECONLY_MEM
  x86/boot: Rename USE_EARLY_PGTABLE_L5
  x86/cpufeature: Add detection of KVM XO
  x86/mm: Add NR page bit for KVM XO
  x86, ptdump: Add NR bit to page table dump
  mmap: Add XO support for KVM XO
  x86/Kconfig: Add Kconfig for KVM based XO

 Documentation/virt/kvm/api.txt                | 16 ++--
 arch/x86/Kconfig                              | 13 +++
 arch/x86/boot/compressed/misc.h               |  2 +-
 arch/x86/include/asm/cpufeature.h             |  7 +-
 arch/x86/include/asm/cpufeatures.h            |  5 +-
 arch/x86/include/asm/disabled-features.h      |  3 +-
 arch/x86/include/asm/kvm_host.h               |  7 ++
 arch/x86/include/asm/pgtable_32_types.h       |  1 +
 arch/x86/include/asm/pgtable_64_types.h       | 30 ++++++-
 arch/x86/include/asm/pgtable_types.h          | 13 +++
 arch/x86/include/asm/required-features.h      |  3 +-
 arch/x86/include/asm/sparsemem.h              |  4 +-
 arch/x86/include/asm/vmx.h                    |  1 +
 arch/x86/include/uapi/asm/kvm_para.h          |  3 +
 arch/x86/kernel/cpu/common.c                  |  7 +-
 arch/x86/kernel/head64.c                      | 43 +++++++++-
 arch/x86/kvm/cpuid.c                          |  7 ++
 arch/x86/kvm/cpuid.h                          |  1 +
 arch/x86/kvm/mmu.c                            | 79 +++++++++++++++++--
 arch/x86/kvm/mtrr.c                           |  8 ++
 arch/x86/kvm/paging_tmpl.h                    | 29 +++++--
 arch/x86/kvm/svm.c                            |  6 ++
 arch/x86/kvm/vmx/vmx.c                        |  6 ++
 arch/x86/kvm/x86.c                            |  9 ++-
 arch/x86/mm/dump_pagetables.c                 |  6 +-
 arch/x86/mm/init.c                            |  3 +
 arch/x86/mm/kasan_init_64.c                   |  2 +-
 include/uapi/linux/kvm.h                      |  2 +
 mm/mmap.c                                     | 30 +++++--
 .../arch/x86/include/asm/disabled-features.h  |  3 +-
 tools/include/uapi/linux/kvm.h                |  1 +
 virt/kvm/kvm_main.c                           | 15 +++-
 32 files changed, 322 insertions(+), 43 deletions(-)

Comments

Paolo Bonzini Oct. 4, 2019, 7:22 a.m. UTC | #1
On 03/10/19 23:23, Rick Edgecombe wrote:
> Since software would have previously received a #PF with the RSVD error code
> set, when the HW encountered any set bits in the region 51 to M, there was some
> internal discussion on whether this should have a virtual MSR for the OS to turn
> it on only if the OS knows it isn't relying on this behavior for bit M. The
> argument against needing an MSR is this blurb from the Intel SDM about reserved
> bits:
> "Bits reserved in the paging-structure entries are reserved for future
> functionality. Software developers should be aware that such bits may be used in
> the future and that a paging-structure entry that causes a page-fault exception
> on one processor might not do so in the future."
> 
> So in the current patchset there is no MSR write required for the guest to turn
> on this feature. It will have this behavior whenever qemu is run with
> "-cpu +xo".

I think the part of the manual that you quote is out of date.  Whenever
Intel has "unreserved" bits in the page tables they have done that only
if specific bits in CR4 or EFER or VMCS execution controls are set; this
is a good thing, and I'd really like it to be codified in the SDM.

The only bits for which this does not (and should not) apply are indeed
bits 51:MAXPHYADDR.  But the SDM makes it clear that bits 51:MAXPHYADDR
are reserved, hence "unreserving" bits based on just a QEMU command line
option would be against the specification.  So, please don't do this and
introduce an MSR that enables the feature.

Paolo
Andy Lutomirski Oct. 4, 2019, 2:56 p.m. UTC | #2
On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe
<rick.p.edgecombe@intel.com> wrote:
>
> This patchset enables the ability for KVM guests to create execute-only (XO)
> memory by utilizing EPT based XO permissions. XO memory is currently supported
> on Intel hardware natively for CPU's with PKU, but this enables it on older
> platforms, and can support XO for kernel memory as well.

The patchset seems to sometimes call this feature "XO" and sometimes
call it "NR".  To me, XO implies no-read and no-write, whereas NR
implies just no-read.  Can you please clarify *exactly* what the new
bit does and be consistent?

I suggest that you make it NR, which allows for PROT_EXEC and
PROT_EXEC|PROT_WRITE and plain PROT_WRITE.  WX is of dubious value,
but I can imagine plain W being genuinely useful for logging and for
JITs that could maintain a W and a separate X mapping of some code.
In other words, with an NR bit, all 8 logical access modes are
possible.  Also, keeping the paging bits more orthogonal seems nice --
we already have a bit that controls write access.
Edgecombe, Rick P Oct. 4, 2019, 7:03 p.m. UTC | #3
On Fri, 2019-10-04 at 09:22 +0200, Paolo Bonzini wrote:
> On 03/10/19 23:23, Rick Edgecombe wrote:
> > Since software would have previously received a #PF with the RSVD error code
> > set, when the HW encountered any set bits in the region 51 to M, there was
> > some
> > internal discussion on whether this should have a virtual MSR for the OS to
> > turn
> > it on only if the OS knows it isn't relying on this behavior for bit M. The
> > argument against needing an MSR is this blurb from the Intel SDM about
> > reserved
> > bits:
> > "Bits reserved in the paging-structure entries are reserved for future
> > functionality. Software developers should be aware that such bits may be
> > used in
> > the future and that a paging-structure entry that causes a page-fault
> > exception
> > on one processor might not do so in the future."
> > 
> > So in the current patchset there is no MSR write required for the guest to
> > turn
> > on this feature. It will have this behavior whenever qemu is run with
> > "-cpu +xo".
> 
> I think the part of the manual that you quote is out of date.  Whenever
> Intel has "unreserved" bits in the page tables they have done that only
> if specific bits in CR4 or EFER or VMCS execution controls are set; this
> is a good thing, and I'd really like it to be codified in the SDM.
> 
> The only bits for which this does not (and should not) apply are indeed
> bits 51:MAXPHYADDR.  But the SDM makes it clear that bits 51:MAXPHYADDR
> are reserved, hence "unreserving" bits based on just a QEMU command line
> option would be against the specification.  So, please don't do this and
> introduce an MSR that enables the feature.
> 
> Paolo
> 
Hi Paolo,

Thanks for taking a look!

Fair enough, MSR it is.

Rick
Edgecombe, Rick P Oct. 4, 2019, 8:09 p.m. UTC | #4
On Fri, 2019-10-04 at 07:56 -0700, Andy Lutomirski wrote:
> On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > This patchset enables the ability for KVM guests to create execute-only (XO)
> > memory by utilizing EPT based XO permissions. XO memory is currently
> > supported
> > on Intel hardware natively for CPU's with PKU, but this enables it on older
> > platforms, and can support XO for kernel memory as well.
> 
> The patchset seems to sometimes call this feature "XO" and sometimes
> call it "NR".  To me, XO implies no-read and no-write, whereas NR
> implies just no-read.  Can you please clarify *exactly* what the new
> bit does and be consistent?
> 
> I suggest that you make it NR, which allows for PROT_EXEC and
> PROT_EXEC|PROT_WRITE and plain PROT_WRITE.  WX is of dubious value,
> but I can imagine plain W being genuinely useful for logging and for
> JITs that could maintain a W and a separate X mapping of some code.
> In other words, with an NR bit, all 8 logical access modes are
> possible.  Also, keeping the paging bits more orthogonal seems nice --
> we already have a bit that controls write access.

Sorry, yes the behavior of this bit needs to be documented a lot better. I will
definitely do this for the next version.

To clarify, since the EPT permissions in the XO/NR range are executable, and not
readable or writeable the new bit really means XO, but only when NX is 0 since
the guest page tables are being checked as well. When NR=1, W=1, and NX=0, the
memory is still XO.

NR was picked over XO because as you say. The idea is that it can be defined
that in the case of KVM XO, NR and writable is not a valid combination, like
writeable but not readable is defined as not valid for the EPT.

I *think* whenever NX=1, NR=1 it should be similar to not present in that it
can't be used for anything or have its translation cached. I am not 100% sure on
the cached part and was thinking of just making the "spec" that the translation
caching behavior is undefined. I can look into this if anyone thinks we need to
know. In the current patchset it shouldn't be possible to create this
combination.

Since write-only memory isn't supported in EPT we can't do the same trick to
create a new HW permission. But I guess if we emulate it, we could make the new
bit mean just NR, and support write-only by allowing emulation when KVM gets a
write EPT violation to NR memory. It might still be useful for the JIT case you
mentioned, or a shared memory mailbox. On the other hand, userspace might be
surprised to find that memory runs at different speeds depending on the
permission. I also wonder if any userspace apps are asking for just PROT_WRITE
and expecting readable memory.

Thanks,

Rick
Andy Lutomirski Oct. 5, 2019, 1:33 a.m. UTC | #5
On Fri, Oct 4, 2019 at 1:10 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Fri, 2019-10-04 at 07:56 -0700, Andy Lutomirski wrote:
> > On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > This patchset enables the ability for KVM guests to create execute-only (XO)
> > > memory by utilizing EPT based XO permissions. XO memory is currently
> > > supported
> > > on Intel hardware natively for CPU's with PKU, but this enables it on older
> > > platforms, and can support XO for kernel memory as well.
> >
> > The patchset seems to sometimes call this feature "XO" and sometimes
> > call it "NR".  To me, XO implies no-read and no-write, whereas NR
> > implies just no-read.  Can you please clarify *exactly* what the new
> > bit does and be consistent?
> >
> > I suggest that you make it NR, which allows for PROT_EXEC and
> > PROT_EXEC|PROT_WRITE and plain PROT_WRITE.  WX is of dubious value,
> > but I can imagine plain W being genuinely useful for logging and for
> > JITs that could maintain a W and a separate X mapping of some code.
> > In other words, with an NR bit, all 8 logical access modes are
> > possible.  Also, keeping the paging bits more orthogonal seems nice --
> > we already have a bit that controls write access.
>
> Sorry, yes the behavior of this bit needs to be documented a lot better. I will
> definitely do this for the next version.
>
> To clarify, since the EPT permissions in the XO/NR range are executable, and not
> readable or writeable the new bit really means XO, but only when NX is 0 since
> the guest page tables are being checked as well. When NR=1, W=1, and NX=0, the
> memory is still XO.
>
> NR was picked over XO because as you say. The idea is that it can be defined
> that in the case of KVM XO, NR and writable is not a valid combination, like
> writeable but not readable is defined as not valid for the EPT.
>

Ugh, I see, this is an "EPT Misconfiguration".  Oh, well.  I guess
just keep things as they are and document things better, please.
Don't try to emulate.

I don't suppose Intel could be convinced to get rid of that in a
future CPU and allow write-only memory?

BTW, is your patch checking for support in IA32_VMX_EPT_VPID_CAP?  I
didn't notice it, but I didn't look that hard.
Edgecombe, Rick P Oct. 7, 2019, 6:14 p.m. UTC | #6
On Fri, 2019-10-04 at 18:33 -0700, Andy Lutomirski wrote:
> On Fri, Oct 4, 2019 at 1:10 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> > 
> > On Fri, 2019-10-04 at 07:56 -0700, Andy Lutomirski wrote:
> > > On Thu, Oct 3, 2019 at 2:38 PM Rick Edgecombe
> > > <rick.p.edgecombe@intel.com> wrote:
> > > > 
> > > > This patchset enables the ability for KVM guests to create execute-only
> > > > (XO)
> > > > memory by utilizing EPT based XO permissions. XO memory is currently
> > > > supported
> > > > on Intel hardware natively for CPU's with PKU, but this enables it on
> > > > older
> > > > platforms, and can support XO for kernel memory as well.
> > > 
> > > The patchset seems to sometimes call this feature "XO" and sometimes
> > > call it "NR".  To me, XO implies no-read and no-write, whereas NR
> > > implies just no-read.  Can you please clarify *exactly* what the new
> > > bit does and be consistent?
> > > 
> > > I suggest that you make it NR, which allows for PROT_EXEC and
> > > PROT_EXEC|PROT_WRITE and plain PROT_WRITE.  WX is of dubious value,
> > > but I can imagine plain W being genuinely useful for logging and for
> > > JITs that could maintain a W and a separate X mapping of some code.
> > > In other words, with an NR bit, all 8 logical access modes are
> > > possible.  Also, keeping the paging bits more orthogonal seems nice --
> > > we already have a bit that controls write access.
> > 
> > Sorry, yes the behavior of this bit needs to be documented a lot better. I
> > will
> > definitely do this for the next version.
> > 
> > To clarify, since the EPT permissions in the XO/NR range are executable, and
> > not
> > readable or writeable the new bit really means XO, but only when NX is 0
> > since
> > the guest page tables are being checked as well. When NR=1, W=1, and NX=0,
> > the
> > memory is still XO.
> > 
> > NR was picked over XO because as you say. The idea is that it can be defined
> > that in the case of KVM XO, NR and writable is not a valid combination, like
> > writeable but not readable is defined as not valid for the EPT.
> > 
> 
> Ugh, I see, this is an "EPT Misconfiguration".  Oh, well.  I guess
> just keep things as they are and document things better, please.
> Don't try to emulate.

Ah, I see what you were thinking. Ok will do.

> I don't suppose Intel could be convinced to get rid of that in a
> future CPU and allow write-only memory?

Hmm, I'm not sure. I can try to pass it along.

> BTW, is your patch checking for support in IA32_VMX_EPT_VPID_CAP?  I
> didn't notice it, but I didn't look that hard.

Yep, there was already a helper: cpu_has_vmx_ept_execute_only().
Kees Cook Oct. 29, 2019, 11:40 p.m. UTC | #7
On Thu, Oct 03, 2019 at 02:23:47PM -0700, Rick Edgecombe wrote:
> larger follow on to this enables setting the kernel text as XO, but this is just

Is the kernel side series visible somewhere public yet?
Edgecombe, Rick P Oct. 30, 2019, 12:27 a.m. UTC | #8
On Tue, 2019-10-29 at 16:40 -0700, Kees Cook wrote:
> On Thu, Oct 03, 2019 at 02:23:47PM -0700, Rick Edgecombe wrote:
> > larger follow on to this enables setting the kernel text as XO, but this is
> > just
> 
> Is the kernel side series visible somewhere public yet?
> 
The POC from my Plumber's talk is up here:
https://github.com/redgecombe/linux/commits/exec_only

It doesn't work with this KVM series though as I made changes on the KVM side. I
don't consider it ready for posting on the list yet. Luckily though, PeterZ's
switching of ftrace to text_poke(), and your exception table patchset will make
it easier when the time comes.

Right now I am re-doing the KVM pieces to get rid of the memslot duplication. I
am ending up having to touch a lot more KVM mmu code, and it's taken some time
to work through. Then I wanted to get some more performance numbers before
dropping the RFC tag. So it may still be a bit before I can pick up the kernel
text piece again.