diff mbox series

[2/4] KVM: x86: Introduce paravirt feature CR0/CR4 pinning

Message ID 20200617190757.27081-3-john.s.andersen@intel.com
State New
Headers show
Series Paravirtualized Control Register pinning | expand

Commit Message

Andersen, John June 17, 2020, 7:07 p.m. UTC
Add a CR pin feature bit to the KVM cpuid. Add read only MSRs to KVM
which guests use to identify which bits they may request be pinned. Add
CR pinned MSRs to KVM. Allow guests to request that KVM pin certain
bits within control register 0 or 4 via the CR pinned MSRs. Writes to
the MSRs fail if they include bits which aren't allowed. Host userspace
may clear or modify pinned bits at any time. Once pinned bits are set,
the guest may pin additional allowed bits, but not clear any. Clear
pinning on vCPU reset.

In the event that the guest vCPU attempts to disable any of the pinned
bits, send that vCPU a general protection fault, and leave the register
unchanged. Clear pinning on vCPU reset to avoid faulting non-boot
CPUs when they are disabled and then re-enabled, which is done when
hibernating.

Pinning is not active when running in SMM. Entering SMM disables pinned
bits. Writes to control registers within SMM would therefore trigger
general protection faults if pinning was enforced. Upon exit from SMM,
SMRAM is modified to ensure the values of CR0/4 that will be restored
contain the correct values for pinned bits. CR0/4 values are then
restored from SMRAM as usual.

When running with nested virtualization, should pinned bits be cleared
from host VMCS / VMCB, on VM-Exit, they will be silently restored.

Should userspace expose the CR pinning CPUID feature bit, it must zero
CR pinned MSRs on reboot. If it does not, it runs the risk of having the
guest enable pinning and subsequently cause general protection faults on
next boot due to early boot code setting control registers to values
which do not contain the pinned bits. Userspace is responsible for
migrating the contents of the CR* pinned MSRs. If userspace fails to
migrate the MSRs the protection will no longer be active.

Pinning of sensitive CR bits has already been implemented to protect
against exploits directly calling native_write_cr*(). The current
protection cannot stop ROP attacks which jump directly to a MOV CR
instruction.

https://web.archive.org/web/20171029060939/http://www.blackbunny.io/linux-kernel-x86-64-bypass-smep-kaslr-kptr_restric/

Guests running with paravirtualized CR pinning can now be protected
against the use of ROP to disable CR bits. The same bits that are being
pinned natively may be pinned via the CR pinned MSRs. These bits are WP
in CR0, and SMEP, SMAP, and UMIP in CR4.

Other hypervisors such as HyperV have implemented similar protections
for Control Registers and MSRs; which security researchers have found
effective.

https://www.abatchy.com/2018/01/kernel-exploitation-4

Future work could implement similar MSRs to protect bits elsewhere, such
as MSRs. The NXE bit of the EFER MSR is a prime candidate.

Changes for QEMU are required to expose the CR pin cpuid feature bit. As
well as clear the MSRs on reboot and enable migration.

https://github.com/qemu/qemu/commit/1b26f03653669c97dd8729f9f59be561d68e2b2d
https://github.com/qemu/qemu/commit/3af36d57457892c3088ee88de759d4f258c159a7

Signed-off-by: John Andersen <john.s.andersen@intel.com>
---
 Documentation/virt/kvm/msr.rst       |  53 ++++++++++++++
 arch/x86/include/asm/kvm_host.h      |   7 ++
 arch/x86/include/uapi/asm/kvm_para.h |   7 ++
 arch/x86/kvm/cpuid.c                 |   3 +-
 arch/x86/kvm/emulate.c               |   3 +-
 arch/x86/kvm/kvm_emulate.h           |   2 +-
 arch/x86/kvm/svm/nested.c            |  11 ++-
 arch/x86/kvm/vmx/nested.c            |  10 ++-
 arch/x86/kvm/x86.c                   | 106 ++++++++++++++++++++++++++-
 9 files changed, 193 insertions(+), 9 deletions(-)

Comments

Dave Hansen June 18, 2020, 2:18 p.m. UTC | #1
On 6/17/20 12:07 PM, John Andersen wrote:
> +#define KVM_CR0_PIN_ALLOWED	(X86_CR0_WP)
> +#define KVM_CR4_PIN_ALLOWED	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)

Why *is* there an allowed set?  Why don't we just allow everything?

Shouldn't we also pin any unknown bits?  The CR4.FSGSBASE bit is an
example of something that showed up CPUs without Linux knowing about it.
 If set, it causes problems.  This set couldn't have helped FSGSBASE
because it is not in the allowed set.

Let's say Intel loses its marbles and adds a CR4 bit that lets userspace
write to kernel memory.  Linux won't set it, but an attacker would go
after it, first thing.
Andersen, John June 18, 2020, 2:43 p.m. UTC | #2
On Thu, Jun 18, 2020 at 07:18:09AM -0700, Dave Hansen wrote:
> On 6/17/20 12:07 PM, John Andersen wrote:
> > +#define KVM_CR0_PIN_ALLOWED	(X86_CR0_WP)
> > +#define KVM_CR4_PIN_ALLOWED	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)
> 
> Why *is* there an allowed set?  Why don't we just allow everything?
> 
> Shouldn't we also pin any unknown bits?  The CR4.FSGSBASE bit is an
> example of something that showed up CPUs without Linux knowing about it.
>  If set, it causes problems.  This set couldn't have helped FSGSBASE
> because it is not in the allowed set.
> 
> Let's say Intel loses its marbles and adds a CR4 bit that lets userspace
> write to kernel memory.  Linux won't set it, but an attacker would go
> after it, first thing.

The allowed set came about because there were comments from internal review
where it was said that allowing the guest to pin TS and MP adds unnecessary
complexity.

Also because KVM always intercepts these bits via the CR0/4_GUEST_HOST_MASK. If
we allow setting of any bits, then we have to add some infrastructure for
modifying the mask when pinned bits are updated. I have a patch for that if we
want to go that route, but it doesn't account for the added complexity
mentioned above.
Dave Hansen June 18, 2020, 2:51 p.m. UTC | #3
On 6/18/20 7:43 AM, Andersen, John wrote:
> On Thu, Jun 18, 2020 at 07:18:09AM -0700, Dave Hansen wrote:
>> On 6/17/20 12:07 PM, John Andersen wrote:
>>> +#define KVM_CR0_PIN_ALLOWED	(X86_CR0_WP)
>>> +#define KVM_CR4_PIN_ALLOWED	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)
>>
>> Why *is* there an allowed set?  Why don't we just allow everything?
>>
>> Shouldn't we also pin any unknown bits?  The CR4.FSGSBASE bit is an
>> example of something that showed up CPUs without Linux knowing about it.
>>  If set, it causes problems.  This set couldn't have helped FSGSBASE
>> because it is not in the allowed set.
>>
>> Let's say Intel loses its marbles and adds a CR4 bit that lets userspace
>> write to kernel memory.  Linux won't set it, but an attacker would go
>> after it, first thing.
> 
> The allowed set came about because there were comments from internal review
> where it was said that allowing the guest to pin TS and MP adds unnecessary
> complexity.

That would have been a great design point to include in the changelog.
 Could you make sure it shows up in future versions.

> Also because KVM always intercepts these bits via the CR0/4_GUEST_HOST_MASK. If
> we allow setting of any bits, then we have to add some infrastructure for
> modifying the mask when pinned bits are updated. I have a patch for that if we
> want to go that route, but it doesn't account for the added complexity
> mentioned above.

Well, we have a current, known issue (FSGSBASE) which shows how dealing
with guest-unknown bits is required.  To me, that overrules complexity
arguments to a large degree.
Sean Christopherson July 7, 2020, 9:12 p.m. UTC | #4
On Thu, Jun 18, 2020 at 07:51:10AM -0700, Dave Hansen wrote:
> On 6/18/20 7:43 AM, Andersen, John wrote:
> > On Thu, Jun 18, 2020 at 07:18:09AM -0700, Dave Hansen wrote:
> >> On 6/17/20 12:07 PM, John Andersen wrote:
> >>> +#define KVM_CR0_PIN_ALLOWED	(X86_CR0_WP)
> >>> +#define KVM_CR4_PIN_ALLOWED	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)
> >>
> >> Why *is* there an allowed set?  Why don't we just allow everything?
> >>
> >> Shouldn't we also pin any unknown bits?  The CR4.FSGSBASE bit is an
> >> example of something that showed up CPUs without Linux knowing about it.
> >>  If set, it causes problems.  This set couldn't have helped FSGSBASE
> >> because it is not in the allowed set.
> >>
> >> Let's say Intel loses its marbles and adds a CR4 bit that lets userspace
> >> write to kernel memory.  Linux won't set it, but an attacker would go
> >> after it, first thing.

That's an orthogonal to pinning.  KVM never lets the guest set CR4 bits that
are unknown to KVM.  Supporting CR4.NO_MARBLES would require an explicit KVM
change to allow it to be set by the guest, and would also require a userspace
VMM to expose NO_MARBLES to the guest.

That being said, this series should supporting pinning as much as possible,
i.e. if the bit can be exposed to the guest and doesn't require special
handling in KVM, allow it to be pinned.  E.g. TS is a special case because
pinning would require additional emulator support and IMO isn't interesting
enough to justify the extra complexity.  At a glance, I don't see anything
that would prevent pinning FSGSBASE.
Dave Hansen July 7, 2020, 9:48 p.m. UTC | #5
On 7/7/20 2:12 PM, Sean Christopherson wrote:
>>>> Let's say Intel loses its marbles and adds a CR4 bit that lets userspace
>>>> write to kernel memory.  Linux won't set it, but an attacker would go
>>>> after it, first thing.
> That's an orthogonal to pinning.  KVM never lets the guest set CR4 bits that
> are unknown to KVM.  Supporting CR4.NO_MARBLES would require an explicit KVM
> change to allow it to be set by the guest, and would also require a userspace
> VMM to expose NO_MARBLES to the guest.
> 
> That being said, this series should supporting pinning as much as possible,
> i.e. if the bit can be exposed to the guest and doesn't require special
> handling in KVM, allow it to be pinned.  E.g. TS is a special case because
> pinning would require additional emulator support and IMO isn't interesting
> enough to justify the extra complexity.  At a glance, I don't see anything
> that would prevent pinning FSGSBASE.

Thanks for filling in the KVM picture.

If we're supporting as much pinning as possible, can we also add
something to make it inconvenient for someone to both make a CR4 bit
known to KVM *and* ignore the pinning aspects?

We should really make folks think about it.  Something like:

#define KVM_CR4_KNOWN 0xff
#define KVM_CR4_PIN_ALLOWED 0xf0
#define KVM_CR4_PIN_NOT_ALLOWED 0x0f

BUILD_BUG_ON(KVM_CR4_KNOWN !=
             (KVM_CR4_PIN_ALLOWED|KVM_CR4_PIN_NOT_ALLOWED));

So someone *MUST* make an active declaration about new bits being pinned
or not?
Paolo Bonzini July 7, 2020, 9:51 p.m. UTC | #6
On 07/07/20 23:48, Dave Hansen wrote:
> On 7/7/20 2:12 PM, Sean Christopherson wrote:
>>>>> Let's say Intel loses its marbles and adds a CR4 bit that lets userspace
>>>>> write to kernel memory.  Linux won't set it, but an attacker would go
>>>>> after it, first thing.
>> That's an orthogonal to pinning.  KVM never lets the guest set CR4 bits that
>> are unknown to KVM.  Supporting CR4.NO_MARBLES would require an explicit KVM
>> change to allow it to be set by the guest, and would also require a userspace
>> VMM to expose NO_MARBLES to the guest.
>>
>> That being said, this series should supporting pinning as much as possible,
>> i.e. if the bit can be exposed to the guest and doesn't require special
>> handling in KVM, allow it to be pinned.  E.g. TS is a special case because
>> pinning would require additional emulator support and IMO isn't interesting
>> enough to justify the extra complexity.  At a glance, I don't see anything
>> that would prevent pinning FSGSBASE.
> 
> Thanks for filling in the KVM picture.
> 
> If we're supporting as much pinning as possible, can we also add
> something to make it inconvenient for someone to both make a CR4 bit
> known to KVM *and* ignore the pinning aspects?
> 
> We should really make folks think about it.  Something like:
> 
> #define KVM_CR4_KNOWN 0xff
> #define KVM_CR4_PIN_ALLOWED 0xf0
> #define KVM_CR4_PIN_NOT_ALLOWED 0x0f
> 
> BUILD_BUG_ON(KVM_CR4_KNOWN !=
>              (KVM_CR4_PIN_ALLOWED|KVM_CR4_PIN_NOT_ALLOWED));
> 
> So someone *MUST* make an active declaration about new bits being pinned
> or not?

I would just make all unknown bits pinnable (or perhaps all CR4 bits in
general).

Paolo
Andersen, John July 9, 2020, 3:44 p.m. UTC | #7
On Tue, Jul 07, 2020 at 11:51:54PM +0200, Paolo Bonzini wrote:
> On 07/07/20 23:48, Dave Hansen wrote:
> > On 7/7/20 2:12 PM, Sean Christopherson wrote:
> >>>>> Let's say Intel loses its marbles and adds a CR4 bit that lets userspace
> >>>>> write to kernel memory.  Linux won't set it, but an attacker would go
> >>>>> after it, first thing.
> >> That's an orthogonal to pinning.  KVM never lets the guest set CR4 bits that
> >> are unknown to KVM.  Supporting CR4.NO_MARBLES would require an explicit KVM
> >> change to allow it to be set by the guest, and would also require a userspace
> >> VMM to expose NO_MARBLES to the guest.
> >>
> >> That being said, this series should supporting pinning as much as possible,
> >> i.e. if the bit can be exposed to the guest and doesn't require special
> >> handling in KVM, allow it to be pinned.  E.g. TS is a special case because
> >> pinning would require additional emulator support and IMO isn't interesting
> >> enough to justify the extra complexity.  At a glance, I don't see anything
> >> that would prevent pinning FSGSBASE.
> > 
> > Thanks for filling in the KVM picture.
> > 
> > If we're supporting as much pinning as possible, can we also add
> > something to make it inconvenient for someone to both make a CR4 bit
> > known to KVM *and* ignore the pinning aspects?
> > 
> > We should really make folks think about it.  Something like:
> > 
> > #define KVM_CR4_KNOWN 0xff
> > #define KVM_CR4_PIN_ALLOWED 0xf0
> > #define KVM_CR4_PIN_NOT_ALLOWED 0x0f
> > 
> > BUILD_BUG_ON(KVM_CR4_KNOWN !=
> >              (KVM_CR4_PIN_ALLOWED|KVM_CR4_PIN_NOT_ALLOWED));
> > 
> > So someone *MUST* make an active declaration about new bits being pinned
> > or not?
> 
> I would just make all unknown bits pinnable (or perhaps all CR4 bits in
> general).
> 

Sounds good. I'll make it this way in the next revision. I'll do the same for
CR0 (unless I hear otherwise). I've added the last paragraph here under the
ALLOWED MSRs data section.

data:
        Bits which may be pinned.

        Attempting to pin bits other than these will result in a failure when
        writing to the respective CR pinned MSR.

        Bits which are allowed to be pinned default to WP for CR0 and SMEP,
        SMAP, and UMIP for CR4.

        The host VMM may modify the set of allowed bits. However, only the above
        have been tested to work. Allowing the guest to pin other bits may or
        may not be compatible with KVM.
Dave Hansen July 9, 2020, 3:56 p.m. UTC | #8
On 7/9/20 8:44 AM, Andersen, John wrote:
> 
>         Bits which are allowed to be pinned default to WP for CR0 and SMEP,
>         SMAP, and UMIP for CR4.

I think it also makes sense to have FSGSBASE in this set.

I know it hasn't been tested, but I think we should do the legwork to
test it.  If not in this set, can we agree that it's a logical next step?
Dave Hansen July 9, 2020, 4:22 p.m. UTC | #9
On 7/9/20 9:07 AM, Andy Lutomirski wrote:
> On Thu, Jul 9, 2020 at 8:56 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 7/9/20 8:44 AM, Andersen, John wrote:
>>>         Bits which are allowed to be pinned default to WP for CR0 and SMEP,
>>>         SMAP, and UMIP for CR4.
>> I think it also makes sense to have FSGSBASE in this set.
>>
>> I know it hasn't been tested, but I think we should do the legwork to
>> test it.  If not in this set, can we agree that it's a logical next step?
> I have no objection to pinning FSGSBASE, but is there a clear
> description of the threat model that this whole series is meant to
> address?  The idea is to provide a degree of protection against an
> attacker who is able to convince a guest kernel to write something
> inappropriate to CR4, right?  How realistic is this?

If a quick search can find this:

> https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html

I'd pretty confident that the guys doing actual bad things have it in
their toolbox too.
Kees Cook July 9, 2020, 11:37 p.m. UTC | #10
On Thu, Jul 09, 2020 at 09:22:09AM -0700, Dave Hansen wrote:
> On 7/9/20 9:07 AM, Andy Lutomirski wrote:
> > On Thu, Jul 9, 2020 at 8:56 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >> On 7/9/20 8:44 AM, Andersen, John wrote:
> >>>         Bits which are allowed to be pinned default to WP for CR0 and SMEP,
> >>>         SMAP, and UMIP for CR4.
> >> I think it also makes sense to have FSGSBASE in this set.
> >>
> >> I know it hasn't been tested, but I think we should do the legwork to
> >> test it.  If not in this set, can we agree that it's a logical next step?
> > I have no objection to pinning FSGSBASE, but is there a clear
> > description of the threat model that this whole series is meant to
> > address?  The idea is to provide a degree of protection against an
> > attacker who is able to convince a guest kernel to write something
> > inappropriate to CR4, right?  How realistic is this?
> 
> If a quick search can find this:
> 
> > https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
> 
> I'd pretty confident that the guys doing actual bad things have it in
> their toolbox too.

Right, it's common (see my commit log in 873d50d58f67), and having this
enforced by the hypervisor is WAY better since it'll block gadgets or
ROP.
Andersen, John July 14, 2020, 5:36 a.m. UTC | #11
On Thu, Jul 09, 2020 at 09:27:43AM -0700, Andy Lutomirski wrote:
> On Thu, Jul 9, 2020 at 9:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 7/9/20 9:07 AM, Andy Lutomirski wrote:
> > > On Thu, Jul 9, 2020 at 8:56 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > >> On 7/9/20 8:44 AM, Andersen, John wrote:
> > >>>         Bits which are allowed to be pinned default to WP for CR0 and SMEP,
> > >>>         SMAP, and UMIP for CR4.
> > >> I think it also makes sense to have FSGSBASE in this set.
> > >>
> > >> I know it hasn't been tested, but I think we should do the legwork to
> > >> test it.  If not in this set, can we agree that it's a logical next step?
> > > I have no objection to pinning FSGSBASE, but is there a clear
> > > description of the threat model that this whole series is meant to
> > > address?  The idea is to provide a degree of protection against an
> > > attacker who is able to convince a guest kernel to write something
> > > inappropriate to CR4, right?  How realistic is this?
> >
> > If a quick search can find this:
> >
> > > https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
> >
> > I'd pretty confident that the guys doing actual bad things have it in
> > their toolbox too.
> >
> 
> True, but we have the existing software CR4 pinning.  I suppose the
> virtualization version is stronger.
> 

Yes, as Kees said this will be stronger because it stops ROP and other gadget
based techniques which avoid the use of native_write_cr0/4().

With regards to what should be done in this patchset and what in other
patchsets. I have a fix for kexec thanks to Arvind's note about
TRAMPOLINE_32BIT_CODE_SIZE. The physical host boots fine now and the virtual
one can kexec fine.

What remains to be done on that front is to add some identifying information to
the kernel image to declare that it supports paravirtualized control register
pinning or not.

Liran suggested adding a section to the built image acting as a flag to signify
support for being kexec'd by a kernel with pinning enabled. If anyone has any
opinions on how they'd like to see this implemented please let me know.
Otherwise I'll just take a stab at it and you'll all see it hopefully in the
next version.

With regards to FSGSBASE, are we open to validating and adding that to the
DEFAULT set as a part of a separate patchset? This patchset is focused on
replicating the functionality we already have natively.
Andersen, John July 14, 2020, 5:39 a.m. UTC | #12
On Thu, Jul 09, 2020 at 09:27:43AM -0700, Andy Lutomirski wrote:
> On Thu, Jul 9, 2020 at 9:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 7/9/20 9:07 AM, Andy Lutomirski wrote:
> > > On Thu, Jul 9, 2020 at 8:56 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > >> On 7/9/20 8:44 AM, Andersen, John wrote:
> > >>>         Bits which are allowed to be pinned default to WP for CR0 and SMEP,
> > >>>         SMAP, and UMIP for CR4.
> > >> I think it also makes sense to have FSGSBASE in this set.
> > >>
> > >> I know it hasn't been tested, but I think we should do the legwork to
> > >> test it.  If not in this set, can we agree that it's a logical next step?
> > > I have no objection to pinning FSGSBASE, but is there a clear
> > > description of the threat model that this whole series is meant to
> > > address?  The idea is to provide a degree of protection against an
> > > attacker who is able to convince a guest kernel to write something
> > > inappropriate to CR4, right?  How realistic is this?
> >
> > If a quick search can find this:
> >
> > > https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
> >
> > I'd pretty confident that the guys doing actual bad things have it in
> > their toolbox too.
> >
> 
> True, but we have the existing software CR4 pinning.  I suppose the
> virtualization version is stronger.
> 

Yes, as Kees said this will be stronger because it stops ROP and other gadget
based techniques which avoid the use of native_write_cr0/4().

With regards to what should be done in this patchset and what in other
patchsets. I have a fix for kexec thanks to Arvind's note about
TRAMPOLINE_32BIT_CODE_SIZE. The physical host boots fine now and the virtual
one can kexec fine.

What remains to be done on that front is to add some identifying information to
the kernel image to declare that it supports paravirtualized control register
pinning or not.

Liran suggested adding a section to the built image acting as a flag to signify
support for being kexec'd by a kernel with pinning enabled. If anyone has any
opinions on how they'd like to see this implemented please let me know.
Otherwise I'll just take a stab at it and you'll all see it hopefully in the
next version.

With regards to FSGSBASE, are we open to validating and adding that to the
DEFAULT set as a part of a separate patchset? This patchset is focused on
replicating the functionality we already have natively.


(If anyone got this email twice, sorry I messed up the From: field the first
time around)
Sean Christopherson July 15, 2020, 4:41 a.m. UTC | #13
On Tue, Jul 14, 2020 at 05:39:30AM +0000, Andersen, John wrote:
> With regards to FSGSBASE, are we open to validating and adding that to the
> DEFAULT set as a part of a separate patchset? This patchset is focused on
> replicating the functionality we already have natively.

Kees added FSGSBASE pinning in commit a13b9d0b97211 ("x86/cpu: Use pinning
mask for CR4 bits needing to be 0"), so I believe it's a done deal already.
Andersen, John July 15, 2020, 7:58 p.m. UTC | #14
On Tue, Jul 14, 2020 at 09:41:29PM -0700, Sean Christopherson wrote:
> On Tue, Jul 14, 2020 at 05:39:30AM +0000, Andersen, John wrote:
> > With regards to FSGSBASE, are we open to validating and adding that to the
> > DEFAULT set as a part of a separate patchset? This patchset is focused on
> > replicating the functionality we already have natively.
> 
> Kees added FSGSBASE pinning in commit a13b9d0b97211 ("x86/cpu: Use pinning
> mask for CR4 bits needing to be 0"), so I believe it's a done deal already.

Ah my bad. Thanks, I'll look into it.
diff mbox series

Patch

diff --git a/Documentation/virt/kvm/msr.rst b/Documentation/virt/kvm/msr.rst
index e37a14c323d2..9fa43c4a5895 100644
--- a/Documentation/virt/kvm/msr.rst
+++ b/Documentation/virt/kvm/msr.rst
@@ -376,3 +376,56 @@  data:
 	write '1' to bit 0 of the MSR, this causes the host to re-scan its queue
 	and check if there are more notifications pending. The MSR is available
 	if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.
+
+MSR_KVM_CR0_PIN_ALLOWED:
+	0x4b564d08
+MSR_KVM_CR4_PIN_ALLOWED:
+	0x4b564d09
+
+	Read only registers informing the guest which bits may be pinned for
+	each control register respectively via the CR pinned MSRs.
+
+data:
+	Bits which may be pinned.
+
+	Attempting to pin bits other than these will result in a failure when
+	writing to the respective CR pinned MSR.
+
+	Bits which are allowed to be pinned are WP for CR0 and SMEP, SMAP, and
+	UMIP for CR4.
+
+MSR_KVM_CR0_PINNED_LOW:
+	0x4b564d0a
+MSR_KVM_CR0_PINNED_HIGH:
+	0x4b564d0b
+MSR_KVM_CR4_PINNED_LOW:
+	0x4b564d0c
+MSR_KVM_CR4_PINNED_HIGH:
+	0x4b564d0d
+
+	Used to configure pinned bits in control registers
+
+data:
+	Bits to be pinned.
+
+	Fails if data contains bits which are not allowed to be pinned. Or if
+	attempting to pin bits high that are already pinned low, or vice versa.
+	Bits which are allowed to be pinned can be found by reading the CR pin
+	allowed MSRs.
+
+	The MSRs are read/write for host userspace, and write-only for the
+	guest.
+
+	Once set to a non-zero value, the guest cannot clear any of the bits
+	that have been pinned. The guest can pin more bits, so long as those
+	bits appear in the allowed MSR, and are not already pinned to the
+	opposite value.
+
+	Host userspace may clear or change pinned bits at any point. Host
+	userspace must clear pinned bits on reboot.
+
+	The MSR enables bit pinning for control registers. Pinning is active
+	when the guest is not in SMM. If an exit from SMM results in pinned
+	bits becoming unpinned. The guest will exit. If the guest attempts to
+	write values to cr* where bits differ from pinned bits, the write will
+	fail and the guest will be sent a general protection fault.
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8998e97457f..962cb48535d4 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -565,6 +565,11 @@  struct kvm_vcpu_hv {
 	cpumask_t tlb_flush;
 };
 
+struct kvm_vcpu_cr_pinning {
+	unsigned long high;
+	unsigned long low;
+};
+
 struct kvm_vcpu_arch {
 	/*
 	 * rip and regs accesses must go through
@@ -576,10 +581,12 @@  struct kvm_vcpu_arch {
 
 	unsigned long cr0;
 	unsigned long cr0_guest_owned_bits;
+	struct kvm_vcpu_cr_pinning cr0_pinned;
 	unsigned long cr2;
 	unsigned long cr3;
 	unsigned long cr4;
 	unsigned long cr4_guest_owned_bits;
+	struct kvm_vcpu_cr_pinning cr4_pinned;
 	unsigned long cr8;
 	u32 host_pkru;
 	u32 pkru;
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 812e9b4c1114..91241e0d9691 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -32,6 +32,7 @@ 
 #define KVM_FEATURE_POLL_CONTROL	12
 #define KVM_FEATURE_PV_SCHED_YIELD	13
 #define KVM_FEATURE_ASYNC_PF_INT	14
+#define KVM_FEATURE_CR_PIN		15
 
 #define KVM_HINTS_REALTIME      0
 
@@ -53,6 +54,12 @@ 
 #define MSR_KVM_POLL_CONTROL	0x4b564d05
 #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
 #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
+#define MSR_KVM_CR0_PIN_ALLOWED	0x4b564d08
+#define MSR_KVM_CR4_PIN_ALLOWED	0x4b564d09
+#define MSR_KVM_CR0_PINNED_LOW	0x4b564d0a
+#define MSR_KVM_CR0_PINNED_HIGH	0x4b564d0b
+#define MSR_KVM_CR4_PINNED_LOW	0x4b564d0c
+#define MSR_KVM_CR4_PINNED_HIGH	0x4b564d0d
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8a294f9747aa..bb0ed645324d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -716,7 +716,8 @@  static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 			     (1 << KVM_FEATURE_PV_SEND_IPI) |
 			     (1 << KVM_FEATURE_POLL_CONTROL) |
 			     (1 << KVM_FEATURE_PV_SCHED_YIELD) |
-			     (1 << KVM_FEATURE_ASYNC_PF_INT);
+			     (1 << KVM_FEATURE_ASYNC_PF_INT) |
+			     (1 << KVM_FEATURE_CR_PIN);
 
 		if (sched_info_on())
 			entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d0e2825ae617..95780422765b 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2685,7 +2685,8 @@  static int em_rsm(struct x86_emulate_ctxt *ctxt)
 		return X86EMUL_UNHANDLEABLE;
 	}
 
-	ctxt->ops->post_leave_smm(ctxt);
+	if (ctxt->ops->post_leave_smm(ctxt))
+		return X86EMUL_UNHANDLEABLE;
 
 	return X86EMUL_CONTINUE;
 }
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 43c93ffa76ed..e92dd7605e48 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -232,7 +232,7 @@  struct x86_emulate_ops {
 	void (*set_hflags)(struct x86_emulate_ctxt *ctxt, unsigned hflags);
 	int (*pre_leave_smm)(struct x86_emulate_ctxt *ctxt,
 			     const char *smstate);
-	void (*post_leave_smm)(struct x86_emulate_ctxt *ctxt);
+	int (*post_leave_smm)(struct x86_emulate_ctxt *ctxt);
 	int (*set_xcr)(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr);
 };
 
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 6bceafb19108..245bdff4b052 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -583,8 +583,15 @@  int nested_svm_vmexit(struct vcpu_svm *svm)
 	svm->vmcb->save.idtr = hsave->save.idtr;
 	kvm_set_rflags(&svm->vcpu, hsave->save.rflags);
 	svm_set_efer(&svm->vcpu, hsave->save.efer);
-	svm_set_cr0(&svm->vcpu, hsave->save.cr0 | X86_CR0_PE);
-	svm_set_cr4(&svm->vcpu, hsave->save.cr4);
+	svm_set_cr0(&svm->vcpu,
+		    (hsave->save.cr0 |
+		    X86_CR0_PE |
+		    svm->vcpu.arch.cr0_pinned.high) &
+		    ~svm->vcpu.arch.cr0_pinned.low);
+	svm_set_cr4(&svm->vcpu,
+		    (hsave->save.cr4 |
+		    svm->vcpu.arch.cr4_pinned.high) &
+		    ~svm->vcpu.arch.cr4_pinned.low);
 	if (npt_enabled) {
 		svm->vmcb->save.cr3 = hsave->save.cr3;
 		svm->vcpu.arch.cr3 = hsave->save.cr3;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index adb11b504d5c..a12bac57b374 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4110,11 +4110,17 @@  static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	 * (KVM doesn't change it);
 	 */
 	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
-	vmx_set_cr0(vcpu, vmcs12->host_cr0);
+	vmx_set_cr0(vcpu,
+		    (vmcs12->host_cr0 |
+		     vcpu->arch.cr0_pinned.high) &
+		    ~vcpu->arch.cr0_pinned.low);
 
 	/* Same as above - no reason to call set_cr4_guest_host_mask().  */
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
-	vmx_set_cr4(vcpu, vmcs12->host_cr4);
+	vmx_set_cr4(vcpu,
+		   (vmcs12->host_cr4 |
+		    vcpu->arch.cr4_pinned.high) &
+		    ~vcpu->arch.cr4_pinned.low);
 
 	nested_ept_uninit_mmu_context(vcpu);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00c88c2f34e4..940de9a968bd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -772,6 +772,9 @@  bool pdptrs_changed(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(pdptrs_changed);
 
+#define KVM_CR0_PIN_ALLOWED	(X86_CR0_WP)
+#define KVM_CR4_PIN_ALLOWED	(X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_UMIP)
+
 int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 {
 	unsigned long old_cr0 = kvm_read_cr0(vcpu);
@@ -792,6 +795,11 @@  int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	if ((cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PE))
 		return 1;
 
+	if (!is_smm(vcpu) && !is_guest_mode(vcpu) &&
+	    (((cr0 ^ vcpu->arch.cr0_pinned.high) & vcpu->arch.cr0_pinned.high) ||
+	    ((~cr0 ^ vcpu->arch.cr0_pinned.low) & vcpu->arch.cr0_pinned.low)))
+		return 1;
+
 	if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
 #ifdef CONFIG_X86_64
 		if ((vcpu->arch.efer & EFER_LME)) {
@@ -972,6 +980,11 @@  int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	if (kvm_valid_cr4(vcpu, cr4))
 		return 1;
 
+	if (!is_smm(vcpu) && !is_guest_mode(vcpu) &&
+	    (((cr4 ^ vcpu->arch.cr4_pinned.high) & vcpu->arch.cr4_pinned.high) ||
+	    ((~cr4 ^ vcpu->arch.cr4_pinned.low) & vcpu->arch.cr4_pinned.low)))
+		return 1;
+
 	if (is_long_mode(vcpu)) {
 		if (!(cr4 & X86_CR4_PAE))
 			return 1;
@@ -1291,6 +1304,12 @@  static const u32 emulated_msrs_all[] = {
 
 	MSR_K7_HWCR,
 	MSR_KVM_POLL_CONTROL,
+	MSR_KVM_CR0_PIN_ALLOWED,
+	MSR_KVM_CR4_PIN_ALLOWED,
+	MSR_KVM_CR0_PINNED_LOW,
+	MSR_KVM_CR0_PINNED_HIGH,
+	MSR_KVM_CR4_PINNED_LOW,
+	MSR_KVM_CR4_PINNED_HIGH,
 };
 
 static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -2986,6 +3005,54 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vcpu->arch.msr_kvm_poll_control = data;
 		break;
 
+	case MSR_KVM_CR0_PIN_ALLOWED:
+		return (data != KVM_CR0_PIN_ALLOWED);
+	case MSR_KVM_CR4_PIN_ALLOWED:
+		return (data != KVM_CR4_PIN_ALLOWED);
+	case MSR_KVM_CR0_PINNED_LOW:
+		if ((data & ~KVM_CR0_PIN_ALLOWED) ||
+		    (data & vcpu->arch.cr0_pinned.high))
+			return 1;
+
+		if (!msr_info->host_initiated &&
+		    (~data & vcpu->arch.cr0_pinned.low))
+			return 1;
+
+		vcpu->arch.cr0_pinned.low = data;
+		break;
+	case MSR_KVM_CR0_PINNED_HIGH:
+		if ((data & ~KVM_CR0_PIN_ALLOWED) ||
+		    (data & vcpu->arch.cr0_pinned.low))
+			return 1;
+
+		if (!msr_info->host_initiated &&
+		    (~data & vcpu->arch.cr0_pinned.high))
+			return 1;
+
+		vcpu->arch.cr0_pinned.high = data;
+		break;
+	case MSR_KVM_CR4_PINNED_LOW:
+		if ((data & ~KVM_CR4_PIN_ALLOWED) ||
+		    (data & vcpu->arch.cr4_pinned.high))
+			return 1;
+
+		if (!msr_info->host_initiated &&
+		    (~data & vcpu->arch.cr4_pinned.low))
+			return 1;
+
+		vcpu->arch.cr4_pinned.low = data;
+		break;
+	case MSR_KVM_CR4_PINNED_HIGH:
+		if ((data & ~KVM_CR4_PIN_ALLOWED) ||
+		    (data & vcpu->arch.cr4_pinned.low))
+			return 1;
+
+		if (!msr_info->host_initiated &&
+		    (~data & vcpu->arch.cr4_pinned.high))
+			return 1;
+
+		vcpu->arch.cr4_pinned.high = data;
+		break;
 	case MSR_IA32_MCG_CTL:
 	case MSR_IA32_MCG_STATUS:
 	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
@@ -3250,6 +3317,24 @@  int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KVM_POLL_CONTROL:
 		msr_info->data = vcpu->arch.msr_kvm_poll_control;
 		break;
+	case MSR_KVM_CR0_PIN_ALLOWED:
+		msr_info->data = KVM_CR0_PIN_ALLOWED;
+		break;
+	case MSR_KVM_CR4_PIN_ALLOWED:
+		msr_info->data = KVM_CR4_PIN_ALLOWED;
+		break;
+	case MSR_KVM_CR0_PINNED_LOW:
+		msr_info->data = vcpu->arch.cr0_pinned.low;
+		break;
+	case MSR_KVM_CR0_PINNED_HIGH:
+		msr_info->data = vcpu->arch.cr0_pinned.high;
+		break;
+	case MSR_KVM_CR4_PINNED_LOW:
+		msr_info->data = vcpu->arch.cr4_pinned.low;
+		break;
+	case MSR_KVM_CR4_PINNED_HIGH:
+		msr_info->data = vcpu->arch.cr4_pinned.high;
+		break;
 	case MSR_IA32_P5_MC_ADDR:
 	case MSR_IA32_P5_MC_TYPE:
 	case MSR_IA32_MCG_CAP:
@@ -6414,9 +6499,23 @@  static int emulator_pre_leave_smm(struct x86_emulate_ctxt *ctxt,
 	return kvm_x86_ops.pre_leave_smm(emul_to_vcpu(ctxt), smstate);
 }
 
-static void emulator_post_leave_smm(struct x86_emulate_ctxt *ctxt)
+static int emulator_post_leave_smm(struct x86_emulate_ctxt *ctxt)
 {
-	kvm_smm_changed(emul_to_vcpu(ctxt));
+	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
+	unsigned long cr0 = kvm_read_cr0(vcpu);
+	unsigned long cr4 = kvm_read_cr4(vcpu);
+
+	if (((cr0 ^ vcpu->arch.cr0_pinned.high) & vcpu->arch.cr0_pinned.high) ||
+	    ((~cr0 ^ vcpu->arch.cr0_pinned.low) & vcpu->arch.cr0_pinned.low))
+		return 1;
+
+	if (((cr4 ^ vcpu->arch.cr4_pinned.high) & vcpu->arch.cr4_pinned.high) ||
+	    ((~cr4 ^ vcpu->arch.cr4_pinned.low) & vcpu->arch.cr4_pinned.low))
+		return 1;
+
+	kvm_smm_changed(vcpu);
+
+	return 0;
 }
 
 static int emulator_set_xcr(struct x86_emulate_ctxt *ctxt, u32 index, u64 xcr)
@@ -9640,6 +9739,9 @@  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 
 	vcpu->arch.ia32_xss = 0;
 
+	memset(&vcpu->arch.cr0_pinned, 0, sizeof(vcpu->arch.cr0_pinned));
+	memset(&vcpu->arch.cr4_pinned, 0, sizeof(vcpu->arch.cr4_pinned));
+
 	kvm_x86_ops.vcpu_reset(vcpu, init_event);
 }