Message ID | cover.1600114548.git.thomas.lendacky@amd.com (mailing list archive)
---|---
Series | SEV-ES hypervisor support
On Mon, Sep 14, 2020 at 03:15:14PM -0500, Tom Lendacky wrote:
> From: Tom Lendacky <thomas.lendacky@amd.com>
>
> This patch series provides support for running SEV-ES guests under KVM.

From the x86/VMX side of things, the GPR hooks are the only changes that I strongly dislike.

For the vmsa_encrypted flag and related things like allow_debug(), I'd really like to aim for a common implementation between SEV-ES and TDX[*] from the get go, within reason obviously. From a code perspective, I don't think it will be too onerous as the basic tenets are quite similar, e.g. guest state is off limits, FPU state is autoswitched, etc..., but I suspect (or maybe worry?) that there are enough minor differences that we'll want a more generic way of marking ioctls() as disallowed to avoid having one-off checks all over the place.

That being said, it may also be that there are some ioctls() that should be disallowed under SEV-ES, but aren't in this series. E.g. I assume kvm_vcpu_ioctl_smi() should be rejected as KVM can't do the necessary emulation (I assume this applies to vanilla SEV as well?).

One thought to try and reconcile the differences between SEV-ES and TDX would be to explicitly list which ioctls() are and aren't supported and go from there? E.g. if there is 95% overlap then we probably don't need to get fancy with generic allow/deny.

Given that we don't yet have publicly available KVM code for TDX, what if I generate and post a list of ioctls() that are denied by either SEV-ES or TDX, organized by the denier(s)? Then for the ioctls() that are denied by one and not the other, we add a brief explanation of why it's denied?

If that sounds ok, I'll get the list and the TDX side of things posted tomorrow.

Thanks!

[*] https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
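To make the "generic allow/deny" idea above concrete, a minimal standalone C sketch (not code from this series or from KVM; the vm_type values, ioctl identifiers and table contents are hypothetical): one per-VM-type table consulted at the top of the ioctl path instead of one-off checks in every handler.

/*
 * Standalone sketch of a generic allow/deny table, not KVM code.
 * The VM types, ioctl identifiers and table contents are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum vm_type { VM_TYPE_LEGACY, VM_TYPE_SEV_ES, VM_TYPE_TDX, VM_TYPE_MAX };

/* Hypothetical ioctl identifiers, for illustration only. */
#define X86_IOCTL_SMI           1
#define X86_IOCTL_GET_REGS      2

struct ioctl_policy {
        unsigned int nr;                /* ioctl being checked */
        bool denied[VM_TYPE_MAX];       /* per-VM-type deny flag */
};

static const struct ioctl_policy policies[] = {
        /*                        LEGACY  SEV-ES  TDX   */
        { X86_IOCTL_SMI,        { false,  true,   true } },
        { X86_IOCTL_GET_REGS,   { false,  false,  true } },
};

/* One table lookup instead of one-off checks scattered across handlers. */
static int ioctl_allowed(enum vm_type type, unsigned int nr)
{
        size_t i;

        for (i = 0; i < sizeof(policies) / sizeof(policies[0]); i++)
                if (policies[i].nr == nr)
                        return policies[i].denied[type] ? -EINVAL : 0;

        return 0;       /* unlisted ioctls are allowed for every VM type */
}

int main(void)
{
        printf("SMI on SEV-ES:      %d\n", ioctl_allowed(VM_TYPE_SEV_ES, X86_IOCTL_SMI));
        printf("GET_REGS on SEV-ES: %d\n", ioctl_allowed(VM_TYPE_SEV_ES, X86_IOCTL_GET_REGS));
        printf("GET_REGS on TDX:    %d\n", ioctl_allowed(VM_TYPE_TDX, X86_IOCTL_GET_REGS));
        return 0;
}

The table contents mirror the thread's conclusions (SMI rejected for SEV-ES and TDX, register retrieval blocked only for TDX), but whether a table or per-flag checks is the right mechanism is exactly the open question above.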
On 9/14/20 5:59 PM, Sean Christopherson wrote:
> On Mon, Sep 14, 2020 at 03:15:14PM -0500, Tom Lendacky wrote:
>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>
>> This patch series provides support for running SEV-ES guests under KVM.
>
> From the x86/VMX side of things, the GPR hooks are the only changes that I strongly dislike.
>
> For the vmsa_encrypted flag and related things like allow_debug(), I'd really like to aim for a common implementation between SEV-ES and TDX[*] from the get go, within reason obviously. From a code perspective, I don't think it will be too onerous as the basic tenets are quite similar, e.g. guest state is off limits, FPU state is autoswitched, etc..., but I suspect (or maybe worry?) that there are enough minor differences that we'll want a more generic way of marking ioctls() as disallowed to avoid having one-off checks all over the place.
>
> That being said, it may also be that there are some ioctls() that should be disallowed under SEV-ES, but aren't in this series. E.g. I assume kvm_vcpu_ioctl_smi() should be rejected as KVM can't do the necessary emulation (I assume this applies to vanilla SEV as well?).

Right, SMM isn't currently supported under SEV-ES. SEV does support SMM, though, since the register state can be altered to change over to the SMM register state. So the SMI ioctl() is ok for SEV.

>
> One thought to try and reconcile the differences between SEV-ES and TDX would be to explicitly list which ioctls() are and aren't supported and go from there? E.g. if there is 95% overlap then we probably don't need to get fancy with generic allow/deny.
>
> Given that we don't yet have publicly available KVM code for TDX, what if I generate and post a list of ioctls() that are denied by either SEV-ES or TDX, organized by the denier(s)? Then for the ioctls() that are denied by one and not the other, we add a brief explanation of why it's denied?
>
> If that sounds ok, I'll get the list and the TDX side of things posted tomorrow.

That sounds good.

Thanks,
Tom

>
> Thanks!
>
>
> [*] https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
>
On Tue, Sep 15, 2020 at 12:22:05PM -0500, Tom Lendacky wrote:
> On 9/14/20 5:59 PM, Sean Christopherson wrote:
> > On Mon, Sep 14, 2020 at 03:15:14PM -0500, Tom Lendacky wrote:
> >> From: Tom Lendacky <thomas.lendacky@amd.com>
> >>
> >> This patch series provides support for running SEV-ES guests under KVM.
> >
> > From the x86/VMX side of things, the GPR hooks are the only changes that I strongly dislike.
> >
> > For the vmsa_encrypted flag and related things like allow_debug(), I'd really like to aim for a common implementation between SEV-ES and TDX[*] from the get go, within reason obviously. From a code perspective, I don't think it will be too onerous as the basic tenets are quite similar, e.g. guest state is off limits, FPU state is autoswitched, etc..., but I suspect (or maybe worry?) that there are enough minor differences that we'll want a more generic way of marking ioctls() as disallowed to avoid having one-off checks all over the place.
> >
> > That being said, it may also be that there are some ioctls() that should be disallowed under SEV-ES, but aren't in this series. E.g. I assume kvm_vcpu_ioctl_smi() should be rejected as KVM can't do the necessary emulation (I assume this applies to vanilla SEV as well?).
>
> Right, SMM isn't currently supported under SEV-ES. SEV does support SMM, though, since the register state can be altered to change over to the SMM register state. So the SMI ioctl() is ok for SEV.

But isn't guest memory inaccessible for SEV? E.g. how does KVM emulate the save/restore to/from SMRAM?
On 9/15/20 12:32 PM, Sean Christopherson wrote:
> On Tue, Sep 15, 2020 at 12:22:05PM -0500, Tom Lendacky wrote:
>> On 9/14/20 5:59 PM, Sean Christopherson wrote:
>>> On Mon, Sep 14, 2020 at 03:15:14PM -0500, Tom Lendacky wrote:
>>>> From: Tom Lendacky <thomas.lendacky@amd.com>
>>>>
>>>> This patch series provides support for running SEV-ES guests under KVM.
>>>
>>> From the x86/VMX side of things, the GPR hooks are the only changes that I strongly dislike.
>>>
>>> For the vmsa_encrypted flag and related things like allow_debug(), I'd really like to aim for a common implementation between SEV-ES and TDX[*] from the get go, within reason obviously. From a code perspective, I don't think it will be too onerous as the basic tenets are quite similar, e.g. guest state is off limits, FPU state is autoswitched, etc..., but I suspect (or maybe worry?) that there are enough minor differences that we'll want a more generic way of marking ioctls() as disallowed to avoid having one-off checks all over the place.
>>>
>>> That being said, it may also be that there are some ioctls() that should be disallowed under SEV-ES, but aren't in this series. E.g. I assume kvm_vcpu_ioctl_smi() should be rejected as KVM can't do the necessary emulation (I assume this applies to vanilla SEV as well?).
>>
>> Right, SMM isn't currently supported under SEV-ES. SEV does support SMM, though, since the register state can be altered to change over to the SMM register state. So the SMI ioctl() is ok for SEV.
>
> But isn't guest memory inaccessible for SEV? E.g. how does KVM emulate the save/restore to/from SMRAM?

In SEV, to support SMM, the guest BIOS (OVMF) maps the SMM state save area as unencrypted. This allows KVM to access the SMM state save area unencrypted. SVM also provides intercepts for RSM, so KVM does not need to fetch and decode the instruction bytes to know whether the VMEXIT was due to exiting from SMM mode.
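As a rough illustration of the rule described above, a standalone sketch (not OVMF or KVM code; the structure and field names are hypothetical) of the host refusing to touch an SMM state-save area unless the guest mapped it unencrypted:

/*
 * Standalone model of the access rule above, not OVMF/KVM code: the host
 * may read an SMM state-save area only when the guest has mapped it
 * unencrypted (shared).  Structure and field names are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct smm_save_area {
        bool mapped_encrypted;  /* derived from the guest's page encryption bit */
        uint8_t data[512];      /* register state saved on SMI entry */
};

/* Copies the save area out, or fails if the guest kept it encrypted. */
int host_read_smm_state(const struct smm_save_area *sa, uint8_t *out)
{
        if (sa->mapped_encrypted)
                return -EPERM;  /* SEV-ES: register state is never host-readable */

        memcpy(out, sa->data, sizeof(sa->data));
        return 0;               /* plain SEV with OVMF's unencrypted mapping */
}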
On Tue, Sep 15, 2020 at 12:22:05PM -0500, Tom Lendacky wrote:
> On 9/14/20 5:59 PM, Sean Christopherson wrote:
> > Given that we don't yet have publicly available KVM code for TDX, what if I generate and post a list of ioctls() that are denied by either SEV-ES or TDX, organized by the denier(s)? Then for the ioctls() that are denied by one and not the other, we add a brief explanation of why it's denied?
> >
> > If that sounds ok, I'll get the list and the TDX side of things posted tomorrow.
>
> That sounds good.

TDX completely blocks the following ioctl()s:

  kvm_vcpu_ioctl_interrupt
  kvm_vcpu_ioctl_smi
  kvm_vcpu_ioctl_x86_setup_mce
  kvm_vcpu_ioctl_x86_set_mce
  kvm_vcpu_ioctl_x86_get_debugregs
  kvm_vcpu_ioctl_x86_set_debugregs
  kvm_vcpu_ioctl_x86_get_xsave
  kvm_vcpu_ioctl_x86_set_xsave
  kvm_vcpu_ioctl_x86_get_xcrs
  kvm_vcpu_ioctl_x86_set_xcrs
  kvm_arch_vcpu_ioctl_get_regs
  kvm_arch_vcpu_ioctl_set_regs
  kvm_arch_vcpu_ioctl_get_sregs
  kvm_arch_vcpu_ioctl_set_sregs
  kvm_arch_vcpu_ioctl_set_guest_debug
  kvm_arch_vcpu_ioctl_get_fpu
  kvm_arch_vcpu_ioctl_set_fpu

Looking through the code, I think kvm_arch_vcpu_ioctl_get_mpstate() and kvm_arch_vcpu_ioctl_set_mpstate() should also be disallowed, we just haven't actually done so.

There are also two helper functions that are "blocked". dm_request_for_irq_injection() returns false if guest_state_protected, and post_kvm_run_save() shoves dummy state.

TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.

kvm_vcpu_ioctl_x86_get_vcpu_events:
        if (!vcpu->kvm->arch.guest_state_protected)
                events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);

kvm_arch_vcpu_put:
        if (vcpu->preempted && !vcpu->kvm->arch.guest_state_protected)
                vcpu->arch.preempted_in_kernel = !kvm_x86_ops.get_cpl(vcpu);

kvm_vcpu_ioctl_x86_set_vcpu_events:
        u32 allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING |
                            KVM_VCPUEVENT_VALID_SIPI_VECTOR |
                            KVM_VCPUEVENT_VALID_SHADOW |
                            KVM_VCPUEVENT_VALID_SMM |
                            KVM_VCPUEVENT_VALID_PAYLOAD;

        if (vcpu->kvm->arch.guest_state_protected)
                allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING;

kvm_arch_vcpu_ioctl_run:
        if (vcpu->kvm->arch.guest_state_protected)
                kvm_sync_valid_fields = KVM_SYNC_X86_EVENTS;
        else
                kvm_sync_valid_fields = KVM_SYNC_X86_VALID_FIELDS;

In addition to the more generic guest_state_protected, we also (obviously tentatively) have a few other flags to deal with aspects of TDX that I'm fairly certain don't apply to SEV-ES:

  tsc_immutable - KVM doesn't have write access to the TSC offset of the guest.

  eoi_intercept_unsupported - KVM can't intercept EOIs (doesn't have access to EOI bitmaps) and so can't support level triggered interrupts, at least not without extra pain.

  readonly_mem_unsupported - Secure EPT (analogous to SNP) requires RWX permissions for all private/encrypted memory. S-EPT isn't optional, so we get the joy of adding this right off the bat...
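The quoted hunks share one shape: a set of allowed capabilities that collapses when guest_state_protected is set. A standalone model of that pattern (placeholder flag names, not the real KVM_VCPUEVENT_* values; only the narrowing logic mirrors the kvm_vcpu_ioctl_x86_set_vcpu_events() hunk above):

/*
 * Standalone model of the flag-narrowing pattern quoted above, not the
 * actual x86.c code.  Flag names and values are placeholders.
 */
#include <stdbool.h>
#include <stdint.h>

#define EVT_VALID_NMI_PENDING   (1u << 0)
#define EVT_VALID_SIPI_VECTOR   (1u << 1)
#define EVT_VALID_SHADOW        (1u << 2)
#define EVT_VALID_SMM           (1u << 3)
#define EVT_VALID_PAYLOAD       (1u << 4)

uint32_t allowed_event_flags(bool guest_state_protected)
{
        /* Everything userspace may normally set... */
        uint32_t allowed = EVT_VALID_NMI_PENDING | EVT_VALID_SIPI_VECTOR |
                           EVT_VALID_SHADOW | EVT_VALID_SMM |
                           EVT_VALID_PAYLOAD;

        /* ...shrinks to "NMIs only" once the guest's state is protected. */
        if (guest_state_protected)
                allowed = EVT_VALID_NMI_PENDING;

        return allowed;
}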
Apologies, Sean. I thought I had replied to this but found it instead in my drafts folder... I've taken much of your feedback and incorporated that into the next version of the patches that I submitted and updated this response based on that, too.

On 9/15/20 7:19 PM, Sean Christopherson wrote:
> On Tue, Sep 15, 2020 at 12:22:05PM -0500, Tom Lendacky wrote:
>> On 9/14/20 5:59 PM, Sean Christopherson wrote:
>>> Given that we don't yet have publicly available KVM code for TDX, what if I generate and post a list of ioctls() that are denied by either SEV-ES or TDX, organized by the denier(s)? Then for the ioctls() that are denied by one and not the other, we add a brief explanation of why it's denied?
>>>
>>> If that sounds ok, I'll get the list and the TDX side of things posted tomorrow.
>>
>> That sounds good.
>
> TDX completely blocks the following ioctl()s:

SEV-ES doesn't need to completely block these ioctls. SEV-SNP is likely to do more of that. SEV-ES will still allow interrupts to be injected, or registers to be retrieved (which will only contain what was provided in the GHCB exchange), etc.

>
>   kvm_vcpu_ioctl_interrupt
>   kvm_vcpu_ioctl_smi
>   kvm_vcpu_ioctl_x86_setup_mce
>   kvm_vcpu_ioctl_x86_set_mce
>   kvm_vcpu_ioctl_x86_get_debugregs
>   kvm_vcpu_ioctl_x86_set_debugregs
>   kvm_vcpu_ioctl_x86_get_xsave
>   kvm_vcpu_ioctl_x86_set_xsave
>   kvm_vcpu_ioctl_x86_get_xcrs
>   kvm_vcpu_ioctl_x86_set_xcrs
>   kvm_arch_vcpu_ioctl_get_regs
>   kvm_arch_vcpu_ioctl_set_regs
>   kvm_arch_vcpu_ioctl_get_sregs
>   kvm_arch_vcpu_ioctl_set_sregs
>   kvm_arch_vcpu_ioctl_set_guest_debug
>   kvm_arch_vcpu_ioctl_get_fpu
>   kvm_arch_vcpu_ioctl_set_fpu

Of the listed ioctls, really the only ones I've updated are:

  kvm_vcpu_ioctl_x86_get_xsave
  kvm_vcpu_ioctl_x86_set_xsave
  kvm_arch_vcpu_ioctl_get_sregs
    This allows reading of the tracking value registers
  kvm_arch_vcpu_ioctl_set_sregs
    This prevents setting of register values
  kvm_arch_vcpu_ioctl_set_guest_debug
  kvm_arch_vcpu_ioctl_get_fpu
  kvm_arch_vcpu_ioctl_set_fpu

>
> Looking through the code, I think kvm_arch_vcpu_ioctl_get_mpstate() and kvm_arch_vcpu_ioctl_set_mpstate() should also be disallowed, we just haven't actually done so.

I haven't done anything with these either.

>
> There are also two helper functions that are "blocked". dm_request_for_irq_injection() returns false if guest_state_protected, and post_kvm_run_save() shoves dummy state.

... and these.

>
> TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.

For SEV-ES, we don't have those restrictions.
>
> kvm_vcpu_ioctl_x86_get_vcpu_events:
>         if (!vcpu->kvm->arch.guest_state_protected)
>                 events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);
>
> kvm_arch_vcpu_put:
>         if (vcpu->preempted && !vcpu->kvm->arch.guest_state_protected)
>                 vcpu->arch.preempted_in_kernel = !kvm_x86_ops.get_cpl(vcpu);
>
> kvm_vcpu_ioctl_x86_set_vcpu_events:
>         u32 allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING |
>                             KVM_VCPUEVENT_VALID_SIPI_VECTOR |
>                             KVM_VCPUEVENT_VALID_SHADOW |
>                             KVM_VCPUEVENT_VALID_SMM |
>                             KVM_VCPUEVENT_VALID_PAYLOAD;
>
>         if (vcpu->kvm->arch.guest_state_protected)
>                 allowed_flags = KVM_VCPUEVENT_VALID_NMI_PENDING;
>
> kvm_arch_vcpu_ioctl_run:
>         if (vcpu->kvm->arch.guest_state_protected)
>                 kvm_sync_valid_fields = KVM_SYNC_X86_EVENTS;
>         else
>                 kvm_sync_valid_fields = KVM_SYNC_X86_VALID_FIELDS;
>
> In addition to the more generic guest_state_protected, we also (obviously tentatively) have a few other flags to deal with aspects of TDX that I'm fairly certain don't apply to SEV-ES:
>
>   tsc_immutable - KVM doesn't have write access to the TSC offset of the guest.
>
>   eoi_intercept_unsupported - KVM can't intercept EOIs (doesn't have access to EOI bitmaps) and so can't support level triggered interrupts, at least not without extra pain.
>
>   readonly_mem_unsupported - Secure EPT (analogous to SNP) requires RWX permissions for all private/encrypted memory. S-EPT isn't optional, so we get the joy of adding this right off the bat...

Yes, most of the above stuff doesn't apply to SEV-ES.

Thanks,
Tom

>
On 16/09/20 02:19, Sean Christopherson wrote:
>
> TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.
>
> kvm_vcpu_ioctl_x86_get_vcpu_events:
>         if (!vcpu->kvm->arch.guest_state_protected)
>                 events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);

Perhaps an alternative implementation can enter the vCPU with immediate exit until no events are pending, and then return all zeroes?

Paolo
On 11/30/20 9:31 AM, Paolo Bonzini wrote:
> On 16/09/20 02:19, Sean Christopherson wrote:
>>
>> TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.
>>
>> kvm_vcpu_ioctl_x86_get_vcpu_events:
>>         if (!vcpu->kvm->arch.guest_state_protected)
>>                 events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);
>
> Perhaps an alternative implementation can enter the vCPU with immediate exit until no events are pending, and then return all zeroes?

SEV-SNP has support for restricting injections, but SEV-ES does not. Perhaps a new boolean, guest_restricted_injection, can be used instead of basing it on guest_state_protected.

Thanks,
Tom

>
> Paolo
>
On Mon, Nov 30, 2020, Paolo Bonzini wrote:
> On 16/09/20 02:19, Sean Christopherson wrote:
> >
> > TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.
> >
> > kvm_vcpu_ioctl_x86_get_vcpu_events:
> >         if (!vcpu->kvm->arch.guest_state_protected)
> >                 events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);
>
> Perhaps an alternative implementation can enter the vCPU with immediate exit until no events are pending, and then return all zeroes?

This can't work. If the guest has STI blocking, e.g. it did STI->TDVMCALL with a valid vIRQ in GUEST_RVI, then events->interrupt.shadow should technically be non-zero to reflect the STI blocking. But, the immediate exit (a hardware IRQ for TDX guests) will cause VM-Exit before the guest can execute any instructions and thus the guest will never clear STI blocking and never consume the pending event. Or there could be a valid vIRQ, but GUEST_RFLAGS.IF=0, in which case KVM would need to run the guest for an indeterminate amount of time to wait for the vIRQ to be consumed.

Tangentially related, I haven't looked through the official external TDX docs, but I suspect that vmcs.GUEST_RVI is listed as inaccessible for production TDs. This will be changed as the VMM needs access to GUEST_RVI to handle STI->TDVMCALL(HLT), otherwise the VMM may incorrectly put the vCPU into a blocked (not runnable) state even though it has a pending wake event.
On Mon, Nov 30, 2020, Tom Lendacky wrote:
> On 11/30/20 9:31 AM, Paolo Bonzini wrote:
> > On 16/09/20 02:19, Sean Christopherson wrote:
> >>
> >> TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.
> >>
> >> kvm_vcpu_ioctl_x86_get_vcpu_events:
> >>         if (!vcpu->kvm->arch.guest_state_protected)
> >>                 events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);
> >
> > Perhaps an alternative implementation can enter the vCPU with immediate exit until no events are pending, and then return all zeroes?
>
> SEV-SNP has support for restricting injections, but SEV-ES does not. Perhaps a new boolean, guest_restricted_injection, can be used instead of basing it on guest_state_protected.

Ya, that probably makes sense. I suspect the easiest way to resolve these conflicts will be to land the SEV-ES series and then tweak things as needed for TDX. Easiest in the sense that it should be fairly obvious what can be covered by guest_state_protected and what needs a dedicated flag.
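A tiny standalone sketch of the two-flag split being converged on here (hypothetical struct and helper names; which KVM paths end up checking which flag is precisely what would be decided once the TDX code lands):

/*
 * Standalone sketch: two independent per-VM flags instead of overloading
 * guest_state_protected for everything.  Names are hypothetical.
 */
#include <errno.h>
#include <stdbool.h>

struct protected_vm_flags {
        bool guest_state_protected;      /* registers/FPU hidden from the host (SEV-ES, TDX) */
        bool guest_restricted_injection; /* direct event injection disallowed (TDX, SEV-SNP) */
};

int may_read_interrupt_shadow(const struct protected_vm_flags *f)
{
        return f->guest_state_protected ? -EPERM : 0;
}

int may_inject_event_directly(const struct protected_vm_flags *f)
{
        return f->guest_restricted_injection ? -EPERM : 0;
}

/* SEV-ES today: state is protected, but injection is still allowed. */
const struct protected_vm_flags sev_es_flags = { true, false };
/* TDX as described earlier in the thread: both restrictions apply. */
const struct protected_vm_flags tdx_flags = { true, true };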
On 30/11/20 19:14, Sean Christopherson wrote:
>>> TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.
>>>
>>> kvm_vcpu_ioctl_x86_get_vcpu_events:
>>>         if (!vcpu->kvm->arch.guest_state_protected)
>>>                 events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);
>> Perhaps an alternative implementation can enter the vCPU with immediate exit until no events are pending, and then return all zeroes?
>
> This can't work. If the guest has STI blocking, e.g. it did STI->TDVMCALL with a valid vIRQ in GUEST_RVI, then events->interrupt.shadow should technically be non-zero to reflect the STI blocking. But, the immediate exit (a hardware IRQ for TDX guests) will cause VM-Exit before the guest can execute any instructions and thus the guest will never clear STI blocking and never consume the pending event. Or there could be a valid vIRQ, but GUEST_RFLAGS.IF=0, in which case KVM would need to run the guest for an indeterminate amount of time to wait for the vIRQ to be consumed.

Delayed interrupts are fine, since they are injected according to RVI and the posted interrupt descriptor. I'm thinking more of events (exceptions and interrupts) that caused an EPT violation exit and were recorded in the IDT-vectored info field.

Paolo

> Tangentially related, I haven't looked through the official external TDX docs, but I suspect that vmcs.GUEST_RVI is listed as inaccessible for production TDs. This will be changed as the VMM needs access to GUEST_RVI to handle STI->TDVMCALL(HLT), otherwise the VMM may incorrectly put the vCPU into a blocked (not runnable) state even though it has a pending wake event.
>
+Isaku and Xiaoyao

On Mon, Nov 30, 2020, Paolo Bonzini wrote:
> On 30/11/20 19:14, Sean Christopherson wrote:
> > > > TDX also selectively blocks/skips portions of other ioctl()s so that the TDX code itself can yell loudly if e.g. .get_cpl() is invoked. The event injection restrictions are due to direct injection not being allowed (except for NMIs); all IRQs have to be routed through APICv (posted interrupts) and exception injection is completely disallowed.
> > > >
> > > > kvm_vcpu_ioctl_x86_get_vcpu_events:
> > > >         if (!vcpu->kvm->arch.guest_state_protected)
> > > >                 events->interrupt.shadow = kvm_x86_ops.get_interrupt_shadow(vcpu);
> > > Perhaps an alternative implementation can enter the vCPU with immediate exit until no events are pending, and then return all zeroes?
> >
> > This can't work. If the guest has STI blocking, e.g. it did STI->TDVMCALL with a valid vIRQ in GUEST_RVI, then events->interrupt.shadow should technically be non-zero to reflect the STI blocking. But, the immediate exit (a hardware IRQ for TDX guests) will cause VM-Exit before the guest can execute any instructions and thus the guest will never clear STI blocking and never consume the pending event. Or there could be a valid vIRQ, but GUEST_RFLAGS.IF=0, in which case KVM would need to run the guest for an indeterminate amount of time to wait for the vIRQ to be consumed.
>
> Delayed interrupts are fine, since they are injected according to RVI and the posted interrupt descriptor. I'm thinking more of events (exceptions and interrupts) that caused an EPT violation exit and were recorded in the IDT-vectored info field.

Ah. As is, I don't believe KVM has access to this information. TDX-Module handles the actual EPT violation, as well as event reinjection. The EPT violation reported by SEAMRET is synthesized, and IIRC the IDT-vectoring field is not readable.

Regardless, is there an actual problem with having a "pending" exception that isn't reported to userspace? Obviously the info needs to be migrated, but that will be taken care of by virtue of migrating the VMCS.
On 30/11/20 20:35, Sean Christopherson wrote:
>> Delayed interrupts are fine, since they are injected according to RVI and the posted interrupt descriptor. I'm thinking more of events (exceptions and interrupts) that caused an EPT violation exit and were recorded in the IDT-vectored info field.
>
> Ah. As is, I don't believe KVM has access to this information. TDX-Module handles the actual EPT violation, as well as event reinjection. The EPT violation reported by SEAMRET is synthesized, and IIRC the IDT-vectoring field is not readable.
>
> Regardless, is there an actual problem with having a "pending" exception that isn't reported to userspace? Obviously the info needs to be migrated, but that will be taken care of by virtue of migrating the VMCS.

No problem, I suppose we would just have to get used to not being able to look into the state of migrated VMs.

Paolo
From: Tom Lendacky <thomas.lendacky@amd.com>

This patch series provides support for running SEV-ES guests under KVM.

Secure Encrypted Virtualization - Encrypted State (SEV-ES) expands on the SEV support to protect the guest register state from the hypervisor. See "AMD64 Architecture Programmer's Manual Volume 2: System Programming", section "15.35 Encrypted State (SEV-ES)" [1].

In order to allow a hypervisor to perform functions on behalf of a guest, there is architectural support for notifying a guest's operating system when certain types of VMEXITs are about to occur. This allows the guest to selectively share information with the hypervisor to satisfy the requested function. The notification is performed using a new exception, the VMM Communication exception (#VC). The information is shared through the Guest-Hypervisor Communication Block (GHCB) using the VMGEXIT instruction. The GHCB format and the protocol for using it is documented in "SEV-ES Guest-Hypervisor Communication Block Standardization" [2].

Under SEV-ES, a vCPU save area (VMSA) must be encrypted. SVM is updated to build the initial VMSA and then encrypt it before running the guest. Once encrypted, it must not be modified by the hypervisor. Modification of the VMSA will result in the VMRUN instruction failing with a SHUTDOWN exit code. KVM must support the VMGEXIT exit code in order to perform the necessary functions required of the guest. The GHCB is used to exchange the information needed by both the hypervisor and the guest.

To simplify access to the VMSA and the GHCB, SVM uses an accessor function to obtain the address of either the VMSA or the GHCB, depending on the stage of execution of the guest.

There are changes to some of the intercepts that are needed under SEV-ES. For example, CR0 writes cannot be intercepted, so the code needs to ensure that the intercept is not enabled during execution or that the hypervisor does not try to read the register as part of exit processing. Another example is shutdown processing, where the vCPU cannot be directly reset.

Support is added to handle VMGEXIT events and implement the GHCB protocol. This ranges from supporting standard exit events, like a CPUID instruction intercept, to new support for things like AP processor booting. Much of the existing SVM intercept support can be re-used by setting the exit code information from the VMGEXIT and calling the appropriate intercept handlers.

Finally, to launch and run an SEV-ES guest requires changes to the vCPU initialization, loading and execution.

[1] https://www.amd.com/system/files/TechDocs/24593.pdf
[2] https://developer.amd.com/wp-content/resources/56421.pdf

---

These patches are based on a commit of the KVM next branch. However, I had to backport recent SEV-ES guest patches (a previous series to the actual patches that are now in the tip tree) into my development branch, since there are prereq patches needed by this series. As a result, this patch series will not successfully build or apply to the KVM next branch as is.
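As a rough model of the VMGEXIT dispatch described in the cover letter above (not the actual svm.c/sev.c implementation; the structures, exit-code value and handler are hypothetical), a VMGEXIT handler can copy the GHCB's exit information into the usual exit fields and fall through to the existing per-exit-code handlers:

/*
 * Standalone sketch of the VMGEXIT dispatch pattern, not KVM code.
 * Structures, exit-code value and handler are hypothetical models.
 */
#include <errno.h>
#include <stdint.h>

struct ghcb_model {
        uint64_t sw_exit_code;          /* exit reason chosen by the guest's #VC handler */
        uint64_t sw_exit_info_1;
        uint64_t sw_exit_info_2;
};

struct vcpu_model {
        uint64_t exit_code;
        uint64_t exit_info_1;
        uint64_t exit_info_2;
};

#define MODEL_EXIT_CPUID        0x72    /* placeholder exit-code value */

static int handle_cpuid(struct vcpu_model *vcpu)
{
        (void)vcpu;                     /* an existing intercept handler would run here */
        return 1;
}

int handle_vmgexit(struct vcpu_model *vcpu, const struct ghcb_model *ghcb)
{
        /* Make the VMGEXIT look like a normal VMEXIT... */
        vcpu->exit_code   = ghcb->sw_exit_code;
        vcpu->exit_info_1 = ghcb->sw_exit_info_1;
        vcpu->exit_info_2 = ghcb->sw_exit_info_2;

        /* ...so the existing per-exit-code handlers can be reused. */
        switch (vcpu->exit_code) {
        case MODEL_EXIT_CPUID:
                return handle_cpuid(vcpu);
        default:
                return -EINVAL;         /* a real implementation validates the GHCB contents */
        }
}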
A version of the tree can be found at:
https://github.com/AMDESE/linux/tree/sev-es-5.8-v3

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wanpeng Li <wanpengli@tencent.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>

Tom Lendacky (35):
  KVM: SVM: Remove the call to sev_platform_status() during setup
  KVM: SVM: Add support for SEV-ES capability in KVM
  KVM: SVM: Add indirect access to the VM save area
  KVM: SVM: Make GHCB accessor functions available to the hypervisor
  KVM: SVM: Add initial support for SEV-ES GHCB access to KVM
  KVM: SVM: Add required changes to support intercepts under SEV-ES
  KVM: SVM: Modify DRx register intercepts for an SEV-ES guest
  KVM: SVM: Prevent debugging under SEV-ES
  KVM: SVM: Do not emulate MMIO under SEV-ES
  KVM: SVM: Cannot re-initialize the VMCB after shutdown with SEV-ES
  KVM: SVM: Prepare for SEV-ES exit handling in the sev.c file
  KVM: SVM: Add initial support for a VMGEXIT VMEXIT
  KVM: SVM: Create trace events for VMGEXIT processing
  KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x002
  KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x004
  KVM: SVM: Add support for SEV-ES GHCB MSR protocol function 0x100
  KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
  KVM: SVM: Support MMIO for an SEV-ES guest
  KVM: SVM: Support port IO operations for an SEV-ES guest
  KVM: SVM: Add SEV/SEV-ES support for intercepting INVD
  KVM: SVM: Add support for EFER write traps for an SEV-ES guest
  KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
  KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
  KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
  KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
  KVM: SVM: Add support for booting APs for an SEV-ES guest
  KVM: X86: Update kvm_skip_emulated_instruction() for an SEV-ES guest
  KVM: SVM: Add NMI support for an SEV-ES guest
  KVM: SVM: Set the encryption mask for the SVM host save area
  KVM: SVM: Update ASID allocation to support SEV-ES guests
  KVM: SVM: Provide support for SEV-ES vCPU creation/loading
  KVM: SVM: Provide support for SEV-ES vCPU loading
  KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
  KVM: SVM: Provide support to launch and run an SEV-ES guest

 arch/x86/include/asm/kvm_host.h  |  16 +
 arch/x86/include/asm/msr-index.h |   1 +
 arch/x86/include/asm/svm.h       |  35 +-
 arch/x86/include/uapi/asm/svm.h  |  28 ++
 arch/x86/kernel/cpu/vmware.c     |  12 +-
 arch/x86/kvm/Kconfig             |   3 +-
 arch/x86/kvm/cpuid.c             |   1 +
 arch/x86/kvm/kvm_cache_regs.h    |  30 +-
 arch/x86/kvm/mmu/mmu.c           |   7 +
 arch/x86/kvm/svm/nested.c        | 125 +++---
 arch/x86/kvm/svm/sev.c           | 590 ++++++++++++++++++++++++--
 arch/x86/kvm/svm/svm.c           | 704 ++++++++++++++++++++++++-------
 arch/x86/kvm/svm/svm.h           | 357 ++++++++++++++--
 arch/x86/kvm/svm/vmenter.S       |  50 +++
 arch/x86/kvm/trace.h             |  99 +++++
 arch/x86/kvm/vmx/vmx.c           |   7 +
 arch/x86/kvm/x86.c               | 357 ++++++++++++++--
 arch/x86/kvm/x86.h               |   8 +
 18 files changed, 2070 insertions(+), 360 deletions(-)