kvm: nVMX: flush TLB when decoded insn != VM-exit reason

Message ID: 20200616224305.44242-1-oupton@google.com

Commit Message

Oliver Upton June 16, 2020, 10:43 p.m. UTC
It is possible for the instruction emulator to decode a different
instruction from the one implied by the VM-exit information that
hardware provides in vmcs02. Such is the case when the TLB entry for the
guest's RIP is out of sync with the corresponding page-table mapping,
e.g. when a page installation isn't followed by a TLB flush.

Currently, KVM refuses to emulate in these scenarios, instead injecting
a #UD into L2. While this does address the security risk of
CVE-2020-2732, it can result in spurious #UDs in the L2 guest. Fix this
by instead flushing the TLB and then resuming L2, allowing hardware to
generate the appropriate VM-exit, which is then reflected into L1.

Special handling is also required for the RSM and RDTSCP instructions.
RDTSCP may be emulated on hardware which doesn't support it, in which
case hardware will not generate an RDTSCP VM-exit on L2 resume. The
dual-monitor treatment of SMM is not supported in nVMX, which implies
that L0 should never handle an RSM instruction; resuming the guest would
only result in another #UD. Avoid getting stuck in a loop on these
instructions by injecting a #UD for RSM and synthesizing the appropriate
VM-exit for RDTSCP.

Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
---
 arch/x86/kvm/emulate.c     |  2 ++
 arch/x86/kvm/kvm_emulate.h |  1 +
 arch/x86/kvm/vmx/vmx.c     | 68 ++++++++++++++++++++++++++++----------
 arch/x86/kvm/x86.c         |  2 +-
 4 files changed, 55 insertions(+), 18 deletions(-)
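
For illustration, the stale-TLB scenario looks roughly like this from
L2's point of view (hypothetical guest-side code, not part of the patch;
set_pte() and pte_t are the usual kernel paging primitives):

	/*
	 * The guest remaps the page backing 'va' but skips the INVLPG,
	 * leaving a stale (I)TLB entry in place.
	 */
	static void remap_without_flush(unsigned long va, pte_t *ptep,
					pte_t new_pte)
	{
		set_pte(ptep, new_pte);	/* install the new mapping */
		/*
		 * An invlpg for 'va' is missing here, so a subsequent
		 * instruction fetch from 'va' can hit the stale TLB entry
		 * and execute bytes from the old page; the VM-exit info
		 * recorded by hardware then disagrees with the instruction
		 * KVM's emulator decodes from memory.
		 */
	}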

Comments

Oliver Upton June 22, 2020, 10:38 p.m. UTC | #1
On Tue, Jun 16, 2020 at 3:43 PM Oliver Upton <oupton@google.com> wrote:
> [...]

Ping :)
Paolo Bonzini June 22, 2020, 10:53 p.m. UTC | #2
On 17/06/20 00:43, Oliver Upton wrote:
> -		if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC))
> -			return X86EMUL_CONTINUE;
> -
> -		/* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED.  */
> +		intercepted = nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC);
>  		break;
>  

[...]

> +	/*
> +	 * The only uses of the emulator in VMX for instructions which may be
> +	 * intercepted are port IO instructions, descriptor-table accesses, and
> +	 * the RDTSCP instruction. As such, if the emulator has decoded an
> +	 * instruction that is different from the VM-exit provided by hardware
> +	 * it is likely that the TLB entry and page-table mapping for the
> +	 * guest's RIP are out of sync.
> +	 *
> +	 * Rather than synthesizing a VM-exit into L1 for every possible
> +	 * instruction just flush the TLB, resume L2, and let hardware generate
> +	 * the appropriate VM-exit.
> +	 */

So you're saying that (in the SECONDARY_EXEC_DESC case above, for
example) this should have been handled earlier by
nested_vmx_l1_wants_exit().  But what about LGDT and friends from an MMIO
address?  I would say we could just not care, but an infinite #UD loop
is an ugly failure mode.

(Or perhaps it's just not my day for reviewing code...).

Paolo
Sean Christopherson Aug. 13, 2020, 5:03 p.m. UTC | #3
On Tue, Jun 16, 2020 at 10:43:05PM +0000, Oliver Upton wrote:
> [...]
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index d0e2825ae617..6e56e7a29ba1 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -5812,6 +5812,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
>  	}
>  	if (rc == X86EMUL_INTERCEPTED)
>  		return EMULATION_INTERCEPTED;
> +	if (rc == X86EMUL_RETRY_INSTR)
> +		return EMULATION_RETRY_INSTR;
>  
>  	if (rc == X86EMUL_CONTINUE)
>  		writeback_registers(ctxt);
> diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> index 43c93ffa76ed..5bfab8d65cd1 100644
> --- a/arch/x86/kvm/kvm_emulate.h
> +++ b/arch/x86/kvm/kvm_emulate.h
> @@ -496,6 +496,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
>  #define EMULATION_OK 0
>  #define EMULATION_RESTART 1
>  #define EMULATION_INTERCEPTED 2
> +#define EMULATION_RETRY_INSTR 3
>  void init_decode_cache(struct x86_emulate_ctxt *ctxt);
>  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
>  int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 08e26a9518c2..ebfafd7837ba 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7329,12 +7329,11 @@ static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
>  	to_vmx(vcpu)->req_immediate_exit = true;
>  }
>  
> -static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
> -				  struct x86_instruction_info *info)
> +static bool vmx_check_intercept_io(struct kvm_vcpu *vcpu,
> +				   struct x86_instruction_info *info)
>  {
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  	unsigned short port;
> -	bool intercept;
>  	int size;
>  
>  	if (info->intercept == x86_intercept_in ||
> @@ -7354,13 +7353,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
>  	 * Otherwise, IO instruction VM-exits are controlled by the IO bitmaps.
>  	 */
>  	if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
> -		intercept = nested_cpu_has(vmcs12,
> -					   CPU_BASED_UNCOND_IO_EXITING);
> -	else
> -		intercept = nested_vmx_check_io_bitmaps(vcpu, port, size);
> +		return nested_cpu_has(vmcs12,
> +				      CPU_BASED_UNCOND_IO_EXITING);
>  
> -	/* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED.  */
> -	return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
> +	return nested_vmx_check_io_bitmaps(vcpu, port, size);

It might be a slightly bigger patch, but I think it'll be cleaner code in the
end if this section is reordered to:

        /*
         * If the 'use IO bitmaps' VM-execution control is 1, IO instruction
         * VM-exits are controlled by the IO bitmaps, otherwise they depend
         * on the 'unconditional IO exiting' VM-execution control.
         */
        if (nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
                return nested_vmx_check_io_bitmaps(vcpu, port, size);

        return nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING);

>  }
>  
>  static int vmx_check_intercept(struct kvm_vcpu *vcpu,
> @@ -7369,6 +7365,7 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
>  			       struct x86_exception *exception)
>  {
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	bool intercepted;
>  
>  	switch (info->intercept) {
>  	/*
> @@ -7381,13 +7378,27 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
>  			exception->error_code_valid = false;
>  			return X86EMUL_PROPAGATE_FAULT;
>  		}
> +
> +		intercepted = nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
> +
> +		/*
> +		 * RDTSCP could be emulated on a CPU which doesn't support it.
> +		 * As such, flushing the TLB and resuming L2 will result in
> +		 * another #UD rather than a VM-exit to reflect into L1.
> +		 * Instead, synthesize the VM-exit here.
> +		 */
> +		if (intercepted) {
> +			nested_vmx_vmexit(vcpu, EXIT_REASON_RDTSCP, 0, 0);
> +			return X86EMUL_INTERCEPTED;
> +		}

Maybe this instead?

                if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_RDTSCP)) {
                        exception->vector = UD_VECTOR;
                        exception->error_code_valid = false;
                        return X86EMUL_PROPAGATE_FAULT;
                } else if (nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING)) {
                        /*
                         * RDTSCP could be emulated on a CPU which doesn't
                         * support it.  As such, flushing the TLB and resuming
                         * L2 will result in another #UD rather than a VM-exit
                         * to reflect into L1.  Instead, synthesize the VM-exit
                         * here.
                         */
                        nested_vmx_vmexit(vcpu, EXIT_REASON_RDTSCP, 0, 0);
                        return X86EMUL_INTERCEPTED;
                }
                intercepted = false;


>  		break;
>  
>  	case x86_intercept_in:
>  	case x86_intercept_ins:
>  	case x86_intercept_out:
>  	case x86_intercept_outs:
> -		return vmx_check_intercept_io(vcpu, info);
> +		intercepted = vmx_check_intercept_io(vcpu, info);
> +		break;
>  
>  	case x86_intercept_lgdt:
>  	case x86_intercept_lidt:
> @@ -7397,18 +7408,41 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
>  	case x86_intercept_sidt:
>  	case x86_intercept_sldt:
>  	case x86_intercept_str:
> -		if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC))
> -			return X86EMUL_CONTINUE;
> -
> -		/* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED.  */
> +		intercepted = nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC);
>  		break;
>  
> -	/* TODO: check more intercepts... */
> +	/*
> +	 * The dual-monitor treatment of SMM is not supported in nVMX. As such,
> +	 * L0 will never handle the RSM instruction nor should it retry
> +	 * instruction execution. Instead, a #UD should be injected into the
> +	 * guest for the execution of RSM outside of SMM.
> +	 */
> +	case x86_intercept_rsm:
> +		exception->vector = UD_VECTOR;
> +		exception->error_code_valid = false;
> +		return X86EMUL_PROPAGATE_FAULT;

Why does RSM need special treatment?  Won't it just naturally #UD if we fall
through to the flush-and-retry path?

>  	default:
> -		break;
> +		intercepted = true;
>  	}
>  
> -	return X86EMUL_UNHANDLEABLE;
> +	if (!intercepted)
> +		return X86EMUL_CONTINUE;
> +
> +	/*
> +	 * The only uses of the emulator in VMX for instructions which may be
> +	 * intercepted are port IO instructions, descriptor-table accesses, and
> +	 * the RDTSCP instruction. As such, if the emulator has decoded an

I wouldn't list out the individual cases, it's pretty obvious what can be
emulated by looking at the above code, and inevitably something will be added
that requires updating this comment.

> +	 * instruction that is different from the VM-exit provided by hardware
> +	 * it is likely that the TLB entry and page-table mapping for the

Probably better to avoid talking about "page-table mapping", because it's not
clear which page tables are being referenced.

> +	 * guest's RIP are out of sync.

Maybe something like:

	/*
	 * There are very few instructions that KVM will emulate for L2 and that
	 * can also be intercepted by L1.  If the emulator decoded an instruction
	 * that is different from the VM-exit provided by hardware, the TLB
	 * entry for the guest's RIP is likely stale.  Rather than synthesizing a
	 * VM-exit into L1 for every possible instruction, just flush the TLB,
	 * resume L2, and let hardware generate the appropriate VM-exit.
	 */

> +	 *
> +	 * Rather than synthesizing a VM-exit into L1 for every possible
> +	 * instruction just flush the TLB, resume L2, and let hardware generate
> +	 * the appropriate VM-exit.
> +	 */
> +	vmx_flush_tlb_gva(vcpu, kvm_rip_read(vcpu));

This is wrong, it should flush kvm_get_linear_rip(vcpu).
 
Oliver Upton Aug. 13, 2020, 8:44 p.m. UTC | #4
On Thu, Aug 13, 2020 at 12:03 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> [...]
> > +      *
> > +      * Rather than synthesizing a VM-exit into L1 for every possible
> > +      * instruction just flush the TLB, resume L2, and let hardware generate
> > +      * the appropriate VM-exit.
> > +      */
> > +     vmx_flush_tlb_gva(vcpu, kvm_rip_read(vcpu));
>
> This is wrong, it should flush kvm_get_linear_rip(vcpu).
>

I do not believe that the aim of this patch will work anymore, since:

1dbf5d68af6f ("KVM: VMX: Add guest physical address check in EPT
violation and misconfig")

Since it is possible to get into the emulator on any instruction that
induces an EPT violation, we'd wind up looping when we believe the
instruction needs to exit to L1 (TLB flush, resume guest, hit the same
EPT violation. Rinse, wash, repeat).

Sean Christopherson Aug. 13, 2020, 11:59 p.m. UTC | #5
On Thu, Aug 13, 2020 at 03:44:08PM -0500, Oliver Upton wrote:
> On Thu, Aug 13, 2020 at 12:03 PM Sean Christopherson
> > > +      *
> > > +      * Rather than synthesizing a VM-exit into L1 for every possible
> > > +      * instruction just flush the TLB, resume L2, and let hardware generate
> > > +      * the appropriate VM-exit.
> > > +      */
> > > +     vmx_flush_tlb_gva(vcpu, kvm_rip_read(vcpu));
> >
> > This is wrong, it should flush kvm_get_linear_rip(vcpu).
> >
> 
> I do not believe that the aim of this patch will work anymore, since:
> 
> 1dbf5d68af6f ("KVM: VMX: Add guest physical address check in EPT
> violation and misconfig")
> 
> Since it is possible to get into the emulator on any instruction that
> induces an EPT violation, we'd wind up looping when we believe the
> instruction needs to exit to L1 (TLB flush, resume guest, hit the same
> EPT violation. Rinse, wash, repeat).

kvm_get_linear_rip() doesn't walk any page tables, it simply accounts for a
non-zero CS.base when !64-bit mode.  If we really wanted to, this could use
the emulation context's cached _eip, but I don't see any value in that since
both GUEST_CS_* and GUEST_RIP will already be cached by KVM.

unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu)
{
        if (is_64_bit_mode(vcpu))
                return kvm_rip_read(vcpu);
        return (u32)(get_segment_base(vcpu, VCPU_SREG_CS) +
                     kvm_rip_read(vcpu));
}
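
I.e. the tail end of vmx_check_intercept() would become something like
this (untested):

	vmx_flush_tlb_gva(vcpu, kvm_get_linear_rip(vcpu));
	return X86EMUL_RETRY_INSTR;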



Oliver Upton Aug. 14, 2020, 2:56 p.m. UTC | #6
On Thu, Aug 13, 2020 at 6:59 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Thu, Aug 13, 2020 at 03:44:08PM -0500, Oliver Upton wrote:
> > On Thu, Aug 13, 2020 at 12:03 PM Sean Christopherson
> > > > +      *
> > > > +      * Rather than synthesizing a VM-exit into L1 for every possible
> > > > +      * instruction just flush the TLB, resume L2, and let hardware generate
> > > > +      * the appropriate VM-exit.
> > > > +      */
> > > > +     vmx_flush_tlb_gva(vcpu, kvm_rip_read(vcpu));
> > >
> > > This is wrong, it should flush kvm_get_linear_rip(vcpu).
> > >
> >
> > I do not believe that the aim of this patch will work anymore, since:
> >
> > 1dbf5d68af6f ("KVM: VMX: Add guest physical address check in EPT
> > violation and misconfig")
> >
> > Since it is possible to get into the emulator on any instruction that
> > induces an EPT violation, we'd wind up looping when we believe the
> > instruction needs to exit to L1 (TLB flush, resume guest, hit the same
> > EPT violation. Rinse, wash, repeat).
>
> kvm_get_linear_rip() doesn't walk any page tables, it simply accounts for a
> non-zero CS.base when !64-bit mode.  If we really wanted to, this could use
> the emulation context's cached _eip, but I don't see any value in that since
> both GUEST_CS_* and GUEST_RIP will already be cached by KVM.
>
> unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu)
> {
>         if (is_64_bit_mode(vcpu))
>                 return kvm_rip_read(vcpu);
>         return (u32)(get_segment_base(vcpu, VCPU_SREG_CS) +
>                      kvm_rip_read(vcpu));
> }

Sorry, I was a tad imprecise. I haven't any issues with your
suggestion. Rather, I believe that my overall patch is ineffective.

Suppose we had an EPT violation for a GPA that exceeded the guest's
MAXPHYADDR. Let's also say that the EPT violation occurred on the
memory operand of an LMSW instruction. Per the aforementioned patch,
we will dive into the emulator. Since we check intercepts before
reading the operand out of memory, we will fall through to the default
case, set intercepted = true, flush TLB and resume.

Sean Christopherson Aug. 17, 2020, 5:34 p.m. UTC | #7
On Fri, Aug 14, 2020 at 09:56:33AM -0500, Oliver Upton wrote:
> On Thu, Aug 13, 2020 at 6:59 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > [...]
> 
> Sorry, I was a tad imprecise. I haven't any issues with your
> suggestion. Rather, I believe that my overall patch is ineffective.
> 
> Suppose we had an EPT violation for a GPA that exceeded the guest's
> MAXPHYADDR. Let's also say that the EPT violation occurred on the
> memory operand of an LMSW instruction. Per the aforementioned patch,
> we will dive into the emulator. Since we check intercepts before
> reading the operand out of memory, we will fall through to the default
> case, set intercepted = true, flush TLB and resume.

Hrm.  The new invocation of kvm_emulate_instruction() feels incomplete from
the perspective that it doesn't have a flag that states "this should always
cause a #PF, freak out if it doesn't".  Such a flag would allow keeping
this approach, as the interception logic could bail early when the flag
is set, knowing that the emulator will inject a #PF or exit to userspace
(or something like that).
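
E.g. something like the below (completely untested; the flag name and
the vcpu->arch field are made up, and the plumbing of emulation_type
into the intercept hook is handwaved):

	/* Hypothetical flag: caller asserts emulation must fault. */
	#define EMULTYPE_EXPECT_FAULT	(1 << 7)	/* pick a free bit */

	...

	if (!intercepted)
		return X86EMUL_CONTINUE;

	/*
	 * If the caller knows the instruction must fault, e.g. the
	 * GPA > guest.MAXPHYADDR case, don't flush-and-retry; let the
	 * emulator inject the #PF (or exit to userspace) instead of
	 * looping on the same EPT violation.
	 */
	if (vcpu->arch.expect_fault_on_emul)	/* hypothetical plumbing */
		return X86EMUL_UNHANDLEABLE;

	vmx_flush_tlb_gva(vcpu, kvm_get_linear_rip(vcpu));
	return X86EMUL_RETRY_INSTR;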

Patch

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index d0e2825ae617..6e56e7a29ba1 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -5812,6 +5812,8 @@  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
 	}
 	if (rc == X86EMUL_INTERCEPTED)
 		return EMULATION_INTERCEPTED;
+	if (rc == X86EMUL_RETRY_INSTR)
+		return EMULATION_RETRY_INSTR;
 
 	if (rc == X86EMUL_CONTINUE)
 		writeback_registers(ctxt);
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 43c93ffa76ed..5bfab8d65cd1 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -496,6 +496,7 @@  bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
 #define EMULATION_OK 0
 #define EMULATION_RESTART 1
 #define EMULATION_INTERCEPTED 2
+#define EMULATION_RETRY_INSTR 3
 void init_decode_cache(struct x86_emulate_ctxt *ctxt);
 int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
 int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 08e26a9518c2..ebfafd7837ba 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7329,12 +7329,11 @@  static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
 	to_vmx(vcpu)->req_immediate_exit = true;
 }
 
-static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
-				  struct x86_instruction_info *info)
+static bool vmx_check_intercept_io(struct kvm_vcpu *vcpu,
+				   struct x86_instruction_info *info)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	unsigned short port;
-	bool intercept;
 	int size;
 
 	if (info->intercept == x86_intercept_in ||
@@ -7354,13 +7353,10 @@  static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
 	 * Otherwise, IO instruction VM-exits are controlled by the IO bitmaps.
 	 */
 	if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
-		intercept = nested_cpu_has(vmcs12,
-					   CPU_BASED_UNCOND_IO_EXITING);
-	else
-		intercept = nested_vmx_check_io_bitmaps(vcpu, port, size);
+		return nested_cpu_has(vmcs12,
+				      CPU_BASED_UNCOND_IO_EXITING);
 
-	/* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED.  */
-	return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
+	return nested_vmx_check_io_bitmaps(vcpu, port, size);
 }
 
 static int vmx_check_intercept(struct kvm_vcpu *vcpu,
@@ -7369,6 +7365,7 @@  static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			       struct x86_exception *exception)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	bool intercepted;
 
 	switch (info->intercept) {
 	/*
@@ -7381,13 +7378,27 @@  static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 			exception->error_code_valid = false;
 			return X86EMUL_PROPAGATE_FAULT;
 		}
+
+		intercepted = nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
+
+		/*
+		 * RDTSCP could be emulated on a CPU which doesn't support it.
+		 * As such, flushing the TLB and resuming L2 will result in
+		 * another #UD rather than a VM-exit to reflect into L1.
+		 * Instead, synthesize the VM-exit here.
+		 */
+		if (intercepted) {
+			nested_vmx_vmexit(vcpu, EXIT_REASON_RDTSCP, 0, 0);
+			return X86EMUL_INTERCEPTED;
+		}
 		break;
 
 	case x86_intercept_in:
 	case x86_intercept_ins:
 	case x86_intercept_out:
 	case x86_intercept_outs:
-		return vmx_check_intercept_io(vcpu, info);
+		intercepted = vmx_check_intercept_io(vcpu, info);
+		break;
 
 	case x86_intercept_lgdt:
 	case x86_intercept_lidt:
@@ -7397,18 +7408,41 @@  static int vmx_check_intercept(struct kvm_vcpu *vcpu,
 	case x86_intercept_sidt:
 	case x86_intercept_sldt:
 	case x86_intercept_str:
-		if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC))
-			return X86EMUL_CONTINUE;
-
-		/* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED.  */
+		intercepted = nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC);
 		break;
 
-	/* TODO: check more intercepts... */
+	/*
+	 * The dual-monitor treatment of SMM is not supported in nVMX. As such,
+	 * L0 will never handle the RSM instruction nor should it retry
+	 * instruction execution. Instead, a #UD should be injected into the
+	 * guest for the execution of RSM outside of SMM.
+	 */
+	case x86_intercept_rsm:
+		exception->vector = UD_VECTOR;
+		exception->error_code_valid = false;
+		return X86EMUL_PROPAGATE_FAULT;
+
 	default:
-		break;
+		intercepted = true;
 	}
 
-	return X86EMUL_UNHANDLEABLE;
+	if (!intercepted)
+		return X86EMUL_CONTINUE;
+
+	/*
+	 * The only uses of the emulator in VMX for instructions which may be
+	 * intercepted are port IO instructions, descriptor-table accesses, and
+	 * the RDTSCP instruction. As such, if the emulator has decoded an
+	 * instruction that is different from the VM-exit provided by hardware
+	 * it is likely that the TLB entry and page-table mapping for the
+	 * guest's RIP are out of sync.
+	 *
+	 * Rather than synthesizing a VM-exit into L1 for every possible
+	 * instruction just flush the TLB, resume L2, and let hardware generate
+	 * the appropriate VM-exit.
+	 */
+	vmx_flush_tlb_gva(vcpu, kvm_rip_read(vcpu));
+	return X86EMUL_RETRY_INSTR;
 }
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00c88c2f34e4..2ab47485100f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6967,7 +6967,7 @@  int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	r = x86_emulate_insn(ctxt);
 
-	if (r == EMULATION_INTERCEPTED)
+	if (r == EMULATION_INTERCEPTED || r == EMULATION_RETRY_INSTR)
 		return 1;
 
 	if (r == EMULATION_FAILED) {