
KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail

Message ID 20180719175608.20141-1-sean.j.christopherson@intel.com (mailing list archive)
State New, archived

Commit Message

Sean Christopherson July 19, 2018, 5:56 p.m. UTC
A VMEnter that VMFails (as opposed to VMExits) does not touch host
state beyond registers that are explicitly noted in the VMFail path,
e.g. EFLAGS.  Host state does not need to be loaded because VMFail
is only signaled for consistency checks that occur before the CPU
starts to load guest state, i.e. there is no need to restore any
state as nothing has been modified.  But in the case where a VMFail
is detected by hardware and not by KVM (due to deferring consistency
checks to hardware), KVM has already loaded some amount of guest
state.  Luckily, "loaded" only means loaded to KVM's software model,
i.e. vmcs01 has not been modified.  So, unwind our software model to
the pre-VMEntry host state.

Not restoring host state in this VMFail path leads to a variety of
failures because we end up with stale data in vcpu->arch, e.g. CR0,
CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
significant delta in the stale data is all but guaranteed to crash
L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.

An alternative to this "soft" reload would be to load host state from
vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
wildly inconsistent with respect to the VMX architecture, e.g. an L1
VMM with separate VMExit and VMFail paths would explode.

Note that this approach does not mean KVM is 100% accurate with
respect to VMX hardware behavior, even at an architectural level
(the exact order of consistency checks is microarchitecture specific).
But 100% emulation accuracy isn't our goal; rather, the goal is to be
consistent in the information delivered to L1, e.g. a VMExit should
not fall through VMENTER, and a VMFail should not jump to HOST_RIP.
Achieving perfect emulation for consistency checks is a fool's errand,
e.g. KVM would need to manually detect *every* VMFail before checking
and loading any guest state (a consistency check VMExit never takes
priority over a consistency check VMFail).  Even if KVM managed to
achieve perfection (and didn't crater performance in the process),
the honeymoon would only last until KVM encountered new hardware with
new consistency checks.

This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
context after VMLAUNCH/VMRESUME failure)", but retains the core
aspects of that patch, just in an open coded form due to the need to
pull state from vmcs01 instead of vmcs12.  Restoring host state
resolves a variety of issues introduced by commit "4f350c6dbcb9
(kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
which remedied the incorrect behavior of treating VMFail like VMExit
but in doing so neglected to restore arch state that had been modified
prior to attempting nested VMEnter.

A sample failure that occurs due to stale vcpu.arch state is a fault
of some form while emulating an LGDT (due to emulated UMIP) from L1
after a failed VMEntry to L3, in this case when running the KVM
unit test test_tpr_threshold_values in L1.  L0 also hits a WARN in
this case due to a stale arch.cr4.UMIP.

L1:
  BUG: unable to handle kernel paging request at ffffc90000663b9e
  PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
  Oops: 0009 [#1] SMP
  CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:native_load_gdt+0x0/0x10

  ...

  Call Trace:
   load_fixmap_gdt+0x22/0x30
   __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
   vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
   nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
   vmx_handle_exit+0x246/0x15a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x600
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4f/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

L0:
  WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
  ...
  CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
  Hardware name: Intel Corporation Kabylake Client platform/KBL S
  RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]

  ...

  Call Trace:
   kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x5e0
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x49/0xf0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 153 insertions(+), 20 deletions(-)

Comments

Jim Mattson July 19, 2018, 10:13 p.m. UTC | #1
On Thu, Jul 19, 2018 at 10:56 AM, Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
> A VMEnter that VMFails (as opposed to VMExits) does not touch host
> state beyond registers that are explicitly noted in the VMFail path,
> e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> is only signaled for consistency checks that occur before the CPU
> starts to load guest state, i.e. there is no need to restore any
> state as nothing has been modified.  But in the case where a VMFail
> is detected by hardware and not by KVM (due to deferring consistency
> checks to hardware), KVM has already loaded some amount of guest
> state.  Luckily, "loaded" only means loaded to KVM's software model,
> i.e. vmcs01 has not been modified.  So, unwind our software model to
> the pre-VMEntry host state.

Thank you, thank you, thank you. I've been hoping someone would take this on.

> Not restoring host state in this VMFail path leads to a variety of
> failures because we end up with stale data in vcpu->arch, e.g. CR0,
> CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> significant delta in the stale data is all but guaranteed to crash
> L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
>
> An alternative to this "soft" reload would be to load host state from
> vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> wildly inconsistent with respect to the VMX architecture, e.g. an L1
> VMM with separate VMExit and VMFail paths would explode.
>
> Note that this approach does not mean KVM is 100% accurate with
> respect to VMX hardware behavior, even at an architectural level
> (the exact order of consistency checks is microarchitecture specific)
> But 100% emulation accuracy isn't our goal, rather our goal is to be
> consistent in the information delivered to L1, e.g. a VMExit should
> not fall-through VMENTER, and a VMFail should not jump to HOST_RIP.
> Achieving perfect emulation for consistency checks is a fool's errand,
> e.g. KVM would need to manually detect *every* VMFail before checking
> and loading any guest state (a consistency check VMExit never takes
> priority over a consistency check VMFail).  Even if KVM managed to
> achieve perfection (and didn't crater performance in the process),
> the honeymoon would only last until KVM encountered new hardware with
> new consistency checks.

It should be possible to check for every VMFail in software, at least
for the hardware that's emulated. When new hardware introduces new
consistency checks, presumably the new checks involve new VMCS fields
or new VM-execution controls that kvm doesn't yet emulate. Part of
emulating the new hardware should include emulating the new checks.

Moreover, kvm has to be able to check for every VMfail in software or
risk priority inversion. For example, when processing the vmcs12
VM-entry MSR-load list, kvm should not emulate a VM-exit for "VM-entry
failure due to MSR loading" until it has verified that none of the
control-field checks that were deferred to hardware would have
resulted in a VMFail.
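
Roughly, the ordering would need to look something like the sketch
below, where nested_vmx_check_all_controls() is a hypothetical helper
standing in for the full set of control-field checks (including the
ones KVM currently defers to hardware); the other names are the
existing vmx.c helpers:

  u32 msr_entry_idx;

  /*
   * Sketch only: finish *all* control-field consistency checks before
   * touching the vmcs12 VM-entry MSR-load list, so that a VMFail is
   * never masked by a "VM-entry failure due to MSR loading" VMExit.
   */
  if (nested_vmx_check_all_controls(vcpu, vmcs12)) {
          nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
          return 1;
  }

  msr_entry_idx = nested_vmx_load_msr(vcpu,
                                      vmcs12->vm_entry_msr_load_addr,
                                      vmcs12->vm_entry_msr_load_count);
  if (msr_entry_idx) {
          nested_vmx_entry_failure(vcpu, vmcs12,
                                   EXIT_REASON_MSR_LOAD_FAIL, msr_entry_idx);
          return 1;
  }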

Of course, these concerns in no way detract from this welcome change.

> This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> context after VMLAUNCH/VMRESUME failure)", but retains the core
> aspects of that patch, just in an open coded form due to the need to
> pull state from vmcs01 instead of vmcs12.  Restoring host state
> resolves a variety of issues introduced by commit "4f350c6dbcb9
> (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> which remedied the incorrect behavior of treating VMFail like VMExit
> but in doing so neglected to restore arch state that had been modified
> prior to attempting nested VMEnter.
>
> A sample failure that occurs due to stale vcpu.arch state is a fault
> of some form while emulating an LGDT (due to emulated UMIP) from L1
> L1 after a failed VMEntry to L3, in this case when running the KVM
> unit test test_tpr_threshold_values in L1.  L0 also hits a WARN in
> this case due to a stale arch.cr4.UMIP.
>
> L1:
>   BUG: unable to handle kernel paging request at ffffc90000663b9e
>   PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
>   Oops: 0009 [#1] SMP
>   CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
>   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>   RIP: 0010:native_load_gdt+0x0/0x10
>
>   ...
>
>   Call Trace:
>    load_fixmap_gdt+0x22/0x30
>    __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
>    vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
>    nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
>    vmx_handle_exit+0x246/0x15a0 [kvm_intel]
>    kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
>    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>    do_vfs_ioctl+0x9f/0x600
>    ksys_ioctl+0x66/0x70
>    __x64_sys_ioctl+0x16/0x20
>    do_syscall_64+0x4f/0x100
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> L0:
>   WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
>   ...
>   CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
>   Hardware name: Intel Corporation Kabylake Client platform/KBL S
>   RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
>
>   ...
>
>   Call Trace:
>    kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
>    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>    do_vfs_ioctl+0x9f/0x5e0
>    ksys_ioctl+0x66/0x70
>    __x64_sys_ioctl+0x16/0x20
>    do_syscall_64+0x49/0xf0
>    entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> Cc: Jim Mattson <jmattson@google.com>
> Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 153 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 1689f433f3a0..eb5f49d2c46d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>         kvm_clear_interrupt_queue(vcpu);
>  }
>
> -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> -                       struct vmcs12 *vmcs12)
> -{
> -       u32 entry_failure_code;
> -
> -       nested_ept_uninit_mmu_context(vcpu);
> -
> -       /*
> -        * Only PDPTE load can fail as the value of cr3 was checked on entry and
> -        * couldn't have changed.
> -        */
> -       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> -               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> -
> -       if (!enable_ept)
> -               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> -}
> -
>  /*
>   * A part of what we need to when the nested L2 guest exits and we want to
>   * run its L1 parent, is to reset L1's guest state to the host state specified
> @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>                                    struct vmcs12 *vmcs12)
>  {
>         struct kvm_segment seg;
> +       u32 entry_failure_code;
>
>         if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>                 vcpu->arch.efer = vmcs12->host_ia32_efer;
> @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>         vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
>         vmx_set_cr4(vcpu, vmcs12->host_cr4);
>
> -       load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +       nested_ept_uninit_mmu_context(vcpu);
> +
> +       /*
> +        * Only PDPTE load can fail as the value of cr3 was checked on entry and
> +        * couldn't have changed.
> +        */
> +       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> +               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> +
> +       if (!enable_ept)
> +               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
>
>         /*
>          * If vmcs01 don't use VPID, CPU flushes TLB on every
> @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>                 nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
>  }
>
> +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
> +{
> +       struct shared_msr_entry *efer_msr;
> +       unsigned int i;
> +
> +       if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
> +               return vmcs_read64(GUEST_IA32_EFER);
> +
> +       if (cpu_has_load_ia32_efer)
> +               return host_efer;
> +
> +       for (i = 0; i < vmx->msr_autoload.nr; ++i) {
> +               if (vmx->msr_autoload.guest[i].index == MSR_EFER)
> +                       return vmx->msr_autoload.guest[i].value;
> +       }
> +
> +       efer_msr = find_msr_entry(vmx, MSR_EFER);
> +       if (efer_msr)
> +               return efer_msr->data;
> +
> +       return host_efer;
> +}
> +
> +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> +{
> +       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +       struct vcpu_vmx *vmx = to_vmx(vcpu);
> +       struct vmx_msr_entry g, h;
> +       struct msr_data msr;
> +       gpa_t gpa;
> +       u32 i, j;
> +
> +       vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
> +
> +       if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
> +               /*
> +                * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
> +                * as vmcs01.GUEST_DR7 contains a userspace defined value
> +                * and vcpu->arch.dr7 is not squirreled away before the
> +                * nested VMENTER (not worth adding a variable in nested_vmx).
> +                */
> +               if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
> +                       kvm_set_dr(vcpu, 7, DR7_FIXED_1);
> +               else
> +                       WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
> +       }
> +
> +       /*
> +        * Note that calling vmx_set_{efer,cr0,cr4} is important as they
> +        * handle a variety of side effects to KVM's software model.
> +        */
> +       vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
> +
> +       vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> +       vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
> +
> +       vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> +       vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
> +
> +       nested_ept_uninit_mmu_context(vcpu);
> +       vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> +       __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> +
> +       /*
> +        * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> +        * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> +        * VMFail, like everything else we just need to ensure our
> +        * software model is up-to-date.
> +        */
> +       ept_save_pdptrs(vcpu);
> +
> +       kvm_mmu_reset_context(vcpu);
> +
> +       if (cpu_has_vmx_msr_bitmap())
> +               vmx_update_msr_bitmap(vcpu);
> +
> +       /*
> +        * This nasty bit of open coding is a compromise between blindly
> +        * loading L1's MSRs using the exit load lists (incorrect emulation
> +        * of VMFail), leaving the nested VM's MSRs in the software model
> +        * (incorrect behavior) and snapshotting the modified MSRs (too
> +        * expensive since the lists are unbound by hardware).  For each
> +        * MSR that was (prematurely) loaded from the nested VMEntry load
> +        * list, reload it from the exit load list if it exists and differs
> +        * from the guest value.  The intent is to stuff host state as
> +        * silently as possible, not to fully process the exit load list.
> +        */

Though the MSR lists are unbounded by hardware, the number of MSRs
emulated by kvm is bounded and small. Snapshotting the modified MSRs
shouldn't be that expensive, even if the same small set of MSRs occurs
over and over again in vmcs12's VM-entry MSR-load list.
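
As a rough illustration of what that could look like (nothing below is
actual vmx.c code; the helper name and the bound are made up for the
sketch), the snapshot could be keyed by MSR index so duplicates in the
guest's list collapse into a single entry:

  /*
   * Hypothetical sketch: record the pre-VMEntry value of each MSR
   * touched by the vmcs12 VM-entry load list so a hardware-detected
   * VMFail can restore them exactly.  Duplicate indices are collapsed,
   * so the snapshot is bounded by the number of MSRs KVM emulates, not
   * by the length of the guest's list.
   */
  #define MAX_SNAPSHOT_MSRS 32                    /* made-up bound */

  struct msr_snapshot {
          u32 nr;
          struct { u32 index; u64 value; } ent[MAX_SNAPSHOT_MSRS];
  };

  static int snapshot_msr(struct kvm_vcpu *vcpu, struct msr_snapshot *snap,
                          u32 index)
  {
          struct msr_data msr = { .index = index, .host_initiated = false };
          u32 i;

          for (i = 0; i < snap->nr; i++)
                  if (snap->ent[i].index == index)
                          return 0;       /* already recorded */

          if (snap->nr == MAX_SNAPSHOT_MSRS || kvm_get_msr(vcpu, &msr))
                  return -EINVAL;

          snap->ent[snap->nr].index = index;
          snap->ent[snap->nr].value = msr.data;
          snap->nr++;
          return 0;
  }

On a hardware-detected VMFail the snapshot would then simply be replayed
via kvm_set_msr(), instead of diffing the two vmcs12 lists as below.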

> +       msr.host_initiated = false;
> +       for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> +               gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> +               if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> +                       pr_debug_ratelimited(
> +                               "%s read MSR index failed (%u, 0x%08llx)\n",
> +                               __func__, i, gpa);
> +                       goto vmabort;
> +               }
> +
> +               for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> +                       gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> +                       if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> +                               pr_debug_ratelimited(
> +                                       "%s read MSR failed (%u, 0x%08llx)\n",
> +                                       __func__, j, gpa);
> +                               goto vmabort;
> +                       }
> +                       if (h.index != g.index)
> +                               continue;
> +                       if (h.value == g.value)
> +                               break;
> +
> +                       if (nested_vmx_load_msr_check(vcpu, &h)) {
> +                               pr_debug_ratelimited(
> +                                       "%s check failed (%u, 0x%x, 0x%x)\n",
> +                                       __func__, j, h.index, h.reserved);
> +                               goto vmabort;
> +                       }
> +
> +                       msr.index = h.index;
> +                       msr.data = h.value;
> +                       if (kvm_set_msr(vcpu, &msr)) {
> +                               pr_debug_ratelimited(
> +                                       "%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> +                                       __func__, j, h.index, h.value);
> +                               goto vmabort;
> +                       }
> +               }
> +       }
> +
> +       return;
> +
> +vmabort:
> +       nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> +}
> +
>  /*
>   * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
>   * and modify vmcs12 to make it see what it would expect to see there if
> @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
>          */
>         nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>
> -       load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +       /*
> +        * Restore L1's host state to KVM's software model.  We're here
> +        * because a consistency check was caught by hardware, which
> +        * means some amount of guest state has been propagated to KVM's
> +        * model and needs to be unwound to the host's state.
> +        */
> +       nested_vmx_restore_host_state(vcpu);
>
>         /*
>          * The emulated instruction was already skipped in
> --
> 2.18.0
>
Sean Christopherson July 20, 2018, 3:09 p.m. UTC | #2
On Thu, 2018-07-19 at 15:13 -0700, Jim Mattson wrote:
> On Thu, Jul 19, 2018 at 10:56 AM, Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > 
> > A VMEnter that VMFails (as opposed to VMExits) does not touch host
> > state beyond registers that are explicitly noted in the VMFail path,
> > e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> > is only signaled for consistency checks that occur before the CPU
> > starts to load guest state, i.e. there is no need to restore any
> > state as nothing has been modified.  But in the case where a VMFail
> > is detected by hardware and not by KVM (due to deferring consistency
> > checks to hardware), KVM has already loaded some amount of guest
> > state.  Luckily, "loaded" only means loaded to KVM's software model,
> > i.e. vmcs01 has not been modified.  So, unwind our software model to
> > the pre-VMEntry host state.
> Thank you, thank you, thank you. I've been hoping someone would take this on.
> 
> > 
> > Not restoring host state in this VMFail path leads to a variety of
> > failures because we end up with stale data in vcpu->arch, e.g. CR0,
> > CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> > significant delta in the stale data is all but guaranteed to crash
> > L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
> > 
> > An alternative to this "soft" reload would be to load host state from
> > vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> > wildly inconsistent with respect to the VMX architecture, e.g. an L1
> > VMM with separate VMExit and VMFail paths would explode.
> > 
> > Note that this approach does not mean KVM is 100% accurate with
> > respect to VMX hardware behavior, even at an architectural level
> > (the exact order of consistency checks is microarchitecture specific)
> > But 100% emulation accuracy isn't our goal, rather our goal is to be
> > consistent in the information delivered to L1, e.g. a VMExit should
> > not fall-through VMENTER, and a VMFail should not jump to HOST_RIP.
> > Achieving perfect emulation for consistency checks is a fool's errand,
> > e.g. KVM would need to manually detect *every* VMFail before checking
> > and loading any guest state (a consistency check VMExit never takes
> > priority over a consistency check VMFail).  Even if KVM managed to
> > achieve perfection (and didn't crater performance in the process),
> > the honeymoon would only last until KVM encountered new hardware with
> > new consistency checks.
> It should be possible to check for every VMFail in software, at least
> for the hardware that's emulated. When new hardware introduces new
> consistency checks, presumably the new checks involve new VMCS fields
> or new VM-execution controls that kvm doesn't yet emulate. Part of
> emulating the new hardware should include emulating the new checks.

Somehow I glossed over the fact that KVM would fail the entry for any
unknown controls.  This blurb can be shortened to:

  Note that this approach does not mean KVM is 100% accurate with
  respect to VMX hardware behavior, even at an architectural level
  (the exact order of consistency checks is microarchitecture specific).
  But 100% emulation accuracy isn't the goal (with this patch), rather
  the goal is to be consistent in the information delivered to L1, e.g.
  a VMExit should not fall-through VMENTER, and a VMFail should not jump
  to HOST_RIP.

> Moreover, kvm has to be able to check for every VMfail in software or
> risk priority inversion. For example, when processing the vmcs12
> VM-entry MSR-load list, kvm should not emulate a VM-exit for "VM-entry
> failure due to MSR loading" until it has verified that none of the
> control-field checks that were deferred to hardware would have
> resulted in a VMFail.
> 
> Of course, these concerns in no way detract from this welcome change.
> 
> > 
> > This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> > context after VMLAUNCH/VMRESUME failure)", but retains the core
> > aspects of that patch, just in an open coded form due to the need to
> > pull state from vmcs01 instead of vmcs12.  Restoring host state
> > resolves a variety of issues introduced by commit "4f350c6dbcb9
> > (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> > which remedied the incorrect behavior of treating VMFail like VMExit
> > but in doing so neglected to restore arch state that had been modified
> > prior to attempting nested VMEnter.
> > 
> > A sample failure that occurs due to stale vcpu.arch state is a fault
> > of some form while emulating an LGDT (due to emulated UMIP) from L1
> > L1 after a failed VMEntry to L3, in this case when running the KVM
> > unit test test_tpr_threshold_values in L1.  L0 also hits a WARN in
> > this case due to a stale arch.cr4.UMIP.
> > 
> > L1:
> >   BUG: unable to handle kernel paging request at ffffc90000663b9e
> >   PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
> >   Oops: 0009 [#1] SMP
> >   CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
> >   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >   RIP: 0010:native_load_gdt+0x0/0x10
> > 
> >   ...
> > 
> >   Call Trace:
> >    load_fixmap_gdt+0x22/0x30
> >    __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
> >    vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
> >    nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
> >    vmx_handle_exit+0x246/0x15a0 [kvm_intel]
> >    kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
> >    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >    do_vfs_ioctl+0x9f/0x600
> >    ksys_ioctl+0x66/0x70
> >    __x64_sys_ioctl+0x16/0x20
> >    do_syscall_64+0x4f/0x100
> >    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > L0:
> >   WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
> >   ...
> >   CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
> >   Hardware name: Intel Corporation Kabylake Client platform/KBL S
> >   RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
> > 
> >   ...
> > 
> >   Call Trace:
> >    kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
> >    kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >    do_vfs_ioctl+0x9f/0x5e0
> >    ksys_ioctl+0x66/0x70
> >    __x64_sys_ioctl+0x16/0x20
> >    do_syscall_64+0x49/0xf0
> >    entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> > Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> > Cc: Jim Mattson <jmattson@google.com>
> > Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >  arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 153 insertions(+), 20 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 1689f433f3a0..eb5f49d2c46d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >         kvm_clear_interrupt_queue(vcpu);
> >  }
> > 
> > -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> > -                       struct vmcs12 *vmcs12)
> > -{
> > -       u32 entry_failure_code;
> > -
> > -       nested_ept_uninit_mmu_context(vcpu);
> > -
> > -       /*
> > -        * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > -        * couldn't have changed.
> > -        */
> > -       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > -               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > -
> > -       if (!enable_ept)
> > -               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> > -}
> > -
> >  /*
> >   * A part of what we need to when the nested L2 guest exits and we want to
> >   * run its L1 parent, is to reset L1's guest state to the host state specified
> > @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >                                    struct vmcs12 *vmcs12)
> >  {
> >         struct kvm_segment seg;
> > +       u32 entry_failure_code;
> > 
> >         if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> >                 vcpu->arch.efer = vmcs12->host_ia32_efer;
> > @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >         vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> >         vmx_set_cr4(vcpu, vmcs12->host_cr4);
> > 
> > -       load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +       nested_ept_uninit_mmu_context(vcpu);
> > +
> > +       /*
> > +        * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > +        * couldn't have changed.
> > +        */
> > +       if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > +               nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > +
> > +       if (!enable_ept)
> > +               vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> > 
> >         /*
> >          * If vmcs01 don't use VPID, CPU flushes TLB on every
> > @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >                 nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> >  }
> > 
> > +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
> > +{
> > +       struct shared_msr_entry *efer_msr;
> > +       unsigned int i;
> > +
> > +       if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
> > +               return vmcs_read64(GUEST_IA32_EFER);
> > +
> > +       if (cpu_has_load_ia32_efer)
> > +               return host_efer;
> > +
> > +       for (i = 0; i < vmx->msr_autoload.nr; ++i) {
> > +               if (vmx->msr_autoload.guest[i].index == MSR_EFER)
> > +                       return vmx->msr_autoload.guest[i].value;
> > +       }
> > +
> > +       efer_msr = find_msr_entry(vmx, MSR_EFER);
> > +       if (efer_msr)
> > +               return efer_msr->data;
> > +
> > +       return host_efer;
> > +}
> > +
> > +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> > +{
> > +       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +       struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +       struct vmx_msr_entry g, h;
> > +       struct msr_data msr;
> > +       gpa_t gpa;
> > +       u32 i, j;
> > +
> > +       vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
> > +
> > +       if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
> > +               /*
> > +                * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
> > +                * as vmcs01.GUEST_DR7 contains a userspace defined value
> > +                * and vcpu->arch.dr7 is not squirreled away before the
> > +                * nested VMENTER (not worth adding a variable in nested_vmx).
> > +                */
> > +               if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
> > +                       kvm_set_dr(vcpu, 7, DR7_FIXED_1);
> > +               else
> > +                       WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
> > +       }
> > +
> > +       /*
> > +        * Note that calling vmx_set_{efer,cr0,cr4} is important as they
> > +        * handle a variety of side effects to KVM's software model.
> > +        */
> > +       vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
> > +
> > +       vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> > +       vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
> > +
> > +       vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> > +       vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
> > +
> > +       nested_ept_uninit_mmu_context(vcpu);
> > +       vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> > +       __set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> > +
> > +       /*
> > +        * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> > +        * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> > +        * VMFail, like everything else we just need to ensure our
> > +        * software model is up-to-date.
> > +        */
> > +       ept_save_pdptrs(vcpu);
> > +
> > +       kvm_mmu_reset_context(vcpu);
> > +
> > +       if (cpu_has_vmx_msr_bitmap())
> > +               vmx_update_msr_bitmap(vcpu);
> > +
> > +       /*
> > +        * This nasty bit of open coding is a compromise between blindly
> > +        * loading L1's MSRs using the exit load lists (incorrect emulation
> > +        * of VMFail), leaving the nested VM's MSRs in the software model
> > +        * (incorrect behavior) and snapshotting the modified MSRs (too
> > +        * expensive since the lists are unbound by hardware).  For each
> > +        * MSR that was (prematurely) loaded from the nested VMEntry load
> > +        * list, reload it from the exit load list if it exists and differs
> > +        * from the guest value.  The intent is to stuff host state as
> > +        * silently as possible, not to fully process the exit load list.
> > +        */
> Though the MSR lists are unbounded by hardware, the number of MSRs
> emulated by kvm is bounded and small. Snapshotting the modified MSRs
> shouldn't be that expensive, even if the same small set of MSRs occurs
> over and over again in vmcs12's VM-entry MSR-load list.
> 
> > 
> > +       msr.host_initiated = false;
> > +       for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> > +               gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> > +               if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> > +                       pr_debug_ratelimited(
> > +                               "%s read MSR index failed (%u, 0x%08llx)\n",
> > +                               __func__, i, gpa);
> > +                       goto vmabort;
> > +               }
> > +
> > +               for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> > +                       gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> > +                       if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> > +                               pr_debug_ratelimited(
> > +                                       "%s read MSR failed (%u, 0x%08llx)\n",
> > +                                       __func__, j, gpa);
> > +                               goto vmabort;
> > +                       }
> > +                       if (h.index != g.index)
> > +                               continue;
> > +                       if (h.value == g.value)
> > +                               break;
> > +
> > +                       if (nested_vmx_load_msr_check(vcpu, &h)) {
> > +                               pr_debug_ratelimited(
> > +                                       "%s check failed (%u, 0x%x, 0x%x)\n",
> > +                                       __func__, j, h.index, h.reserved);
> > +                               goto vmabort;
> > +                       }
> > +
> > +                       msr.index = h.index;
> > +                       msr.data = h.value;
> > +                       if (kvm_set_msr(vcpu, &msr)) {
> > +                               pr_debug_ratelimited(
> > +                                       "%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> > +                                       __func__, j, h.index, h.value);
> > +                               goto vmabort;
> > +                       }
> > +               }
> > +       }
> > +
> > +       return;
> > +
> > +vmabort:
> > +       nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> > +}
> > +
> >  /*
> >   * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
> >   * and modify vmcs12 to make it see what it would expect to see there if
> > @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
> >          */
> >         nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
> > 
> > -       load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +       /*
> > +        * Restore L1's host state to KVM's software model.  We're here
> > +        * because a consistency check was caught by hardware, which
> > +        * means some amount of guest state has been propagated to KVM's
> > +        * model and needs to be unwound to the host's state.
> > +        */
> > +       nested_vmx_restore_host_state(vcpu);
> > 
> >         /*
> >          * The emulated instruction was already skipped in
> > --
> > 2.18.0
> >
Krish Sadhukhan July 23, 2018, 10:26 p.m. UTC | #3
On 07/19/2018 10:56 AM, Sean Christopherson wrote:
> A VMEnter that VMFails (as opposed to VMExits) does not touch host
> state beyond registers that are explicitly noted in the VMFail path,
> e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> is only signaled for consistency checks that occur before the CPU
> starts to load guest state, i.e. there is no need to restore any
> state as nothing has been modified.  But in the case where a VMFail
> is detected by hardware and not by KVM (due to deferring consistency
> checks to hardware), KVM has already loaded some amount of guest
> state.  Luckily, "loaded" only means loaded to KVM's software model,
> i.e. vmcs01 has not been modified.  So, unwind our software model to
> the pre-VMEntry host state.
>
> Not restoring host state in this VMFail path leads to a variety of
> failures because we end up with stale data in vcpu->arch, e.g. CR0,
> CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> significant delta in the stale data is all but guaranteed to crash
> L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
>
> An alternative to this "soft" reload would be to load host state from
> vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> wildly inconsistent with respect to the VMX architecture, e.g. an L1
> VMM with separate VMExit and VMFail paths would explode.
>
> Note that this approach does not mean KVM is 100% accurate with
> respect to VMX hardware behavior, even at an architectural level
> (the exact order of consistency checks is microarchitecture specific)
> But 100% emulation accuracy isn't our goal, rather our goal is to be
> consistent in the information delivered to L1, e.g. a VMExit should
> not fall-through VMENTER, and a VMFail should not jump to HOST_RIP.
> Achieving perfect emulation for consistency checks is a fool's errand,
> e.g. KVM would need to manually detect *every* VMFail before checking
> and loading any guest state (a consistency check VMExit never takes
> priority over a consistency check VMFail).  Even if KVM managed to
> achieve perfection (and didn't crater performance in the process),
> the honeymoon would only last until KVM encountered new hardware with
> new consistency checks.
>
> This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> context after VMLAUNCH/VMRESUME failure)", but retains the core
> aspects of that patch, just in an open coded form due to the need to
> pull state from vmcs01 instead of vmcs12.  Restoring host state
> resolves a variety of issues introduced by commit "4f350c6dbcb9
> (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> which remedied the incorrect behavior of treating VMFail like VMExit
> but in doing so neglected to restore arch state that had been modified
> prior to attempting nested VMEnter.
>
> A sample failure that occurs due to stale vcpu.arch state is a fault
> of some form while emulating an LGDT (due to emulated UMIP) from L1
> L1 after a failed VMEntry to L3, in this case when running the KVM

Nit: "L1" appears twice here.

> unit test test_tpr_threshold_values in L1.  L0 also hits a WARN in
> this case due to a stale arch.cr4.UMIP.
>
> L1:
>    BUG: unable to handle kernel paging request at ffffc90000663b9e
>    PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
>    Oops: 0009 [#1] SMP
>    CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
>    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>    RIP: 0010:native_load_gdt+0x0/0x10
>
>    ...
>
>    Call Trace:
>     load_fixmap_gdt+0x22/0x30
>     __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
>     vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
>     nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
>     vmx_handle_exit+0x246/0x15a0 [kvm_intel]
>     kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
>     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>     do_vfs_ioctl+0x9f/0x600
>     ksys_ioctl+0x66/0x70
>     __x64_sys_ioctl+0x16/0x20
>     do_syscall_64+0x4f/0x100
>     entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> L0:
>    WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
>    ...
>    CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
>    Hardware name: Intel Corporation Kabylake Client platform/KBL S
>    RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
>
>    ...
>
>    Call Trace:
>     kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
>     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>     do_vfs_ioctl+0x9f/0x5e0
>     ksys_ioctl+0x66/0x70
>     __x64_sys_ioctl+0x16/0x20
>     do_syscall_64+0x49/0xf0
>     entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> Cc: Jim Mattson <jmattson@google.com>
> Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>   arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
>   1 file changed, 153 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 1689f433f3a0..eb5f49d2c46d 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>   	kvm_clear_interrupt_queue(vcpu);
>   }
>   
> -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> -			struct vmcs12 *vmcs12)
> -{
> -	u32 entry_failure_code;
> -
> -	nested_ept_uninit_mmu_context(vcpu);
> -
> -	/*
> -	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> -	 * couldn't have changed.
> -	 */
> -	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> -		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> -
> -	if (!enable_ept)
> -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> -}
> -
>   /*
>    * A part of what we need to when the nested L2 guest exits and we want to
>    * run its L1 parent, is to reset L1's guest state to the host state specified
> @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>   				   struct vmcs12 *vmcs12)
>   {
>   	struct kvm_segment seg;
> +	u32 entry_failure_code;
>   
>   	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>   		vcpu->arch.efer = vmcs12->host_ia32_efer;
> @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>   	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
>   	vmx_set_cr4(vcpu, vmcs12->host_cr4);
>   
> -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +	nested_ept_uninit_mmu_context(vcpu);
> +
> +	/*
> +	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> +	 * couldn't have changed.
> +	 */
> +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> +
> +	if (!enable_ept)
> +		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
>   
>   	/*
>   	 * If vmcs01 don't use VPID, CPU flushes TLB on every
> @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>   		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
>   }
>   
> +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
> +{
> +	struct shared_msr_entry *efer_msr;
> +	unsigned int i;
> +
> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
> +		return vmcs_read64(GUEST_IA32_EFER);
> +
> +	if (cpu_has_load_ia32_efer)
> +		return host_efer;
> +
> +	for (i = 0; i < vmx->msr_autoload.nr; ++i) {
> +		if (vmx->msr_autoload.guest[i].index == MSR_EFER)
> +			return vmx->msr_autoload.guest[i].value;
> +	}
> +
> +	efer_msr = find_msr_entry(vmx, MSR_EFER);
> +	if (efer_msr)
> +		return efer_msr->data;
> +
> +	return host_efer;
> +}
> +
> +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)

For readability purposes, should we add some comments outlining the
prominent elements that we are restoring in this function?

> +{
> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	struct vmx_msr_entry g, h;
> +	struct msr_data msr;
> +	gpa_t gpa;
> +	u32 i, j;
> +
> +	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
> +
> +	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
> +		/*
> +		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
> +		 * as vmcs01.GUEST_DR7 contains a userspace defined value
> +		 * and vcpu->arch.dr7 is not squirreled away before the
> +		 * nested VMENTER (not worth adding a variable in nested_vmx).
> +		 */
> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
> +			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
> +		else
> +			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
> +	}
> +
> +	/*
> +	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
> +	 * handle a variety of side effects to KVM's software model.
> +	 */
> +	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
> +
> +	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> +	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
> +
> +	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> +	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
> +
> +	nested_ept_uninit_mmu_context(vcpu);
> +	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> +	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> +
> +	/*
> +	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> +	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> +	 * VMFail, like everything else we just need to ensure our
> +	 * software model is up-to-date.
> +	 */
> +	ept_save_pdptrs(vcpu);
> +
> +	kvm_mmu_reset_context(vcpu);
> +
> +	if (cpu_has_vmx_msr_bitmap())
> +		vmx_update_msr_bitmap(vcpu);
> +
> +	/*
> +	 * This nasty bit of open coding is a compromise between blindly
> +	 * loading L1's MSRs using the exit load lists (incorrect emulation
> +	 * of VMFail), leaving the nested VM's MSRs in the software model
> +	 * (incorrect behavior) and snapshotting the modified MSRs (too
> +	 * expensive since the lists are unbound by hardware).  For each
> +	 * MSR that was (prematurely) loaded from the nested VMEntry load
> +	 * list, reload it from the exit load list if it exists and differs
> +	 * from the guest value.  The intent is to stuff host state as
> +	 * silently as possible, not to fully process the exit load list.
> +	 */
> +	msr.host_initiated = false;
> +	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> +		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> +		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> +			pr_debug_ratelimited(
> +				"%s read MSR index failed (%u, 0x%08llx)\n",
> +				__func__, i, gpa);
> +			goto vmabort;
> +		}
> +
> +		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> +			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> +			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> +				pr_debug_ratelimited(
> +					"%s read MSR failed (%u, 0x%08llx)\n",
> +					__func__, j, gpa);
> +				goto vmabort;
> +			}
> +			if (h.index != g.index)
> +				continue;
> +			if (h.value == g.value)
> +				break;
> +
> +			if (nested_vmx_load_msr_check(vcpu, &h)) {
> +				pr_debug_ratelimited(
> +					"%s check failed (%u, 0x%x, 0x%x)\n",
> +					__func__, j, h.index, h.reserved);
> +				goto vmabort;
> +			}
> +
> +			msr.index = h.index;
> +			msr.data = h.value;
> +			if (kvm_set_msr(vcpu, &msr)) {
> +				pr_debug_ratelimited(
> +					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> +					__func__, j, h.index, h.value);
> +				goto vmabort;
> +			}
> +		}
> +	}
> +

Is it possible to re-use nested_vmx_load_msr() instead of duplicating
the same loops?

> +	return;
> +
> +vmabort:
> +	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> +}
> +
>   /*
>    * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
>    * and modify vmcs12 to make it see what it would expect to see there if
> @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
>   	 */
>   	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>   
> -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> +	/*
> +	 * Restore L1's host state to KVM's software model.  We're here
> +	 * because a consistency check was caught by hardware, which
> +	 * means some amount of guest state has been propagated to KVM's
> +	 * model and needs to be unwound to the host's state.
> +	 */
> +	nested_vmx_restore_host_state(vcpu);
>   
>   	/*
>   	 * The emulated instruction was already skipped in
Sean Christopherson July 24, 2018, 1:31 p.m. UTC | #4
On Mon, 2018-07-23 at 15:26 -0700, Krish Sadhukhan wrote:
> 
> On 07/19/2018 10:56 AM, Sean Christopherson wrote:
> > 
> > A VMEnter that VMFails (as opposed to VMExits) does not touch host
> > state beyond registers that are explicitly noted in the VMFail path,
> > e.g. EFLAGS.  Host state does not need to be loaded because VMFail
> > is only signaled for consistency checks that occur before the CPU
> > starts to load guest state, i.e. there is no need to restore any
> > state as nothing has been modified.  But in the case where a VMFail
> > is detected by hardware and not by KVM (due to deferring consistency
> > checks to hardware), KVM has already loaded some amount of guest
> > state.  Luckily, "loaded" only means loaded to KVM's software model,
> > i.e. vmcs01 has not been modified.  So, unwind our software model to
> > the pre-VMEntry host state.
> > 
> > Not restoring host state in this VMFail path leads to a variety of
> > failures because we end up with stale data in vcpu->arch, e.g. CR0,
> > CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
> > significant delta in the stale data is all but guaranteed to crash
> > L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
> > 
> > An alternative to this "soft" reload would be to load host state from
> > vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
> > wildly inconsistent with respect to the VMX architecture, e.g. an L1
> > VMM with separate VMExit and VMFail paths would explode.
> > 
> > Note that this approach does not mean KVM is 100% accurate with
> > respect to VMX hardware behavior, even at an architectural level
> > (the exact order of consistency checks is microarchitecture specific)
> > But 100% emulation accuracy isn't our goal, rather our goal is to be
> > consistent in the information delivered to L1, e.g. a VMExit should
> > not fall-through VMENTER, and a VMFail should not jump to HOST_RIP.
> > Achieving perfect emulation for consistency checks is a fool's errand,
> > e.g. KVM would need to manually detect *every* VMFail before checking
> > and loading any guest state (a consistency check VMExit never takes
> > priority over a consistency check VMFail).  Even if KVM managed to
> > achieve perfection (and didn't crater performance in the process),
> > the honeymoon would only last until KVM encountered new hardware with
> > new consistency checks.
> > 
> > This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
> > context after VMLAUNCH/VMRESUME failure)", but retains the core
> > aspects of that patch, just in an open coded form due to the need to
> > pull state from vmcs01 instead of vmcs12.  Restoring host state
> > resolves a variety of issues introduced by commit "4f350c6dbcb9
> > (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
> > which remedied the incorrect behavior of treating VMFail like VMExit
> > but in doing so neglected to restore arch state that had been modified
> > prior to attempting nested VMEnter.
> > 
> > A sample failure that occurs due to stale vcpu.arch state is a fault
> > of some form while emulating an LGDT (due to emulated UMIP) from L1
> > L1 after a failed VMEntry to L3, in this case when running the KVM
> Nit:  "L1"  is repeated twice here.
> 
> > 
> > unit test test_tpr_threshold_values in L1.  L0 also hits a WARN in
> > this case due to a stale arch.cr4.UMIP.
> > 
> > L1:
> >    BUG: unable to handle kernel paging request at ffffc90000663b9e
> >    PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
> >    Oops: 0009 [#1] SMP
> >    CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
> >    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> >    RIP: 0010:native_load_gdt+0x0/0x10
> > 
> >    ...
> > 
> >    Call Trace:
> >     load_fixmap_gdt+0x22/0x30
> >     __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
> >     vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
> >     nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
> >     vmx_handle_exit+0x246/0x15a0 [kvm_intel]
> >     kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
> >     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >     do_vfs_ioctl+0x9f/0x600
> >     ksys_ioctl+0x66/0x70
> >     __x64_sys_ioctl+0x16/0x20
> >     do_syscall_64+0x4f/0x100
> >     entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > L0:
> >    WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
> >    ...
> >    CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
> >    Hardware name: Intel Corporation Kabylake Client platform/KBL S
> >    RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
> > 
> >    ...
> > 
> >    Call Trace:
> >     kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
> >     kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
> >     do_vfs_ioctl+0x9f/0x5e0
> >     ksys_ioctl+0x66/0x70
> >     __x64_sys_ioctl+0x16/0x20
> >     do_syscall_64+0x49/0xf0
> >     entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
> > Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
> > Cc: Jim Mattson <jmattson@google.com>
> > Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Wanpeng Li <wanpeng.li@hotmail.com>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> >   arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
> >   1 file changed, 153 insertions(+), 20 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 1689f433f3a0..eb5f49d2c46d 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> >   	kvm_clear_interrupt_queue(vcpu);
> >   }
> >   
> > -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
> > -			struct vmcs12 *vmcs12)
> > -{
> > -	u32 entry_failure_code;
> > -
> > -	nested_ept_uninit_mmu_context(vcpu);
> > -
> > -	/*
> > -	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > -	 * couldn't have changed.
> > -	 */
> > -	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > -		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > -
> > -	if (!enable_ept)
> > -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> > -}
> > -
> >   /*
> >    * A part of what we need to when the nested L2 guest exits and we want to
> >    * run its L1 parent, is to reset L1's guest state to the host state specified
> > @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >   				   struct vmcs12 *vmcs12)
> >   {
> >   	struct kvm_segment seg;
> > +	u32 entry_failure_code;
> >   
> >   	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
> >   		vcpu->arch.efer = vmcs12->host_ia32_efer;
> > @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >   	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> >   	vmx_set_cr4(vcpu, vmcs12->host_cr4);
> >   
> > -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +	nested_ept_uninit_mmu_context(vcpu);
> > +
> > +	/*
> > +	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
> > +	 * couldn't have changed.
> > +	 */
> > +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
> > +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
> > +
> > +	if (!enable_ept)
> > +		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
> >   
> >   	/*
> >   	 * If vmcs01 don't use VPID, CPU flushes TLB on every
> > @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
> >   		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> >   }
> >   
> > +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
> > +{
> > +	struct shared_msr_entry *efer_msr;
> > +	unsigned int i;
> > +
> > +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
> > +		return vmcs_read64(GUEST_IA32_EFER);
> > +
> > +	if (cpu_has_load_ia32_efer)
> > +		return host_efer;
> > +
> > +	for (i = 0; i < vmx->msr_autoload.nr; ++i) {
> > +		if (vmx->msr_autoload.guest[i].index == MSR_EFER)
> > +			return vmx->msr_autoload.guest[i].value;
> > +	}
> > +
> > +	efer_msr = find_msr_entry(vmx, MSR_EFER);
> > +	if (efer_msr)
> > +		return efer_msr->data;
> > +
> > +	return host_efer;
> > +}
> > +
> > +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
> For readability purposes, should we add some comments outlining the
> prominent elements that we are restoring in this function?

Hmm, what about a high-level comment?  I'm hesitant to list things
explicitly as that tends to result in stale comments.  And I didn't
have a method for choosing what (not) to restore beyond ensuring
everything that might have been touched in enter_vmx_non_root_mode()
was restored.

Something like:

  Unwind KVM's software model, i.e. vcpu->arch, to L1's pre-VMEnter
  state, i.e. L1 host state, pulling data from vmcs01.  The rule of
  thumb is that anything touched by enter_vmx_non_root_mode() needs
  to be restored.

> > 
> > +{
> > +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +	struct vmx_msr_entry g, h;
> > +	struct msr_data msr;
> > +	gpa_t gpa;
> > +	u32 i, j;
> > +
> > +	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
> > +
> > +	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
> > +		/*
> > +		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
> > +		 * as vmcs01.GUEST_DR7 contains a userspace defined value
> > +		 * and vcpu->arch.dr7 is not squirreled away before the
> > +		 * nested VMENTER (not worth adding a variable in nested_vmx).
> > +		 */
> > +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
> > +			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
> > +		else
> > +			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
> > +	}
> > +
> > +	/*
> > +	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
> > +	 * handle a variety of side effects to KVM's software model.
> > +	 */
> > +	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
> > +
> > +	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
> > +	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
> > +
> > +	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
> > +	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
> > +
> > +	nested_ept_uninit_mmu_context(vcpu);
> > +	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> > +	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
> > +
> > +	/*
> > +	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> > +	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> > +	 * VMFail, like everything else we just need to ensure our
> > +	 * software model is up-to-date.
> > +	 */
> > +	ept_save_pdptrs(vcpu);
> > +
> > +	kvm_mmu_reset_context(vcpu);
> > +
> > +	if (cpu_has_vmx_msr_bitmap())
> > +		vmx_update_msr_bitmap(vcpu);
> > +
> > +	/*
> > +	 * This nasty bit of open coding is a compromise between blindly
> > +	 * loading L1's MSRs using the exit load lists (incorrect emulation
> > +	 * of VMFail), leaving the nested VM's MSRs in the software model
> > +	 * (incorrect behavior) and snapshotting the modified MSRs (too
> > +	 * expensive since the lists are unbound by hardware).  For each
> > +	 * MSR that was (prematurely) loaded from the nested VMEntry load
> > +	 * list, reload it from the exit load list if it exists and differs
> > +	 * from the guest value.  The intent is to stuff host state as
> > +	 * silently as possible, not to fully process the exit load list.
> > +	 */
> > +	msr.host_initiated = false;
> > +	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> > +		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> > +		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> > +			pr_debug_ratelimited(
> > +				"%s read MSR index failed (%u, 0x%08llx)\n",
> > +				__func__, i, gpa);
> > +			goto vmabort;
> > +		}
> > +
> > +		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> > +			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> > +			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> > +				pr_debug_ratelimited(
> > +					"%s read MSR failed (%u, 0x%08llx)\n",
> > +					__func__, j, gpa);
> > +				goto vmabort;
> > +			}
> > +			if (h.index != g.index)
> > +				continue;
> > +			if (h.value == g.value)
> > +				break;
> > +
> > +			if (nested_vmx_load_msr_check(vcpu, &h)) {
> > +				pr_debug_ratelimited(
> > +					"%s check failed (%u, 0x%x, 0x%x)\n",
> > +					__func__, j, h.index, h.reserved);
> > +				goto vmabort;
> > +			}
> > +
> > +			msr.index = h.index;
> > +			msr.data = h.value;
> > +			if (kvm_set_msr(vcpu, &msr)) {
> > +				pr_debug_ratelimited(
> > +					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> > +					__func__, j, h.index, h.value);
> > +				goto vmabort;
> > +			}
> > +		}
> > +	}
> > +
> Is it possible to re-use nested_vmx_load_msr() instead of duplicating 
> the same loops?

We could use nested_vmx_load_msr() as-is to directly load the on-exit list,
but that has the downside of potentially overwriting MSRs that should
not have been touched, or triggering side effects that should not have
occurred.  I considered modifying nested_vmx_load_msr() to splice in
this optional behavior, but the resulting code would be hideous and I
didn't want to convolute an otherwise straightforward function for this
one-off case.

> > 
> > +	return;
> > +
> > +vmabort:
> > +	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
> > +}
> > +
> >   /*
> >    * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
> >    * and modify vmcs12 to make it see what it would expect to see there if
> > @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
> >   	 */
> >   	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
> >   
> > -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
> > +	/*
> > +	 * Restore L1's host state to KVM's software model.  We're here
> > +	 * because a consistency check was caught by hardware, which
> > +	 * means some amount of guest state has been propagated to KVM's
> > +	 * model and needs to be unwound to the host's state.
> > +	 */
> > +	nested_vmx_restore_host_state(vcpu);
> >   
> >   	/*
> >   	 * The emulated instruction was already skipped in
Krish Sadhukhan July 28, 2018, 4:48 a.m. UTC | #5
On 07/24/2018 06:31 AM, Sean Christopherson wrote:
> On Mon, 2018-07-23 at 15:26 -0700, Krish Sadhukhan wrote:
>> On 07/19/2018 10:56 AM, Sean Christopherson wrote:
>>> A VMEnter that VMFails (as opposed to VMExits) does not touch host
>>> state beyond registers that are explicitly noted in the VMFail path,
>>> e.g. EFLAGS.  Host state does not need to be loaded because VMFail
>>> is only signaled for consistency checks that occur before the CPU
>>> starts to load guest state, i.e. there is no need to restore any
>>> state as nothing has been modified.  But in the case where a VMFail
>>> is detected by hardware and not by KVM (due to deferring consistency
>>> checks to hardware), KVM has already loaded some amount of guest
>>> state.  Luckily, "loaded" only means loaded to KVM's software model,
>>> i.e. vmcs01 has not been modified.  So, unwind our software model to
>>> the pre-VMEntry host state.
>>>
>>> Not restoring host state in this VMFail path leads to a variety of
>>> failures because we end up with stale data in vcpu->arch, e.g. CR0,
>>> CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
>>> significant delta in the stale data is all but guaranteed to crash
>>> L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
>>>
>>> An alternative to this "soft" reload would be to load host state from
>>> vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
>>> wildly inconsistent with respect to the VMX architecture, e.g. an L1
>>> VMM with separate VMExit and VMFail paths would explode.
>>>
>>> Note that this approach does not mean KVM is 100% accurate with
>>> respect to VMX hardware behavior, even at an architectural level
>>> (the exact order of consistency checks is microarchitecture specific)
>>> But 100% emulation accuracy isn't our goal, rather our goal is to be
>>> consistent in the information delivered to L1, e.g. a VMExit should
>>> not fall-through VMENTER, and a VMFail should not jump to HOST_RIP.
>>> Achieving perfect emulation for consistency checks is a fool's errand,
>>> e.g. KVM would need to manually detect *every* VMFail before checking
>>> and loading any guest state (a consistency check VMExit never takes
>>> priority over a consistency check VMFail).  Even if KVM managed to
>>> achieve perfection (and didn't crater performance in the process),
>>> the honeymoon would only last until KVM encountered new hardware with
>>> new consistency checks.
>>>
>>> This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
>>> context after VMLAUNCH/VMRESUME failure)", but retains the core
>>> aspects of that patch, just in an open coded form due to the need to
>>> pull state from vmcs01 instead of vmcs12.  Restoring host state
>>> resolves a variety of issues introduced by commit "4f350c6dbcb9
>>> (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
>>> which remedied the incorrect behavior of treating VMFail like VMExit
>>> but in doing so neglected to restore arch state that had been modified
>>> prior to attempting nested VMEnter.
>>>
>>> A sample failure that occurs due to stale vcpu.arch state is a fault
>>> of some form while emulating an LGDT (due to emulated UMIP) from L1
>>> L1 after a failed VMEntry to L3, in this case when running the KVM
>> Nit: "L1" appears twice here.
>>
>>> unit test test_tpr_threshold_values in L1.  L0 also hits a WARN in
>>> this case due to a stale arch.cr4.UMIP.
>>>
>>> L1:
>>>     BUG: unable to handle kernel paging request at ffffc90000663b9e
>>>     PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
>>>     Oops: 0009 [#1] SMP
>>>     CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
>>>     Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
>>>     RIP: 0010:native_load_gdt+0x0/0x10
>>>
>>>     ...
>>>
>>>     Call Trace:
>>>      load_fixmap_gdt+0x22/0x30
>>>      __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
>>>      vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
>>>      nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
>>>      vmx_handle_exit+0x246/0x15a0 [kvm_intel]
>>>      kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
>>>      kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>>>      do_vfs_ioctl+0x9f/0x600
>>>      ksys_ioctl+0x66/0x70
>>>      __x64_sys_ioctl+0x16/0x20
>>>      do_syscall_64+0x4f/0x100
>>>      entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> L0:
>>>     WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
>>>     ...
>>>     CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
>>>     Hardware name: Intel Corporation Kabylake Client platform/KBL S
>>>     RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
>>>
>>>     ...
>>>
>>>     Call Trace:
>>>      kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
>>>      kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
>>>      do_vfs_ioctl+0x9f/0x5e0
>>>      ksys_ioctl+0x66/0x70
>>>      __x64_sys_ioctl+0x16/0x20
>>>      do_syscall_64+0x49/0xf0
>>>      entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>>
>>> Fixes: 5af4157388ad (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
>>> Fixes: 4f350c6dbcb9 (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
>>> Cc: Jim Mattson <jmattson@google.com>
>>> Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>> Cc: Wanpeng Li <wanpeng.li@hotmail.com>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>>> ---
>>>    arch/x86/kvm/vmx.c | 173 +++++++++++++++++++++++++++++++++++++++------
>>>    1 file changed, 153 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 1689f433f3a0..eb5f49d2c46d 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -12199,24 +12199,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>>    	kvm_clear_interrupt_queue(vcpu);
>>>    }
>>>    
>>> -static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
>>> -			struct vmcs12 *vmcs12)
>>> -{
>>> -	u32 entry_failure_code;
>>> -
>>> -	nested_ept_uninit_mmu_context(vcpu);
>>> -
>>> -	/*
>>> -	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
>>> -	 * couldn't have changed.
>>> -	 */
>>> -	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
>>> -		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>>> -
>>> -	if (!enable_ept)
>>> -		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
>>> -}
>>> -
>>>    /*
>>>     * A part of what we need to when the nested L2 guest exits and we want to
>>>     * run its L1 parent, is to reset L1's guest state to the host state specified
>>> @@ -12230,6 +12212,7 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>>>    				   struct vmcs12 *vmcs12)
>>>    {
>>>    	struct kvm_segment seg;
>>> +	u32 entry_failure_code;
>>>    
>>>    	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
>>>    		vcpu->arch.efer = vmcs12->host_ia32_efer;
>>> @@ -12256,7 +12239,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>>>    	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
>>>    	vmx_set_cr4(vcpu, vmcs12->host_cr4);
>>>    
>>> -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
>>> +	nested_ept_uninit_mmu_context(vcpu);
>>> +
>>> +	/*
>>> +	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
>>> +	 * couldn't have changed.
>>> +	 */
>>> +	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
>>> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
>>> +
>>> +	if (!enable_ept)
>>> +		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
>>>    
>>>    	/*
>>>    	 * If vmcs01 don't use VPID, CPU flushes TLB on every
>>> @@ -12352,6 +12345,140 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
>>>    		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
>>>    }
>>>    
>>> +static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
>>> +{
>>> +	struct shared_msr_entry *efer_msr;
>>> +	unsigned int i;
>>> +
>>> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
>>> +		return vmcs_read64(GUEST_IA32_EFER);
>>> +
>>> +	if (cpu_has_load_ia32_efer)
>>> +		return host_efer;
>>> +
>>> +	for (i = 0; i < vmx->msr_autoload.nr; ++i) {
>>> +		if (vmx->msr_autoload.guest[i].index == MSR_EFER)
>>> +			return vmx->msr_autoload.guest[i].value;
>>> +	}
>>> +
>>> +	efer_msr = find_msr_entry(vmx, MSR_EFER);
>>> +	if (efer_msr)
>>> +		return efer_msr->data;
>>> +
>>> +	return host_efer;
>>> +}
>>> +
>>> +static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>> For readability purposes, should we add some comments outlining the
>> prominent elements that we are restoring in this function?
> Hmm, what about a high-level comment?  I'm hesitant to list things
> explicitly as that tends to result in stale comments.  And I didn't
> have a method for choosing what (not) to restore beyond ensuring
> everything that might have been touched in enter_vmx_non_root_mode()
> was restored.
>
> Something like:
>
>    Unwind KVM's software model, i.e. vcpu->arch, to L1's pre-VMEnter
>    state, i.e. L1 host state, pulling data from vmcs01.  The rule of
>    thumb is that anything touched by enter_vmx_non_root_mode() needs
>    to be restored.

This should be good.

>
>>> +{
>>> +	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>>> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
>>> +	struct vmx_msr_entry g, h;
>>> +	struct msr_data msr;
>>> +	gpa_t gpa;
>>> +	u32 i, j;
>>> +
>>> +	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
>>> +
>>> +	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
>>> +		/*
>>> +		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
>>> +		 * as vmcs01.GUEST_DR7 contains a userspace defined value
>>> +		 * and vcpu->arch.dr7 is not squirreled away before the
>>> +		 * nested VMENTER (not worth adding a variable in nested_vmx).
>>> +		 */
>>> +		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
>>> +			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
>>> +		else
>>> +			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
>>> +	}
>>> +
>>> +	/*
>>> +	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
>>> +	 * handle a variety of side effects to KVM's software model.
>>> +	 */
>>> +	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
>>> +
>>> +	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
>>> +	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
>>> +
>>> +	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
>>> +	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
>>> +
>>> +	nested_ept_uninit_mmu_context(vcpu);
>>> +	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
>>> +	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
>>> +
>>> +	/*
>>> +	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
>>> +	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
>>> +	 * VMFail, like everything else we just need to ensure our
>>> +	 * software model is up-to-date.
>>> +	 */
>>> +	ept_save_pdptrs(vcpu);
>>> +
>>> +	kvm_mmu_reset_context(vcpu);
>>> +
>>> +	if (cpu_has_vmx_msr_bitmap())
>>> +		vmx_update_msr_bitmap(vcpu);
>>> +
>>> +	/*
>>> +	 * This nasty bit of open coding is a compromise between blindly
>>> +	 * loading L1's MSRs using the exit load lists (incorrect emulation
>>> +	 * of VMFail), leaving the nested VM's MSRs in the software model
>>> +	 * (incorrect behavior) and snapshotting the modified MSRs (too
>>> +	 * expensive since the lists are unbound by hardware).  For each
>>> +	 * MSR that was (prematurely) loaded from the nested VMEntry load
>>> +	 * list, reload it from the exit load list if it exists and differs
>>> +	 * from the guest value.  The intent is to stuff host state as
>>> +	 * silently as possible, not to fully process the exit load list.
>>> +	 */
>>> +	msr.host_initiated = false;
>>> +	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
>>> +		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
>>> +		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
>>> +			pr_debug_ratelimited(
>>> +				"%s read MSR index failed (%u, 0x%08llx)\n",
>>> +				__func__, i, gpa);
>>> +			goto vmabort;
>>> +		}
>>> +
>>> +		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
>>> +			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
>>> +			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
>>> +				pr_debug_ratelimited(
>>> +					"%s read MSR failed (%u, 0x%08llx)\n",
>>> +					__func__, j, gpa);
>>> +				goto vmabort;
>>> +			}
>>> +			if (h.index != g.index)
>>> +				continue;
>>> +			if (h.value == g.value)
>>> +				break;
>>> +
>>> +			if (nested_vmx_load_msr_check(vcpu, &h)) {
>>> +				pr_debug_ratelimited(
>>> +					"%s check failed (%u, 0x%x, 0x%x)\n",
>>> +					__func__, j, h.index, h.reserved);
>>> +				goto vmabort;
>>> +			}
>>> +
>>> +			msr.index = h.index;
>>> +			msr.data = h.value;
>>> +			if (kvm_set_msr(vcpu, &msr)) {
>>> +				pr_debug_ratelimited(
>>> +					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
>>> +					__func__, j, h.index, h.value);
>>> +				goto vmabort;
>>> +			}
>>> +		}
>>> +	}
>>> +
>> Is it possible to re-use nested_vmx_load_msr() instead of duplicating
>> the same loops?
> We could use nested_vmx_load_msr() as-is to directly load the on-exit list,
> but that has the downside of potentially overwriting MSRs that should
> not have been touched, or triggering side effects that should not have
> occurred.  I considered modifying nested_vmx_load_msr() to splice in
> this optional behavior, but the resulting code would be hideous and I
> didn't want to convolute an otherwise straightforward function for this
> one-off case.

I agree.  Does it make sense to create a new function and call it, say,
nested_vmx_load_modified_msr or nested_vmx_load_select_msr, which might
find use in another part of the code in the future?

>
>>> +	return;
>>> +
>>> +vmabort:
>>> +	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
>>> +}
>>> +
>>>    /*
>>>     * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
>>>     * and modify vmcs12 to make it see what it would expect to see there if
>>> @@ -12490,7 +12617,13 @@ static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
>>>    	 */
>>>    	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
>>>    
>>> -	load_vmcs12_mmu_host_state(vcpu, vmcs12);
>>> +	/*
>>> +	 * Restore L1's host state to KVM's software model.  We're here
>>> +	 * because a consistency check was caught by hardware, which
>>> +	 * means some amount of guest state has been propagated to KVM's
>>> +	 * model and needs to be unwound to the host's state.
>>> +	 */
>>> +	nested_vmx_restore_host_state(vcpu);
>>>    
>>>    	/*
>>>    	 * The emulated instruction was already skipped in
Sean Christopherson July 31, 2018, 3:50 p.m. UTC | #6
On Fri, 2018-07-27 at 21:48 -0700, Krish Sadhukhan wrote:

> On 07/24/2018 06:31 AM, Sean Christopherson wrote:
> > 
> > On Mon, 2018-07-23 at 15:26 -0700, Krish Sadhukhan wrote:
> > > 
> > > On 07/19/2018 10:56 AM, Sean Christopherson wrote:
> > > > 

...

> > > > +	/*
> > > > +	 * This nasty bit of open coding is a compromise between blindly
> > > > +	 * loading L1's MSRs using the exit load lists (incorrect emulation
> > > > +	 * of VMFail), leaving the nested VM's MSRs in the software model
> > > > +	 * (incorrect behavior) and snapshotting the modified MSRs (too
> > > > +	 * expensive since the lists are unbound by hardware).  For each
> > > > +	 * MSR that was (prematurely) loaded from the nested VMEntry load
> > > > +	 * list, reload it from the exit load list if it exists and differs
> > > > +	 * from the guest value.  The intent is to stuff host state as
> > > > +	 * silently as possible, not to fully process the exit load list.
> > > > +	 */
> > > > +	msr.host_initiated = false;
> > > > +	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
> > > > +		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
> > > > +		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
> > > > +			pr_debug_ratelimited(
> > > > +				"%s read MSR index failed (%u, 0x%08llx)\n",
> > > > +				__func__, i, gpa);
> > > > +			goto vmabort;
> > > > +		}
> > > > +
> > > > +		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
> > > > +			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
> > > > +			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
> > > > +				pr_debug_ratelimited(
> > > > +					"%s read MSR failed (%u, 0x%08llx)\n",
> > > > +					__func__, j, gpa);
> > > > +				goto vmabort;
> > > > +			}
> > > > +			if (h.index != g.index)
> > > > +				continue;
> > > > +			if (h.value == g.value)
> > > > +				break;
> > > > +
> > > > +			if (nested_vmx_load_msr_check(vcpu, &h)) {
> > > > +				pr_debug_ratelimited(
> > > > +					"%s check failed (%u, 0x%x, 0x%x)\n",
> > > > +					__func__, j, h.index, h.reserved);
> > > > +				goto vmabort;
> > > > +			}
> > > > +
> > > > +			msr.index = h.index;
> > > > +			msr.data = h.value;
> > > > +			if (kvm_set_msr(vcpu, &msr)) {
> > > > +				pr_debug_ratelimited(
> > > > +					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
> > > > +					__func__, j, h.index, h.value);
> > > > +				goto vmabort;
> > > > +			}
> > > > +		}
> > > > +	}
> > > > +
> > > Is it possible to re-use nested_vmx_load_msr() instead of duplicating
> > > the same loops?
> >
> > We could use nested_vmx_load_msr() as-is to directly load the on-exit list,
> > but that has the downside of potentially overwriting MSRs that should
> > not have been touched, or triggering side effects that should not have
> > occurred.  I considered modifying nested_vmx_load_msr() to splice in
> > this optional behavior, but the resulting code would be hideous and I
> > didn't want to convolute an otherwise straightforward function for this
> > one-off case.
>
> I agree.  Does it make sense to create a new function and call it, say,
> nested_vmx_load_modified_msr or nested_vmx_load_select_msr, which might
> find use in another part of the code in the future?

I don't think so; processing the MSR load lists should be unique to
nested VMEnter.

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 1689f433f3a0..eb5f49d2c46d 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -12199,24 +12199,6 @@  static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	kvm_clear_interrupt_queue(vcpu);
 }
 
-static void load_vmcs12_mmu_host_state(struct kvm_vcpu *vcpu,
-			struct vmcs12 *vmcs12)
-{
-	u32 entry_failure_code;
-
-	nested_ept_uninit_mmu_context(vcpu);
-
-	/*
-	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
-	 * couldn't have changed.
-	 */
-	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
-		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
-
-	if (!enable_ept)
-		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
-}
-
 /*
  * A part of what we need to when the nested L2 guest exits and we want to
  * run its L1 parent, is to reset L1's guest state to the host state specified
@@ -12230,6 +12212,7 @@  static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 				   struct vmcs12 *vmcs12)
 {
 	struct kvm_segment seg;
+	u32 entry_failure_code;
 
 	if (vmcs12->vm_exit_controls & VM_EXIT_LOAD_IA32_EFER)
 		vcpu->arch.efer = vmcs12->host_ia32_efer;
@@ -12256,7 +12239,17 @@  static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
 	vmx_set_cr4(vcpu, vmcs12->host_cr4);
 
-	load_vmcs12_mmu_host_state(vcpu, vmcs12);
+	nested_ept_uninit_mmu_context(vcpu);
+
+	/*
+	 * Only PDPTE load can fail as the value of cr3 was checked on entry and
+	 * couldn't have changed.
+	 */
+	if (nested_vmx_load_cr3(vcpu, vmcs12->host_cr3, false, &entry_failure_code))
+		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);
+
+	if (!enable_ept)
+		vcpu->arch.walk_mmu->inject_page_fault = kvm_inject_page_fault;
 
 	/*
 	 * If vmcs01 don't use VPID, CPU flushes TLB on every
@@ -12352,6 +12345,140 @@  static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
 }
 
+static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
+{
+	struct shared_msr_entry *efer_msr;
+	unsigned int i;
+
+	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_EFER)
+		return vmcs_read64(GUEST_IA32_EFER);
+
+	if (cpu_has_load_ia32_efer)
+		return host_efer;
+
+	for (i = 0; i < vmx->msr_autoload.nr; ++i) {
+		if (vmx->msr_autoload.guest[i].index == MSR_EFER)
+			return vmx->msr_autoload.guest[i].value;
+	}
+
+	efer_msr = find_msr_entry(vmx, MSR_EFER);
+	if (efer_msr)
+		return efer_msr->data;
+
+	return host_efer;
+}
+
+static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
+{
+	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vmx_msr_entry g, h;
+	struct msr_data msr;
+	gpa_t gpa;
+	u32 i, j;
+
+	vcpu->arch.pat = vmcs_read64(GUEST_IA32_PAT);
+
+	if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) {
+		/*
+		 * L1's host DR7 is lost if KVM_GUESTDBG_USE_HW_BP is set
+		 * as vmcs01.GUEST_DR7 contains a userspace defined value
+		 * and vcpu->arch.dr7 is not squirreled away before the
+		 * nested VMENTER (not worth adding a variable in nested_vmx).
+		 */
+		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
+			kvm_set_dr(vcpu, 7, DR7_FIXED_1);
+		else
+			WARN_ON(kvm_set_dr(vcpu, 7, vmcs_readl(GUEST_DR7)));
+	}
+
+	/*
+	 * Note that calling vmx_set_{efer,cr0,cr4} is important as they
+	 * handle a variety of side effects to KVM's software model.
+	 */
+	vmx_set_efer(vcpu, nested_vmx_get_vmcs01_guest_efer(vmx));
+
+	vcpu->arch.cr0_guest_owned_bits = X86_CR0_TS;
+	vmx_set_cr0(vcpu, vmcs_readl(CR0_READ_SHADOW));
+
+	vcpu->arch.cr4_guest_owned_bits = ~vmcs_readl(CR4_GUEST_HOST_MASK);
+	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
+
+	nested_ept_uninit_mmu_context(vcpu);
+	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
+	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
+
+	/*
+	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
+	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
+	 * VMFail, like everything else we just need to ensure our
+	 * software model is up-to-date.
+	 */
+	ept_save_pdptrs(vcpu);
+
+	kvm_mmu_reset_context(vcpu);
+
+	if (cpu_has_vmx_msr_bitmap())
+		vmx_update_msr_bitmap(vcpu);
+
+	/*
+	 * This nasty bit of open coding is a compromise between blindly
+	 * loading L1's MSRs using the exit load lists (incorrect emulation
+	 * of VMFail), leaving the nested VM's MSRs in the software model
+	 * (incorrect behavior) and snapshotting the modified MSRs (too
+	 * expensive since the lists are unbound by hardware).  For each
+	 * MSR that was (prematurely) loaded from the nested VMEntry load
+	 * list, reload it from the exit load list if it exists and differs
+	 * from the guest value.  The intent is to stuff host state as
+	 * silently as possible, not to fully process the exit load list.
+	 */
+	msr.host_initiated = false;
+	for (i = 0; i < vmcs12->vm_entry_msr_load_count; i++) {
+		gpa = vmcs12->vm_entry_msr_load_addr + (i * sizeof(g));
+		if (kvm_vcpu_read_guest(vcpu, gpa, &g, sizeof(g))) {
+			pr_debug_ratelimited(
+				"%s read MSR index failed (%u, 0x%08llx)\n",
+				__func__, i, gpa);
+			goto vmabort;
+		}
+
+		for (j = 0; j < vmcs12->vm_exit_msr_load_count; j++) {
+			gpa = vmcs12->vm_exit_msr_load_addr + (j * sizeof(h));
+			if (kvm_vcpu_read_guest(vcpu, gpa, &h, sizeof(h))) {
+				pr_debug_ratelimited(
+					"%s read MSR failed (%u, 0x%08llx)\n",
+					__func__, j, gpa);
+				goto vmabort;
+			}
+			if (h.index != g.index)
+				continue;
+			if (h.value == g.value)
+				break;
+
+			if (nested_vmx_load_msr_check(vcpu, &h)) {
+				pr_debug_ratelimited(
+					"%s check failed (%u, 0x%x, 0x%x)\n",
+					__func__, j, h.index, h.reserved);
+				goto vmabort;
+			}
+
+			msr.index = h.index;
+			msr.data = h.value;
+			if (kvm_set_msr(vcpu, &msr)) {
+				pr_debug_ratelimited(
+					"%s WRMSR failed (%u, 0x%x, 0x%llx)\n",
+					__func__, j, h.index, h.value);
+				goto vmabort;
+			}
+		}
+	}
+
+	return;
+
+vmabort:
+	nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_MSR_FAIL);
+}
+
 /*
  * Emulate an exit from nested guest (L2) to L1, i.e., prepare to run L1
  * and modify vmcs12 to make it see what it would expect to see there if
@@ -12490,7 +12617,13 @@  static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 	 */
 	nested_vmx_failValid(vcpu, VMXERR_ENTRY_INVALID_CONTROL_FIELD);
 
-	load_vmcs12_mmu_host_state(vcpu, vmcs12);
+	/*
+	 * Restore L1's host state to KVM's software model.  We're here
+	 * because a consistency check was caught by hardware, which
+	 * means some amount of guest state has been propagated to KVM's
+	 * model and needs to be unwound to the host's state.
+	 */
+	nested_vmx_restore_host_state(vcpu);
 
 	/*
 	 * The emulated instruction was already skipped in