[1/2] KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter

Message ID 20190926214302.21990-2-sean.j.christopherson@intel.com
State New
Series
  • KVM: nVMX: Bug fix for consuming stale vmcs02.GUEST_CR3

Commit Message

Sean Christopherson Sept. 26, 2019, 9:43 p.m. UTC
Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
VM-Exit occurs without actually entering L2, e.g. if the nested run
is squashed because L2 is being put into HLT.

In an ideal world where EPT *requires* unrestricted guest (and vice
versa), VMX could handle CR3 similar to how it handles RSP and RIP,
e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
the unrestricted guest silliness complicates the dirty tracking logic
to the point that explicitly handling vmcs02.GUEST_CR3 during nested
VM-Enter is a simpler overall implementation.

Cc: stable@vger.kernel.org
Reported-by: Reto Buerki <reet@codelabs.ch>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 8 ++++++++
 arch/x86/kvm/vmx/vmx.c    | 9 ++++++---
 2 files changed, 14 insertions(+), 3 deletions(-)

Comments

Jim Mattson Sept. 26, 2019, 11:39 p.m. UTC | #1
On Thu, Sep 26, 2019 at 2:43 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because L2 is being put into HLT.
>
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
>
> Cc: stable@vger.kernel.org
> Reported-by: Reto Buerki <reet@codelabs.ch>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 8 ++++++++
>  arch/x86/kvm/vmx/vmx.c    | 9 ++++++---
>  2 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 41abc62c9a8a..971a24134081 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>                                 entry_failure_code))
>                 return -EINVAL;
>
> +       /*
> +        * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> +        * on nested VM-Exit, which can occur without actually running L2, e.g.
> +        * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> +        */
> +       if (enable_ept)
> +               vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> +

This part of the change I can definitely confirm, since we have a
similar change in our fetid pile of commits that have yet to be
upstreamed. Thanks for taking care of this one!

>         /* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
>         if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
>             is_pae_paging(vcpu)) {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d4575ffb3cec..b530950a9c2b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>  {
>         struct kvm *kvm = vcpu->kvm;
>         unsigned long guest_cr3;
> +       bool skip_cr3 = false;
>         u64 eptp;
>
>         guest_cr3 = cr3;
> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>                         spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
>                 }
>
> -               if (enable_unrestricted_guest || is_paging(vcpu) ||
> -                   is_guest_mode(vcpu))
> +               if (is_guest_mode(vcpu))
> +                       skip_cr3 = true;
> +               else if (enable_unrestricted_guest || is_paging(vcpu))
>                         guest_cr3 = kvm_read_cr3(vcpu);
>                 else
>                         guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
>                 ept_load_pdptrs(vcpu);
>         }
>
> -       vmcs_writel(GUEST_CR3, guest_cr3);
> +       if (!skip_cr3)
> +               vmcs_writel(GUEST_CR3, guest_cr3);

Is this part of the change necessary, or is it just an optimization to
save a redundant VMWRITE?

>  }
>
>  int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> --
> 2.22.0
>
Liran Alon Sept. 27, 2019, 12:06 a.m. UTC | #2
> On 27 Sep 2019, at 0:43, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> VM-Exit occurs without actually entering L2, e.g. if the nested run
> is squashed because L2 is being put into HLT.

I would rephrase to “If an emulated VMEntry is squashed because L1 sets vmcs12->guest_activity_state to HLT”.
I think it’s a bit more explicit.

> 
> In an ideal world where EPT *requires* unrestricted guest (and vice
> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> the unrestricted guest silliness complicates the dirty tracking logic
> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> VM-Enter is a simpler overall implementation.
> 
> Cc: stable@vger.kernel.org
> Reported-by: Reto Buerki <reet@codelabs.ch>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---
> arch/x86/kvm/vmx/nested.c | 8 ++++++++
> arch/x86/kvm/vmx/vmx.c    | 9 ++++++---
> 2 files changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 41abc62c9a8a..971a24134081 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> 				entry_failure_code))
> 		return -EINVAL;
> 
> +	/*
> +	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> +	 * on nested VM-Exit, which can occur without actually running L2, e.g.
> +	 * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> +	 */

If I understand correctly, it’s not exactly when L2 is entering HLT state in general.
(E.g. the issue doesn’t occur if L2 runs HLT directly and HLT is not configured to be intercepted by vmcs12.)
It’s specifically when L1 enters L2 with a HLT guest-activity-state. I suggest rephrasing the comment.

> +	if (enable_ept)
> +		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> +
> 	/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
> 	if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
> 	    is_pae_paging(vcpu)) {
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index d4575ffb3cec..b530950a9c2b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> {
> 	struct kvm *kvm = vcpu->kvm;
> 	unsigned long guest_cr3;
> +	bool skip_cr3 = false;
> 	u64 eptp;
> 
> 	guest_cr3 = cr3;
> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> 			spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
> 		}
> 
> -		if (enable_unrestricted_guest || is_paging(vcpu) ||
> -		    is_guest_mode(vcpu))
> +		if (is_guest_mode(vcpu))
> +			skip_cr3 = true;
> +		else if (enable_unrestricted_guest || is_paging(vcpu))
> 			guest_cr3 = kvm_read_cr3(vcpu);
> 		else
> 			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> 		ept_load_pdptrs(vcpu);
> 	}
> 
> -	vmcs_writel(GUEST_CR3, guest_cr3);
> +	if (!skip_cr3)

Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool write_guest_cr3”.

Anyway, code seems valid to me. Nice catch.
Reviewed-by: Liran Alon <liran.alon@oracle.com>

-Liran

> +		vmcs_writel(GUEST_CR3, guest_cr3);
> }
> 
> int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> -- 
> 2.22.0
>
Sean Christopherson Sept. 27, 2019, 2:22 p.m. UTC | #3
On Thu, Sep 26, 2019 at 04:39:28PM -0700, Jim Mattson wrote:
> On Thu, Sep 26, 2019 at 2:43 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> > -       vmcs_writel(GUEST_CR3, guest_cr3);
> > +       if (!skip_cr3)
> > +               vmcs_writel(GUEST_CR3, guest_cr3);
> 
> Is this part of the change necessary, or is it just an optimization to
> save a redundant VMWRITE?

Save the redundant VMWRITE.  I also wanted to convey the idea that the
nested code is responsible for setting GUEST_CR3 to the correct value.
I'd be happy to add a comment to make the latter point explicit.
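[Editorial note: such a comment might read something like the following sketch; this is
hypothetical wording, not text from the patch or the thread:]

	/*
	 * Don't write GUEST_CR3 while in guest mode: nested VM-Enter
	 * (prepare_vmcs02()) is responsible for writing vmcs02.GUEST_CR3.
	 */
	if (!skip_cr3)
		vmcs_writel(GUEST_CR3, guest_cr3);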

> >  }
> >
> >  int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> > --
> > 2.22.0
> >
Sean Christopherson Sept. 27, 2019, 2:27 p.m. UTC | #4
On Fri, Sep 27, 2019 at 03:06:02AM +0300, Liran Alon wrote:
> 
> 
> > On 27 Sep 2019, at 0:43, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
> > instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
> > is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
> > refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
> > VM-Exit occurs without actually entering L2, e.g. if the nested run
> > is squashed because L2 is being put into HLT.
> 
> I would rephrase to “If an emulated VMEntry is squashed because L1 sets
> vmcs12->guest_activity_state to HLT”.  I think it’s a bit more explicit.
> 
> > 
> > In an ideal world where EPT *requires* unrestricted guest (and vice
> > versa), VMX could handle CR3 similar to how it handles RSP and RIP,
> > e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
> > the unrestricted guest silliness complicates the dirty tracking logic
> > to the point that explicitly handling vmcs02.GUEST_CR3 during nested
> > VM-Enter is a simpler overall implementation.
> > 
> > Cc: stable@vger.kernel.org
> > Reported-by: Reto Buerki <reet@codelabs.ch>
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> > arch/x86/kvm/vmx/nested.c | 8 ++++++++
> > arch/x86/kvm/vmx/vmx.c    | 9 ++++++---
> > 2 files changed, 14 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 41abc62c9a8a..971a24134081 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
> > 				entry_failure_code))
> > 		return -EINVAL;
> > 
> > +	/*
> > +	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> > +	 * on nested VM-Exit, which can occur without actually running L2, e.g.
> > +	 * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> > +	 */
> 
> If I understand correctly, it’s not exactly if L2 is entering HLT state in
> general.  (E.g. issue doesn’t occur if L2 runs HLT directly which is not
> configured to be intercepted by vmcs12).  It’s specifically when L1 enters L2
> with a HLT guest-activity-state. I suggest rephrasing comment.

I deliberately worded the comment so that it remains valid if there are
more conditions in the future that cause KVM to skip running L2.  What if
I split the difference and make the changelog more explicit, but leave the
comment as is?

> > +	if (enable_ept)
> > +		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
> > +
> > 	/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
> > 	if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
> > 	    is_pae_paging(vcpu)) {
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index d4575ffb3cec..b530950a9c2b 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> > {
> > 	struct kvm *kvm = vcpu->kvm;
> > 	unsigned long guest_cr3;
> > +	bool skip_cr3 = false;
> > 	u64 eptp;
> > 
> > 	guest_cr3 = cr3;
> > @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
> > 			spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
> > 		}
> > 
> > -		if (enable_unrestricted_guest || is_paging(vcpu) ||
> > -		    is_guest_mode(vcpu))
> > +		if (is_guest_mode(vcpu))
> > +			skip_cr3 = true;
> > +		else if (enable_unrestricted_guest || is_paging(vcpu))
> > 			guest_cr3 = kvm_read_cr3(vcpu);
> > 		else
> > 			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
> > 		ept_load_pdptrs(vcpu);
> > 	}
> > 
> > -	vmcs_writel(GUEST_CR3, guest_cr3);
> > +	if (!skip_cr3)
> 
> Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool
> write_guest_cr3”.
> 
> Anyway, code seems valid to me. Nice catch.
> Reviewed-by: Liran Alon <liran.alon@oracle.com>
> 
> -Liran
> 
> > +		vmcs_writel(GUEST_CR3, guest_cr3);
> > }
> > 
> > int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> > -- 
> > 2.22.0
> > 
>
Liran Alon Sept. 27, 2019, 2:44 p.m. UTC | #5
> On 27 Sep 2019, at 17:27, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Fri, Sep 27, 2019 at 03:06:02AM +0300, Liran Alon wrote:
>> 
>> 
>>> On 27 Sep 2019, at 0:43, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>> Write the desired L2 CR3 into vmcs02.GUEST_CR3 during nested VM-Enter
>>> instead of deferring the VMWRITE until vmx_set_cr3().  If the VMWRITE
>>> is deferred, then KVM can consume a stale vmcs02.GUEST_CR3 when it
>>> refreshes vmcs12->guest_cr3 during nested_vmx_vmexit() if the emulated
>>> VM-Exit occurs without actually entering L2, e.g. if the nested run
>>> is squashed because L2 is being put into HLT.
>> 
>> I would rephrase to “If an emulated VMEntry is squashed because L1 sets
>> vmcs12->guest_activity_state to HLT”.  I think it’s a bit more explicit.
>> 
>>> 
>>> In an ideal world where EPT *requires* unrestricted guest (and vice
>>> versa), VMX could handle CR3 similar to how it handles RSP and RIP,
>>> e.g. mark CR3 dirty and conditionally load it at vmx_vcpu_run().  But
>>> the unrestricted guest silliness complicates the dirty tracking logic
>>> to the point that explicitly handling vmcs02.GUEST_CR3 during nested
>>> VM-Enter is a simpler overall implementation.
>>> 
>>> Cc: stable@vger.kernel.org
>>> Reported-by: Reto Buerki <reet@codelabs.ch>
>>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>>> ---
>>> arch/x86/kvm/vmx/nested.c | 8 ++++++++
>>> arch/x86/kvm/vmx/vmx.c    | 9 ++++++---
>>> 2 files changed, 14 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>> index 41abc62c9a8a..971a24134081 100644
>>> --- a/arch/x86/kvm/vmx/nested.c
>>> +++ b/arch/x86/kvm/vmx/nested.c
>>> @@ -2418,6 +2418,14 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>> 				entry_failure_code))
>>> 		return -EINVAL;
>>> 
>>> +	/*
>>> +	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
>>> +	 * on nested VM-Exit, which can occur without actually running L2, e.g.
>>> +	 * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
>>> +	 */
>> 
>> If I understand correctly, it’s not exactly if L2 is entering HLT state in
>> general.  (E.g. issue doesn’t occur if L2 runs HLT directly which is not
>> configured to be intercepted by vmcs12).  It’s specifically when L1 enters L2
>> with a HLT guest-activity-state. I suggest rephrasing comment.
> 
> I deliberately worded the comment so that it remains valid if there are
> more conditions in the future that cause KVM to skip running L2.  What if
> I split the difference and make the changelog more explicit, but leave the
> comment as is?

I think what’s confusing in the comment is that it seems to also refer to the case
where L2 directly enters HLT state without an L1 intercept, which isn’t related.
So I would explicitly mention that it’s when L1 enters L2 but KVM doesn’t physically enter
the guest with vmcs02 because L2 is in HLT state.

-Liran

> 
>>> +	if (enable_ept)
>>> +		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
>>> +
>>> 	/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
>>> 	if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
>>> 	    is_pae_paging(vcpu)) {
>>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>>> index d4575ffb3cec..b530950a9c2b 100644
>>> --- a/arch/x86/kvm/vmx/vmx.c
>>> +++ b/arch/x86/kvm/vmx/vmx.c
>>> @@ -2985,6 +2985,7 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>>> {
>>> 	struct kvm *kvm = vcpu->kvm;
>>> 	unsigned long guest_cr3;
>>> +	bool skip_cr3 = false;
>>> 	u64 eptp;
>>> 
>>> 	guest_cr3 = cr3;
>>> @@ -3000,15 +3001,17 @@ void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
>>> 			spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
>>> 		}
>>> 
>>> -		if (enable_unrestricted_guest || is_paging(vcpu) ||
>>> -		    is_guest_mode(vcpu))
>>> +		if (is_guest_mode(vcpu))
>>> +			skip_cr3 = true;
>>> +		else if (enable_unrestricted_guest || is_paging(vcpu))
>>> 			guest_cr3 = kvm_read_cr3(vcpu);
>>> 		else
>>> 			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
>>> 		ept_load_pdptrs(vcpu);
>>> 	}
>>> 
>>> -	vmcs_writel(GUEST_CR3, guest_cr3);
>>> +	if (!skip_cr3)
>> 
>> Nit: It’s a matter of taste, but I prefer positive conditions. i.e. “bool
>> write_guest_cr3”.
>> 
>> Anyway, code seems valid to me. Nice catch.
>> Reviewed-by: Liran Alon <liran.alon@oracle.com>
>> 
>> -Liran
>> 
>>> +		vmcs_writel(GUEST_CR3, guest_cr3);
>>> }
>>> 
>>> int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>>> -- 
>>> 2.22.0
>>> 
>>
Sean Christopherson Sept. 27, 2019, 3:02 p.m. UTC | #6
On Fri, Sep 27, 2019 at 05:44:53PM +0300, Liran Alon wrote:
> 
> > On 27 Sep 2019, at 17:27, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > On Fri, Sep 27, 2019 at 03:06:02AM +0300, Liran Alon wrote:
> >> 
> >>> On 27 Sep 2019, at 0:43, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> >>> 
> >>> +	/*
> >>> +	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
> >>> +	 * on nested VM-Exit, which can occur without actually running L2, e.g.
> >>> +	 * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
> >>> +	 */
> >> 
> >> If I understand correctly, it’s not exactly if L2 is entering HLT state in
> >> general.  (E.g. issue doesn’t occur if L2 runs HLT directly which is not
> >> configured to be intercepted by vmcs12).  It’s specifically when L1 enters L2
> >> with a HLT guest-activity-state. I suggest rephrasing comment.
> > 
> > I deliberately worded the comment so that it remains valid if there are
> > more conditions in the future that cause KVM to skip running L2.  What if
> > I split the difference and make the changelog more explicit, but leave the
> > comment as is?
> 
> I think what is confusing in comment is that it seems to also refer to the case
> where L2 directly enters HLT state without L1 intercept. Which isn’t related.
> So I would explicitly mention it’s when L1 enters L2 but don’t physically enter guest
> with vmcs02 because L2 is in HLT state.

Ah, gotcha, I'll tweak the wording.
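[Editorial note: a possible rewording along the lines Liran suggests, shown here as a sketch
rather than as the wording that was ultimately committed:]

	/*
	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to
	 * vmcs12 on nested VM-Exit, which can occur without actually
	 * running L2 and thus without hitting vmx_set_cr3(), e.g. if L1
	 * enters L2 with vmcs12.GUEST_ACTIVITY_STATE == HLT.
	 */
	if (enable_ept)
		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);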

Patch

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 41abc62c9a8a..971a24134081 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2418,6 +2418,14 @@  static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 				entry_failure_code))
 		return -EINVAL;
 
+	/*
+	 * Immediately write vmcs02.GUEST_CR3.  It will be propagated to vmcs12
+	 * on nested VM-Exit, which can occur without actually running L2, e.g.
+	 * if L2 is entering HLT state, and thus without hitting vmx_set_cr3().
+	 */
+	if (enable_ept)
+		vmcs_writel(GUEST_CR3, vmcs12->guest_cr3);
+
 	/* Late preparation of GUEST_PDPTRs now that EFER and CRs are set. */
 	if (load_guest_pdptrs_vmcs12 && nested_cpu_has_ept(vmcs12) &&
 	    is_pae_paging(vcpu)) {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d4575ffb3cec..b530950a9c2b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2985,6 +2985,7 @@  void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 {
 	struct kvm *kvm = vcpu->kvm;
 	unsigned long guest_cr3;
+	bool skip_cr3 = false;
 	u64 eptp;
 
 	guest_cr3 = cr3;
@@ -3000,15 +3001,17 @@  void vmx_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
 			spin_unlock(&to_kvm_vmx(kvm)->ept_pointer_lock);
 		}
 
-		if (enable_unrestricted_guest || is_paging(vcpu) ||
-		    is_guest_mode(vcpu))
+		if (is_guest_mode(vcpu))
+			skip_cr3 = true;
+		else if (enable_unrestricted_guest || is_paging(vcpu))
 			guest_cr3 = kvm_read_cr3(vcpu);
 		else
 			guest_cr3 = to_kvm_vmx(kvm)->ept_identity_map_addr;
 		ept_load_pdptrs(vcpu);
 	}
 
-	vmcs_writel(GUEST_CR3, guest_cr3);
+	if (!skip_cr3)
+		vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
 int vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)