KVM: nVMX: initialize PML fields in vmcs02

Message ID 20170404121853.28057-1-lprosek@redhat.com (mailing list archive)
State New, archived

Commit Message

Ladi Prosek April 4, 2017, 12:18 p.m. UTC
L2 was running with uninitialized PML fields which led to incomplete
dirty bitmap logging. This manifested as all kinds of subtle erratic
behavior of the nested guest.

Fixes: 843e4330573c ("KVM: VMX: Add PML support in VMX")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
---
 arch/x86/kvm/vmx.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

Comments

David Hildenbrand April 4, 2017, 12:44 p.m. UTC | #1
On 04.04.2017 14:18, Ladi Prosek wrote:
> L2 was running with uninitialized PML fields which led to incomplete
> dirty bitmap logging. This manifested as all kinds of subtle erratic
> behavior of the nested guest.
> 
> Fixes: 843e4330573c ("KVM: VMX: Add PML support in VMX")
> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
> ---
>  arch/x86/kvm/vmx.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2ee00db..f47d701 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10267,6 +10267,18 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  
>  	}
>  
> +	if (enable_pml) {
> +		/*
> +		 * Conceptually we want to copy the PML address and index from
> +		 * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
> +		 * since we always flush the log on each vmexit, this happens

we == KVM running in g2?

If so, other hypervisors might handle this differently.

> +		 * to be equivalent to simply resetting the fields in vmcs02.
> +		 */
> +		ASSERT(vmx->pml_pg);
> +		vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
> +		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
> +	}
> +
>  	if (nested_cpu_has_ept(vmcs12)) {
>  		kvm_mmu_unload(vcpu);
>  		nested_ept_init_mmu_context(vcpu);
>
Ladi Prosek April 4, 2017, 12:55 p.m. UTC | #2
On Tue, Apr 4, 2017 at 2:44 PM, David Hildenbrand <david@redhat.com> wrote:
> On 04.04.2017 14:18, Ladi Prosek wrote:
>> L2 was running with uninitialized PML fields which led to incomplete
>> dirty bitmap logging. This manifested as all kinds of subtle erratic
>> behavior of the nested guest.
>>
>> Fixes: 843e4330573c ("KVM: VMX: Add PML support in VMX")
>> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
>> ---
>>  arch/x86/kvm/vmx.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 2ee00db..f47d701 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -10267,6 +10267,18 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>>
>>       }
>>
>> +     if (enable_pml) {
>> +             /*
>> +              * Conceptually we want to copy the PML address and index from
>> +              * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
>> +              * since we always flush the log on each vmexit, this happens
>
> we == KVM running in g2?
>
> If so, other hypervisors might handle this differently.

No, we as KVM in L0. Hypervisors running in L1 do not see PML at all,
this is L0-only code.

I hope the comment is not confusing. The desired behavior is that PML
maintains the same state, regardless of whether we are in guest mode
or not. But the implementation allows for this shortcut where we just
reset the fields to their initial values on each nested entry.
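
For reference, the flush referred to here runs unconditionally on every vmexit; a condensed, paraphrased sketch of that path, modeled on vmx_flush_pml_buffer() in arch/x86/kvm/vmx.c (not the exact tree contents), looks roughly like this:

static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
{
	struct vcpu_vmx *vmx = to_vmx(vcpu);
	u64 *pml_buf;
	u16 pml_idx;

	pml_idx = vmcs_read16(GUEST_PML_INDEX);

	/* Nothing was logged since the last flush. */
	if (pml_idx == (PML_ENTITY_NUM - 1))
		return;

	/* The index counts down and points at the next free entry. */
	if (pml_idx >= PML_ENTITY_NUM)
		pml_idx = 0;
	else
		pml_idx++;

	/* Mark every logged GPA dirty in L0's dirty bitmap. */
	pml_buf = page_address(vmx->pml_pg);
	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
		u64 gpa = pml_buf[pml_idx];

		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
	}

	/* Reset the index, i.e. the state prepare_vmcs02() now re-creates. */
	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
}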

>> +              * to be equivalent to simply resetting the fields in vmcs02.
>> +              */
>> +             ASSERT(vmx->pml_pg);
>> +             vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
>> +             vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
>> +     }
>> +
>>       if (nested_cpu_has_ept(vmcs12)) {
>>               kvm_mmu_unload(vcpu);
>>               nested_ept_init_mmu_context(vcpu);
>>
>
>
> --
>
> Thanks,
>
> David
David Hildenbrand April 4, 2017, 1:09 p.m. UTC | #3
>>> +     if (enable_pml) {
>>> +             /*
>>> +              * Conceptually we want to copy the PML address and index from
>>> +              * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
>>> +              * since we always flush the log on each vmexit, this happens
>>
>> we == KVM running in g2?
>>
>> If so, other hypervisors might handle this differently.
> 
> No, we as KVM in L0. Hypervisors running in L1 do not see PML at all,
> this is L0-only code.

Okay, was just confused why we enable PML for our nested guest (L2)
although not supported/enabled for guest hypervisors (L1). I would have
guessed that it is to be kept disabled completely for nested guests
(!SECONDARY_EXEC_ENABLE_PML).

But I assume that this is a mysterious detail of the MMU code I still have
to look into in detail.

> 
> I hope the comment is not confusing. The desired behavior is that PML
> maintains the same state, regardless of whether we are in guest mode
> or not. But the implementation allows for this shortcut where we just
> reset the fields to their initial values on each nested entry.

If we really treat PML here just like ordinary L1 runs, then it makes
perfect sense and the comment is not confusing. vmcs01 says it all. Just
me being curious :)

> 
>>> +              * to be equivalent to simply resetting the fields in vmcs02.
>>> +              */
>>> +             ASSERT(vmx->pml_pg);

Looking at the code (especially the check in vmx_vcpu_setup()), I think
this ASSERT can be removed.

>>> +             vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
>>> +             vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);

So this really just mimics vmx_vcpu_setup() pml handling here.

>>> +     }
>>> +
>>>       if (nested_cpu_has_ept(vmcs12)) {
>>>               kvm_mmu_unload(vcpu);
>>>               nested_ept_init_mmu_context(vcpu);
>>>
>>
Ladi Prosek April 4, 2017, 1:19 p.m. UTC | #4
On Tue, Apr 4, 2017 at 3:09 PM, David Hildenbrand <david@redhat.com> wrote:
>
>>>> +     if (enable_pml) {
>>>> +             /*
>>>> +              * Conceptually we want to copy the PML address and index from
>>>> +              * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
>>>> +              * since we always flush the log on each vmexit, this happens
>>>
>>> we == KVM running in g2?
>>>
>>> If so, other hypervisors might handle this differently.
>>
>> No, we as KVM in L0. Hypervisors running in L1 do not see PML at all,
>> this is L0-only code.
>
> Okay, was just confused why we enable PML for our nested guest (L2)
> although not supported/enabled for guest hypervisors (L1). I would have
> guessed that it is to be kept disabled completely for nested guests
> (!SECONDARY_EXEC_ENABLE_PML).
>
> But I assume that this is a mysterious detail of the MMU code I still have
> to look into in detail.

L1 doesn't see PML but L0 uses it for its own bookkeeping. So it's
enabled in vmcs02 (what the CPU uses) but not in vmcs12 (what L1
sees).
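
To illustrate that split with a condensed, paraphrased sketch (not the exact kernel code): the secondary controls L0 programs into vmcs01, and reuses as the starting point for vmcs02, keep the PML bit whenever enable_pml is set, while the nested VMX capability MSRs exposed to L1 never advertise it:

/* What L0 programs into vmcs01 and reuses for vmcs02 (condensed). */
static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
{
	u32 exec_control = vmcs_config.cpu_based_2nd_exec_ctrl;

	/* ... other feature gating elided ... */
	if (!enable_pml)
		exec_control &= ~SECONDARY_EXEC_ENABLE_PML;

	return exec_control;
}

/*
 * What L1 may ask for: nested_vmx_setup_ctls_msrs() builds
 * nested_vmx_secondary_ctls_high by OR-ing together only the controls
 * supported for nested guests, and SECONDARY_EXEC_ENABLE_PML is not
 * among them, so a vmcs12 that sets the bit fails the control checks
 * on nested VM entry.
 */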

>>
>> I hope the comment is not confusing. The desired behavior is that PML
>> maintains the same state, regardless of whether we are in guest mode
>> or not. But the implementation allows for this shortcut where we just
>> reset the fields to their initial values on each nested entry.
>
> If we really treat PML here just like ordinary L1 runs, then it makes
> perfect sense and the comment is not confusing. vmcs01 says it all. Just
> me being curious :)
>
>>
>>>> +              * to be equivalent to simply resetting the fields in vmcs02.
>>>> +              */
>>>> +             ASSERT(vmx->pml_pg);
>
> Looking at the code (especially the check in vmx_vcpu_setup()), I think
> this ASSERT can be removed.
>
>>>> +             vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
>>>> +             vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
>
> So this really just mimics vmx_vcpu_setup() pml handling here.

Exactly. And I copied the assert as well, even though it's not
strictly necessary as you pointed out.

>>>> +     }
>>>> +
>>>>       if (nested_cpu_has_ept(vmcs12)) {
>>>>               kvm_mmu_unload(vcpu);
>>>>               nested_ept_init_mmu_context(vcpu);
>>>>
>>>
>
>
> --
>
> Thanks,
>
> David
David Hildenbrand April 4, 2017, 1:25 p.m. UTC | #5
On 04.04.2017 15:09, David Hildenbrand wrote:
> 
>>>> +     if (enable_pml) {
>>>> +             /*
>>>> +              * Conceptually we want to copy the PML address and index from
>>>> +              * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
>>>> +              * since we always flush the log on each vmexit, this happens
>>>
>>> we == KVM running in g2?
>>>
>>> If so, other hypervisors might handle this differently.
>>
>> No, we as KVM in L0. Hypervisors running in L1 do not see PML at all,
>> this is L0-only code.
> 
> Okay, was just confused why we enable PML for our nested guest (L2)
> although not supported/enabled for guest hypervisors (L1). I would have
> guessed that it is to be kept disabled completely for nested guests
> (!SECONDARY_EXEC_ENABLE_PML).
> 
> But I assume that this is a mysterious detail of the MMU code I still have
> to look into in detail.
> 

So for secondary exec controls we:

1. enable almost any exec control enabled also for our L1 (except 4 of
them)
-> slightly scary, but I hope somebody thought well of this
2. blindly copy over whatever L2 gave us
-> very scary

Especially if I am not wrong:

PML available on HW but disabled by setting "enable_pml = 0".
L1 blindly enabling PML for L2.

We now run our vmcs02 with SECONDARY_EXEC_ENABLE_PML without pml regions
being set up.

Am I missing a whitelist somewhere? I hope so. Such things should always
have whitelists.
David Hildenbrand April 4, 2017, 1:34 p.m. UTC | #6
On 04.04.2017 15:19, Ladi Prosek wrote:
> On Tue, Apr 4, 2017 at 3:09 PM, David Hildenbrand <david@redhat.com> wrote:
>>
>>>>> +     if (enable_pml) {
>>>>> +             /*
>>>>> +              * Conceptually we want to copy the PML address and index from
>>>>> +              * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
>>>>> +              * since we always flush the log on each vmexit, this happens
>>>>
>>>> we == KVM running in g2?
>>>>
>>>> If so, other hypervisors might handle this differently.
>>>
>>> No, we as KVM in L0. Hypervisors running in L1 do not see PML at all,
>>> this is L0-only code.
>>
>> Okay, was just confused why we enable PML for our nested guest (L2)
>> although not supported/enabled for guest hypervisors (L1). I would have
>> guessed that it is to be kept disabled completely for nested guests
>> (!SECONDARY_EXEC_ENABLE_PML).
>>
>> But I assume that this is a mysterious detail of the MMU code I still have
>> to look into in detail.
> 
> L1 doesn't see PML but L0 uses it for its own bookkeeping. So it's
> enabled in vmcs02 (what the CPU uses) but not in vmcs12 (what L1
> sees).

So this looks just fine to me. But as I said, I haven't looked into the
MMU code in that much detail yet.
Ladi Prosek April 4, 2017, 1:37 p.m. UTC | #7
On Tue, Apr 4, 2017 at 3:25 PM, David Hildenbrand <david@redhat.com> wrote:
> On 04.04.2017 15:09, David Hildenbrand wrote:
>>
>>>>> +     if (enable_pml) {
>>>>> +             /*
>>>>> +              * Conceptually we want to copy the PML address and index from
>>>>> +              * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
>>>>> +              * since we always flush the log on each vmexit, this happens
>>>>
>>>> we == KVM running in g2?
>>>>
>>>> If so, other hypervisors might handle this differently.
>>>
>>> No, we as KVM in L0. Hypervisors running in L1 do not see PML at all,
>>> this is L0-only code.
>>
>> Okay, was just confused why we enable PML for our nested guest (L2)
>> although not supported/enabled for guest hypervisors (L1). I would have
>> guessed that it is to be kept disabled completely for nested guests
>> (!SECONDARY_EXEC_ENABLE_PML).
>>
>> But I assume that this is a mysterious detail of the MMU code I still have
>> to look into in detail.
>>
>
> So for secondary exec controls we:
>
> 1. enable almost any exec control enabled also for our L1 (except 4 of
> them)
> -> slightly scary, but I hope somebody thought well of this
> 2. blindly copy over whatever L2 gave us
> -> very scary
>
> Especially if I am not wrong:
>
> PML available on HW but disabled by setting "enable_pml = 0".
> L1 blindly enabling PML for L2.
>
> We now run our vmcs02 with SECONDARY_EXEC_ENABLE_PML without pml regions
> being set up.
>
> Am I missing a whitelist somewhere? I hope so. Such things should always
> have whitelists.

I believe that this is covered in check_vmentry_prereqs:

       !vmx_control_verify(vmcs12->secondary_vm_exec_control,
                    vmx->nested.nested_vmx_secondary_ctls_low,
                    vmx->nested.nested_vmx_secondary_ctls_high) ||

where vmx->nested.nested_vmx_secondary_ctls_* represent the whitelist.
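
For context, the verify helper implements exactly that whitelist semantic; a paraphrased sketch of vmx_control_verify():

static inline bool vmx_control_verify(u32 control, u32 low, u32 high)
{
	/*
	 * Every bit set in 'control' must be allowed by 'high' (the
	 * allowed-1 settings), and every bit required by 'low' (the
	 * required-1 settings) must be set in 'control'.
	 */
	return (control & high) == control &&
	       (control & low) == low;
}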

> --
>
> Thanks,
>
> David
Paolo Bonzini April 4, 2017, 1:55 p.m. UTC | #8
On 04/04/2017 15:25, David Hildenbrand wrote:
> On 04.04.2017 15:09, David Hildenbrand wrote:
>>
>>>>> +     if (enable_pml) {
>>>>> +             /*
>>>>> +              * Conceptually we want to copy the PML address and index from
>>>>> +              * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
>>>>> +              * since we always flush the log on each vmexit, this happens
>>>>
>>>> we == KVM running in g2?
>>>>
>>>> If so, other hypervisors might handle this differently.
>>>
>>> No, we as KVM in L0. Hypervisors running in L1 do not see PML at all,
>>> this is L0-only code.
>>
>> Okay, was just confused why we enable PML for our nested guest (L2)
>> although not supported/enabled for guest hypervisors (L1). I would have
>> guessed that it is to be kept disabled completely for nested guests
>> (!SECONDARY_EXEC_ENABLE_PML).
>>
>> But I assume that this is a mysterious detail of the MMU code I still have
>> to look into in detail.
>>
> 
> So for secondary exec controls we:
> 
> 1. enable almost any exec control enabled also for our L1 (except 4 of
> them)
> -> slightly scary, but I hope somebody thought well of this
> 2. blindly copy over whatever L2 gave us

You mean L1 here.  I am also not sure if you mean:

- blindly copy to vmcs02 whatever L1 gave us

- or, fill vmcs02 with vmcs01 contents, blindly ignoring whatever L1 gave us

but I can explain both.

1) As Ladi said, most VMCS fields are checked already earlier

2) Some features are not available to the L1 hypervisor, or they are
emulated by KVM.  When this happens, the relevant fields aren't copied
from vmcs12 to vmcs02.  An example of an emulated feature is the preemption
timer; an example of an unavailable feature is PML.

In fact, when we implement nested PML it will not use hardware PML;
rather it will be implemented by the KVM MMU.  Therefore it will still
be okay to overwrite these two fields and to process PML vmexits in L0.
Whenever the MMU will set a dirty bit, it will also write to the dirty
page log and possibly trigger an L1 PML vmexit.  But PML vmexits for L0
and L1 will be completely different---L0's come from the processor
while L1's are injected by the parent hypervisor.
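
For reference, L0's handling of a hardware PML-full exit is already trivial, because the buffer is drained at the start of every vmexit; a condensed, paraphrased sketch of handle_pml_full() (not the exact tree contents):

static int handle_pml_full(struct kvm_vcpu *vcpu)
{
	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);

	trace_kvm_pml_full(vcpu->vcpu_id);

	/*
	 * If the buffer filled up while an IRET was unblocking NMIs,
	 * restore the "blocked by NMI" state before the next VM entry.
	 */
	if (!(to_vmx(vcpu)->nested.nested_run_pending) &&
	    (exit_qualification & INTR_INFO_UNBLOCK_NMI))
		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
			      GUEST_INTR_STATE_NMI);

	/*
	 * The log was already flushed at the beginning of the vmexit path,
	 * so there is nothing else to do; resume the guest.
	 */
	return 1;
}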

> Especially if I am not wrong:
> 
> PML available on HW but disabled by setting "enable_pml = 0".
> L1 blindly enabling PML for L2.
>
> We now run our vmcs02 with SECONDARY_EXEC_ENABLE_PML without pml regions
> being set up.
> 
> Am I missing a whitelist somewhere? I hope so. Such things should always
> have whitelists.

Does the above explain it?

Paolo
David Hildenbrand April 4, 2017, 2:22 p.m. UTC | #9
>>
>> So for secondary exec controls we:
>>
>> 1. enable almost any exec control enabled also for our L1 (except 4 of
>> them)
>> -> slightly scary, but I hope somebody thought well of this
>> 2. blindly copy over whatever L2 gave us
> 
> You mean L1 here.  I am also not sure if you mean:

I keep messing up L0/L1/... as the terminology I am used to from s390x was
different (L0 == LPAR). But you had the right intention.

> 
> - blindly copy to vmcs02 whatever L1 gave us
> 
> - or, fill vmcs02 with vmcs01 contents, blindly ignoring whatever L1 gave us
> 
> but I can explain both.

It was actually a mixture of both :)

> 
> 1) As Ladi said, most VMCS fields are checked already earlier

It took me longer than it should have to understand how these checks work.
But this was what I was looking for.

nested_vmx_secondary_ctls_high == whitelist as Ladi pointed out (bits
that may be set). And PML should never be set.

> 
> 2) Some features are not available to the L1 hypervisor, or they are
> emulated by KVM.  When this happens, the relevant fields aren't copied
> from vmcs12 to vmcs02.  An example of emulated feature is the preemption
> timer; an example of unavailable feature is PML.
> 
> In fact, when we implement nested PML it will not use hardware PML;
> rather it will be implemented by the KVM MMU.  Therefore it will still
> be okay to overwrite these two fields and to process PML vmexits in L0.
> Whenever the MMU will set a dirty bit, it will also write to the dirty
> page log and possibly trigger an L1 PML vmexit.  But PML vmexits for L0
> and L1 will be completely different---L0's come from the processor
> while L1's are injected by the parent hypervisor.
> 
>> Especially if I am not wrong:
>>
>> PML available on HW but disabled by setting "enable_pml = 0".
>> L1 blindly enabling PML for L2.
>>
>> We now run our vmcs02 with SECONDARY_EXEC_ENABLE_PML without pml regions
>> being set up.
>>
>> Am I missing a whitelist somewhere? I hope so. Such things should always
>> have whitelists.
> 
> Does the above explain it?

Yes, perfectly well, thanks a lot!

> 
> Paolo
>
Radim Krčmář April 5, 2017, 2:49 p.m. UTC | #10
2017-04-04 14:18+0200, Ladi Prosek:
> L2 was running with uninitialized PML fields which led to incomplete
> dirty bitmap logging. This manifested as all kinds of subtle erratic
> behavior of the nested guest.
> 
> Fixes: 843e4330573c ("KVM: VMX: Add PML support in VMX")
> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
> ---

Applied to kvm/master, thanks.

(I should get a newer test machine ...)

>  arch/x86/kvm/vmx.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 2ee00db..f47d701 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -10267,6 +10267,18 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  
>  	}
>  
> +	if (enable_pml) {
> +		/*
> +		 * Conceptually we want to copy the PML address and index from
> +		 * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
> +		 * since we always flush the log on each vmexit, this happens
> +		 * to be equivalent to simply resetting the fields in vmcs02.
> +		 */
> +		ASSERT(vmx->pml_pg);
> +		vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
> +		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
> +	}
> +
>  	if (nested_cpu_has_ept(vmcs12)) {
>  		kvm_mmu_unload(vcpu);
>  		nested_ept_init_mmu_context(vcpu);
> -- 
> 2.9.3
>

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2ee00db..f47d701 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -10267,6 +10267,18 @@  static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 
 	}
 
+	if (enable_pml) {
+		/*
+		 * Conceptually we want to copy the PML address and index from
+		 * vmcs01 here, and then back to vmcs01 on nested vmexit. But,
+		 * since we always flush the log on each vmexit, this happens
+		 * to be equivalent to simply resetting the fields in vmcs02.
+		 */
+		ASSERT(vmx->pml_pg);
+		vmcs_write64(PML_ADDRESS, page_to_phys(vmx->pml_pg));
+		vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+	}
+
 	if (nested_cpu_has_ept(vmcs12)) {
 		kvm_mmu_unload(vcpu);
 		nested_ept_init_mmu_context(vcpu);