[17/15] KVM: X86: Ensure pae_root to be reconstructed for shadow paging if the guest PDPTEs is changed

Message ID 20211111144634.88972-1-jiangshanlai@gmail.com (mailing list archive)
State New, archived
Series KVM: X86: Fix and clean up for register caches

Commit Message

Lai Jiangshan Nov. 11, 2021, 2:46 p.m. UTC
From: Lai Jiangshan <laijs@linux.alibaba.com>

For shadow paging, the pae_root needs to be reconstructed before the
coming VMENTER if the guest PDPTEs are changed.

But not all paths that call load_pdptrs() will cause the pae_root to be
reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
are used to trigger the later reconstruction.

The commit d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and
CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
when changing CR0.CD and CR0.NW.

The commit 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current
PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
load_pdptrs() when rewriting the CR3 with the same value.

The commit a91a7c709600 ("KVM: X86: Don't reset mmu context when
toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
load_pdptrs() when changing CR4.PGE.

Normally, the guest doesn't change the PDPTEs and then perform only the
above operations without also touching other bits that would force
pae_root to be reconstructed.  Guests like Linux keep the PDPTEs
unchanged for the lifetime of each page table.

Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/x86.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)
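
For reference, the pre-patch update path in load_pdptrs() reads roughly as
follows (reconstructed from the diff at the bottom of this page; simplified):

	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
		/* Ensure the dirty PDPTEs to be loaded. */
		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
	}

With shadow paging, nothing on this path frees the current pae_root, so a
stale pae_root can survive into the coming VMENTER.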

Comments

Lai Jiangshan Nov. 23, 2021, 9:34 a.m. UTC | #1
Hello, Paolo

Any thoughts/concerns about this one?

Thanks
Lai


On 2021/11/11 22:46, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For shadow paging, the pae_root needs to be reconstructed before the
> coming VMENTER if the guest PDPTEs are changed.
> 
> But not all paths that call load_pdptrs() will cause the pae_root to be
> reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
> are used to trigger the later reconstruction.
> 
> The commit d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and
> CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
> when changing CR0.CD and CR0.NW.
> 
> The commit 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current
> PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
> load_pdptrs() when rewriting the CR3 with the same value.
> 
> The commit a91a7c709600 ("KVM: X86: Don't reset mmu context when
> toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
> load_pdptrs() when changing CR4.PGE.
> 
> Normally, the guest doesn't change the PDPTEs and then perform only the
> above operations without also touching other bits that would force
> pae_root to be reconstructed.  Guests like Linux keep the PDPTEs
> unchanged for the lifetime of each page table.
> 
> Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
> Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
> Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/x86.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0176eaa86a35..cfba337e46ab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> -		/* Ensure the dirty PDPTEs to be loaded. */
> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		/*
> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> +		 * enabled or pae_root to be reconstructed for shadow paging.
> +		 */
> +		if (tdp_enabled)
> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		else
> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
>   	}
>   	vcpu->arch.pdptrs_from_userspace = false;
>   
>
Sean Christopherson Dec. 8, 2021, 12:15 a.m. UTC | #2
On Thu, Nov 11, 2021, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> For shadow paging, the pae_root needs to be reconstructed before the
> coming VMENTER if the guest PDPTEs are changed.
> 
> But not all paths that call load_pdptrs() will cause the pae_root to be
> reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
> are used to trigger the later reconstruction.
> 
> The commit d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and
> CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
> when changing CR0.CD and CR0.NW.
> 
> The commit 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current
> PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
> load_pdptrs() when rewriting the CR3 with the same value.

This isn't accurate, prior to that commit KVM wasn't guaranteed to do
kvm_mmu_free_roots() if it got a hit on the current CR3 or if a previous CR3 in
the cache matched the new CR3 (the "cache" has done some odd things in the past).

So I think this particular flavor would be:

  Fixes: 7c390d350f8b ("kvm: x86: Add fast CR3 switch code path")

> The commit a91a7c709600 ("KVM: X86: Don't reset mmu context when
> toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
> load_pdptrs() when changing CR4.PGE.
> 
> Normally, the guest doesn't change the PDPTEs and then perform only the
> above operations without also touching other bits that would force
> pae_root to be reconstructed.  Guests like Linux keep the PDPTEs
> unchanged for the lifetime of each page table.
> 
> Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
> Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
> Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>  arch/x86/kvm/x86.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0176eaa86a35..cfba337e46ab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>  	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>  		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>  		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> -		/* Ensure the dirty PDPTEs to be loaded. */
> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		/*
> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> +		 * enabled or pae_root to be reconstructed for shadow paging.
> +		 */
> +		if (tdp_enabled)
> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> +		else
> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);

Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
of vcpu->arch.mmu.

To avoid a dependency on the previous patch, I think it makes sense to have this be:

	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);

before the memcpy().

Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
PDPTRs are unchanged with respect to the MMU is safe.
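
Applying that, the update path would read roughly like this (a sketch
assembled from the snippets above, not a tested patch):

	/* Free the stale pae_root before mmu->pdptrs is overwritten. */
	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);

	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
	}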
Lai Jiangshan Dec. 8, 2021, 4 a.m. UTC | #3
On 2021/12/8 08:15, Sean Christopherson wrote:
>>
>> The commit 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current
>> PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
>> load_pdptrs() when rewriting the CR3 with the same value.
> 
> This isn't accurate, prior to that commit KVM wasn't guaranteed to do
> kvm_mmu_free_roots() if it got a hit on the current CR3 or if a previous CR3 in
> the cache matched the new CR3 (the "cache" has done some odd things in the past).
> 
> So I think this particular flavor would be:
> 
>    Fixes: 7c390d350f8b ("kvm: x86: Add fast CR3 switch code path")

If the guest is 32-bit, fast_cr3_switch() always returns false,
kvm_mmu_free_roots() is always called, and no CR3 goes into prev_root.

And since 21823fbda552, fast_cr3_switch() and kvm_mmu_free_roots() are
both skipped when CR3 is unchanged.
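
In pseudo-code (illustrative only, not the literal kernel flow):

	/* Before 21823fbda552, on every MOV CR3 from a 32-bit guest: */
	if (!fast_cr3_switch(vcpu, new_cr3))	/* always false for 32-bit guests */
		kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);

	/* From 21823fbda552 on, rewriting CR3 with the same value: */
	if (new_cr3 == kvm_read_cr3(vcpu))
		return 0;	/* both fast_cr3_switch() and kvm_mmu_free_roots() skipped */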

> 
>> The commit a91a7c709600 ("KVM: X86: Don't reset mmu context when
>> toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
>> load_pdptrs() when changing CR4.PGE.
>>
>> Normally, the guest doesn't change the PDPTEs and then perform only the
>> above operations without also touching other bits that would force
>> pae_root to be reconstructed.  Guests like Linux keep the PDPTEs
>> unchanged for the lifetime of each page table.
>>
>> Fixes: d81135a57aa6 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
>> Fixes: 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
>> Fixes: a91a7c709600 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
>> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
>> ---
>>   arch/x86/kvm/x86.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 0176eaa86a35..cfba337e46ab 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>>   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>>   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>>   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
>> -		/* Ensure the dirty PDPTEs to be loaded. */
>> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		/*
>> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
>> +		 * enabled or pae_root to be reconstructed for shadow paging.
>> +		 */
>> +		if (tdp_enabled)
>> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		else
>> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> 
> Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> of vcpu->arch.mmu.

@mmu is the "guest mmu" (vcpu->arch.walk_mmu), which is used to walk the
guest page tables, including loading the PDPTRs.

vcpu->arch.mmu is the mmu the host constructs for shadow paging or TDP,
and it is used for host-side management, including kvm_mmu_free_roots().

Even though they are the same pointer now for !tdp, the meanings are
different.  I prefer to distinguish them even before kvm_mmu is split
into a guest mmu (vcpu->arch.walk_mmu) and a host-constructed mmu
(vcpu->arch.mmu).
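
Schematically (a simplified sketch; only the two fields in question shown):

	struct kvm_vcpu_arch {
		...
		struct kvm_mmu *mmu;		/* mmu the host constructs: shadow or TDP */
		struct kvm_mmu *walk_mmu;	/* mmu used to walk guest page tables */
		...
	};

For !tdp they point at the same struct today, but the roles differ.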

(I once searched all the undistinguished usages of kvm_mmu *mmu and
found a bug; see "Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()".)

I think Paolo is doing the splitting; otherwise I would take the job,
because I have some pending patches that depend on it.

> 
> To avoid a dependency on the previous patch, I think it makes sense to have this be:
> 
> 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> 

Yes, it is a good idea to add this first.

Thanks for review and suggestion.
Lai
Sean Christopherson Dec. 8, 2021, 3:29 p.m. UTC | #4
On Wed, Dec 08, 2021, Lai Jiangshan wrote:
> 
> 
> On 2021/12/8 08:15, Sean Christopherson wrote:
> > > 
> > > The commit 21823fbda552 ("KVM: x86: Invalidate all PGDs for the current
> > > PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
> > > load_pdptrs() when rewriting the CR3 with the same value.
> > 
> > This isn't accurate, prior to that commit KVM wasn't guaranteed to do
> > kvm_mmu_free_roots() if it got a hit on the current CR3 or if a previous CR3 in
> > the cache matched the new CR3 (the "cache" has done some odd things in the past).
> > 
> > So I think this particular flavor would be:
> > 
> >    Fixes: 7c390d350f8b ("kvm: x86: Add fast CR3 switch code path")
> 
> If the guest is 32-bit, fast_cr3_switch() always returns false,
> kvm_mmu_free_roots() is always called, and no CR3 goes into prev_root.

Oh, duh, PAE paging.

> > > +		/*
> > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > +		 */
> > > +		if (tdp_enabled)
> > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > +		else
> > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > 
> > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > of vcpu->arch.mmu.
> 
> @mmu is the "guest mmu" (vcpu->arch.walk_mmu), which is used to walk the
> guest page tables, including loading the PDPTRs.
> 
> vcpu->arch.mmu is the mmu the host constructs for shadow paging or TDP,
> and it is used for host-side management, including kvm_mmu_free_roots().
> 
> Even though they are the same pointer now for !tdp, the meanings are
> different.  I prefer to distinguish them even before kvm_mmu is split
> into a guest mmu (vcpu->arch.walk_mmu) and a host-constructed mmu
> (vcpu->arch.mmu).

I see where you're coming from.  I was viewing it from the perspective of,
"they're the same thing for shadow paging, why reread mmu?".

I'm ok with explicitly referencing arch.mmu, but maybe add a comment?
Paolo Bonzini Dec. 9, 2021, 10:46 p.m. UTC | #5
On 12/8/21 01:15, Sean Christopherson wrote:
>> @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
>>   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
>>   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
>>   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
>> -		/* Ensure the dirty PDPTEs to be loaded. */
>> -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		/*
>> +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
>> +		 * enabled or pae_root to be reconstructed for shadow paging.
>> +		 */
>> +		if (tdp_enabled)
>> +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
>> +		else
>> +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> of vcpu->arch.mmu.

In kvm/next actually there's no mmu parameter to load_pdptrs, so it's 
okay to keep vcpu->arch.mmu.
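
(For reference, the kvm/next signature is roughly:

	int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3);

with the walk mmu presumably fetched inside the function.)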

> To avoid a dependency on the previous patch, I think it makes sense to have this be:
> 
> 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> 
> before the memcpy().
> 
> Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> PDPTRs are unchanged with respect to the MMU is safe.

Do you disagree that there's already an invariant that the PDPTRs can 
only be dirty if KVM_REQ_LOAD_MMU_PGD is pending---and therefore a previous change
to the PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?  This is 
opposed to the guest TLB flush due to MOV CR3; that one has to be done 
unconditionally for PAE paging, and it is handled separately within 
kvm_set_cr3.
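
Expressed as an assertion (illustrative only; helper names as in
kvm_cache_regs.h and kvm_host.h):

	/* PDPTRs dirty implies a pending PGD load. */
	WARN_ON(kvm_register_is_dirty(vcpu, VCPU_EXREG_PDPTR) &&
		!kvm_test_request(KVM_REQ_LOAD_MMU_PGD, vcpu));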

Paolo
Sean Christopherson Dec. 10, 2021, 9:07 p.m. UTC | #6
On Thu, Dec 09, 2021, Paolo Bonzini wrote:
> On 12/8/21 01:15, Sean Christopherson wrote:
> > > @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
> > >   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
> > >   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> > >   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > > -		/* Ensure the dirty PDPTEs to be loaded. */
> > > -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > +		/*
> > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > +		 */
> > > +		if (tdp_enabled)
> > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > +		else
> > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > of vcpu->arch.mmuvcpu->arch.mmu.
> 
> In kvm/next actually there's no mmu parameter to load_pdptrs, so it's okay
> to keep vcpu->arch.mmu.
> 
> > To avoid a dependency on the previous patch, I think it makes sense to have this be:
> > 
> > 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> > 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> > 
> > before the memcpy().
> > 
> > Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> > PDPTRs are unchanged with respect to the MMU is safe.
> 
> Do you disagree that there's already an invariant that the PDPTRs can only
> be dirty if KVM_REQ_LOAD_MMU_PGD is pending---and therefore a previous change to the
> PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?

What I think is moot, because commit 24cd19a28cb7 ("KVM: X86: Update mmu->pdptrs
only when it is changed") breaks nested VMs with EPT in L0 and PAE shadow paging
in L2.  Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't
investigated how it breaks things, because why it's broken is secondary for me.

My primary concern is that we would even consider optimizing the PDPTR logic without
a mountain of evidence that any patch is correct for all scenarios.  We had to add
an entire ioctl() just to get PDPTRs functional.  This apparently wasn't validated
against a simple use case, let alone against things like migration with nested VMs,
multiple L2s, etc...
Sean Christopherson Dec. 10, 2021, 9:08 p.m. UTC | #7
On Fri, Dec 10, 2021, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Paolo Bonzini wrote:
> > On 12/8/21 01:15, Sean Christopherson wrote:
> > > > @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
> > > >   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
> > > >   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> > > >   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > > > -		/* Ensure the dirty PDPTEs to be loaded. */
> > > > -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		/*
> > > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > > +		 */
> > > > +		if (tdp_enabled)
> > > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		else
> > > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > > of vcpu->arch.mmu.
> > 
> > In kvm/next actually there's no mmu parameter to load_pdptrs, so it's okay
> > to keep vcpu->arch.mmu.
> > 
> > > To avoid a dependency on the previous patch, I think it makes sense to have this be:
> > > 
> > > 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> > > 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> > > 
> > > before the memcpy().
> > > 
> > > Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> > > PDPTRs are unchanged with respect to the MMU is safe.
> > 
> > Do you disagree that there's already an invariant that the PDPTRs can only
> > be dirty if KVM_REQ_LOAD_MMU_PGD is pending---and therefore a previous change to the
> > PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?
> 
> What I think is moot, because commit 24cd19a28cb7 ("KVM: X86: Update mmu->pdptrs
> only when it is changed") breaks nested VMs with EPT in L0 and PAE shadow paging
> in L2.  Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't

Doh, s/L2/L1

> investigated how it breaks things, because why it's broken is secondary for me.
> 
> My primary concern is that we would even consider optimizing the PDPTR logic without
> a mountain of evidence that any patch is correct for all scenarios.  We had to add
> an entire ioctl() just to get PDPTRs functional.  This apparently wasn't validated
> against a simple use case, let alone against things like migration with nested VMs,
> multiple L2s, etc...
Maxim Levitsky Dec. 11, 2021, 6:56 a.m. UTC | #8
On Fri, 2021-12-10 at 21:07 +0000, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Paolo Bonzini wrote:
> > On 12/8/21 01:15, Sean Christopherson wrote:
> > > > @@ -832,8 +832,14 @@ int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
> > > >   	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
> > > >   		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
> > > >   		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
> > > > -		/* Ensure the dirty PDPTEs to be loaded. */
> > > > -		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		/*
> > > > +		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
> > > > +		 * enabled or pae_root to be reconstructed for shadow paging.
> > > > +		 */
> > > > +		if (tdp_enabled)
> > > > +			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
> > > > +		else
> > > > +			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
> > > Shouldn't matter since it's legacy shadow paging, but @mmu should be used instead
> > > of vcpu->arch.mmu.
> > 
> > In kvm/next actually there's no mmu parameter to load_pdptrs, so it's okay
> > to keep vcpu->arch.mmu.
> > 
> > > To avoid a dependency on the previous patch, I think it makes sense to have this be:
> > > 
> > > 	if (!tdp_enabled && memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs)))
> > > 		kvm_mmu_free_roots(vcpu, mmu, KVM_MMU_ROOT_CURRENT);
> > > 
> > > before the memcpy().
> > > 
> > > Then we can decide independently if skipping the KVM_REQ_LOAD_MMU_PGD if the
> > > PDPTRs are unchanged with respect to the MMU is safe.
> > 
> > Do you disagree that there's already an invariant that the PDPTRs can only
> > be dirty if KVM_REQ_LOAD_MMU_PGD is pending---and therefore a previous change to the
> > PDPTRs would have triggered KVM_REQ_LOAD_MMU_PGD?
> 
> What I think is moot, because commit 24cd19a28cb7 ("KVM: X86: Update mmu->pdptrs
> only when it is changed") breaks nested VMs with EPT in L0 and PAE shadow paging
> in L2.  Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't
> investigated how it breaks things, because why it's broken is secondary for me.
> 
> My primary concern is that we would even consider optimizing the PDPTR logic without
> a mountain of evidence that any patch is correct for all scenarios.  We had to add
> an entire ioctl() just to get PDPTRs functional.  This apparently wasn't validated
> against a simple use case, let alone against things like migration with nested VMs,
> multiple L2s, etc...

I did validate the *SREGS2* ioctl against all the cases I could (like migration, EPT/NPT disabled, etc.).
I even started testing SMM to see how it affects PDPTRs, and patched SeaBIOS to use PAE paging.
I still could have missed something.

But note that qemu still doesn't use that ioctl (patch stuck in review).

Best regards,
	Maxim Levitsky


>
Paolo Bonzini Dec. 11, 2021, 8:22 a.m. UTC | #9
On 12/11/21 07:56, Maxim Levitsky wrote:
>> This apparently wasn't validated against a simple use case, let
>> alone against things like migration with nested VMs, multiple L2s,
>> etc...
> 
> I did validate the *SREGS2* ioctl against all the cases I could (like
> migration, EPT/NPT disabled, etc.).  I even started testing SMM to see
> how it affects PDPTRs, and patched SeaBIOS to use PAE paging.  I still
> could have missed something.

Don't worry, I think Sean was talking about patch 16 and specifically
digging at me (who deserved it completely).

Paolo
Sean Christopherson Dec. 13, 2021, 4:54 p.m. UTC | #10
On Sat, Dec 11, 2021, Paolo Bonzini wrote:
> On 12/11/21 07:56, Maxim Levitsky wrote:
> > > This apparently wasn't validated against a simple use case, let
> > > alone against things like migration with nested VMs, multiple L2s,
> > > etc...
> > 
> > I did validate the *SREGS2* ioctl against all the cases I could (like
> > migration, EPT/NPT disabled, etc.).  I even started testing SMM to see
> > how it affects PDPTRs, and patched SeaBIOS to use PAE paging.  I still
> > could have missed something.
> 
> Don't worry, I think Sean was talking about patch 16 and specifically
> digging at me (who deserved it completely).

Yes, patch 16.  My goal wasn't to dig at anyone, I just wanted to dramatically
emphasize how ridiculously fragile and complex the PDPTR crud is due to the number
of edge cases.

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0176eaa86a35..cfba337e46ab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -832,8 +832,14 @@  int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
 	if (memcmp(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs))) {
 		memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
 		kvm_register_mark_dirty(vcpu, VCPU_EXREG_PDPTR);
-		/* Ensure the dirty PDPTEs to be loaded. */
-		kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+		/*
+		 * Ensure the dirty PDPTEs to be loaded for VMX with EPT
+		 * enabled or pae_root to be reconstructed for shadow paging.
+		 */
+		if (tdp_enabled)
+			kvm_make_request(KVM_REQ_LOAD_MMU_PGD, vcpu);
+		else
+			kvm_mmu_free_roots(vcpu, vcpu->arch.mmu, KVM_MMU_ROOT_CURRENT);
 	}
 	vcpu->arch.pdptrs_from_userspace = false;