diff mbox series

[2/4] KVM: X86: Cache CR3 in prev_roots when PCID is disabled

Message ID 20211019110154.4091-3-jiangshanlai@gmail.com (mailing list archive)
State New, archived
Headers show
Series KVM: X86: Improve guest TLB flushing | expand

Commit Message

Lai Jiangshan Oct. 19, 2021, 11:01 a.m. UTC
From: Lai Jiangshan <laijs@linux.alibaba.com>

The commit 21823fbda5522 ("KVM: x86: Invalidate all PGDs for the
current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific
PCID and in the case of PCID is disabled, it includes all PGDs in the
prev_roots and the commit made prev_roots totally unused in this case.

Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1
before the said commit:
	(CR4.PCIDE=0, CR3=cr3_a, the page for the guest
	 kernel is global, cr3_b is cached in prev_roots)

	modify the user part of cr3_b
		the shadow root of cr3_b is unsync in kvm
	INVPCID single context
		the guest expects the TLB is clean for PCID=0
	change CR4.PCIDE 0 -> 1
	switch to cr3_b with PCID=0,NOFLUSH=1
		No sync in kvm, cr3_b is still unsync in kvm
	return to the user part (of cr3_b)
		the user accesses to wrong page

It is a very unlikely case, but it shows that virtualizing guest TLB in
prev_roots is not safe in this case and the said commit did fix the
problem.

But the said commit also disabled caching CR3 in prev_roots when PCID
is disabled and NOT all CPUs have PCID, especially the PCID support
for AMD CPUs is kind of recent.  To restore the original optimization,
we have to enable caching CR3 without re-introducing problems.

Actually, in short, the said commit just ensures prev_roots not part of
the virtualized TLB.  So this change caches CR3 in prev_roots, and
ensures prev_roots not part of the virtualized TLB by always flushing
the virtualized TLB when CR3 is switched from prev_roots to current
(it is already the current behavior) and by freeing prev_roots when
CR4.PCIDE is changed 0 -> 1.

Anyway:
PCID enabled: vTLB includes root_hpa, prev_roots and hardware TLB.
PCID disabled: vTLB includes root_hpa and hardware TLB, no prev_roots.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
---
 arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

Comments

Paolo Bonzini Oct. 21, 2021, 5:43 p.m. UTC | #1
On 19/10/21 13:01, Lai Jiangshan wrote:
> From: Lai Jiangshan <laijs@linux.alibaba.com>
> 
> The commit 21823fbda5522 ("KVM: x86: Invalidate all PGDs for the
> current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific
> PCID and in the case of PCID is disabled, it includes all PGDs in the
> prev_roots and the commit made prev_roots totally unused in this case.
> 
> Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1
> before the said commit:
> 	(CR4.PCIDE=0, CR3=cr3_a, the page for the guest
> 	 kernel is global, cr3_b is cached in prev_roots)
> 
> 	modify the user part of cr3_b
> 		the shadow root of cr3_b is unsync in kvm
> 	INVPCID single context
> 		the guest expects the TLB is clean for PCID=0
> 	change CR4.PCIDE 0 -> 1
> 	switch to cr3_b with PCID=0,NOFLUSH=1
> 		No sync in kvm, cr3_b is still unsync in kvm
> 	return to the user part (of cr3_b)
> 		the user accesses to wrong page
> 
> It is a very unlikely case, but it shows that virtualizing guest TLB in
> prev_roots is not safe in this case and the said commit did fix the
> problem.
> 
> But the said commit also disabled caching CR3 in prev_roots when PCID
> is disabled and NOT all CPUs have PCID, especially the PCID support
> for AMD CPUs is kind of recent.  To restore the original optimization,
> we have to enable caching CR3 without re-introducing problems.
> 
> Actually, in short, the said commit just ensures prev_roots not part of
> the virtualized TLB.  So this change caches CR3 in prev_roots, and
> ensures prev_roots not part of the virtualized TLB by always flushing
> the virtualized TLB when CR3 is switched from prev_roots to current
> (it is already the current behavior) and by freeing prev_roots when
> CR4.PCIDE is changed 0 -> 1.
> 
> Anyway:
> PCID enabled: vTLB includes root_hpa, prev_roots and hardware TLB.
> PCID disabled: vTLB includes root_hpa and hardware TLB, no prev_roots.
> 
> Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
> ---
>   arch/x86/kvm/x86.c | 32 ++++++++++++++++++++++++++++++--
>   1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 06169ed08db0..13df3ca88e09 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1022,10 +1022,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
>   
>   void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
>   {
> +	/*
> +	 * If any role bit is changed, the MMU needs to be reset.
> +	 *
> +	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the guest
> +	 * TLB per SDM, but the virtualized TLB doesn't include prev_roots when
> +	 * CR4.PCIDE is 0, so the prev_roots has to be freed to avoid later
> +	 * resuing without explicit flushing.
> +	 * If CR4.PCIDE is changed 1 -> 0, there is required to flush the guest
> +	 * TLB and KVM_REQ_MMU_RELOAD is fit for the both cases.  Although
> +	 * KVM_REQ_MMU_RELOAD is slow, changing CR4.PCIDE is a rare case.

          * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
          * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
          * according to the SDM; however, stale prev_roots could be reused
          * reused incorrectly by MOV to CR3 with NOFLUSH=1, so we free them
          * all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
          * is slow, but changing CR4.PCIDE is a rare case.

> +	 * If CR4.PGE is changed, there is required to just flush the guest TLB.
> +	 *
> +	 * Note: reseting MMU covers KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_RELOAD
> +	 * covers KVM_REQ_TLB_FLUSH_GUEST, so "else if" is used here and the
> +	 * check for later cases are skipped if the check for the preceding
> +	 * case is matched.

          * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
          * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
          * the usage of "else if".

> +	 */
>   	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
>   		kvm_mmu_reset_context(vcpu);
> -	else if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
> -		 (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
> +	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> +		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> +	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
>   		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>   }
>   EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
> @@ -1093,6 +1112,15 @@ static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
>   		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
>   	}
>   
> +	/*
> +	 * If PCID is disabled, there is no need to free prev_roots even the
> +	 * PCIDs for them are also 0.  The prev_roots are just not included
> +	 * in the "clean" virtualized TLB and a resync will happen anyway
> +	 * before switching to any other CR3.
> +	 */

         /*
          * If PCID is disabled, there is no need to free prev_roots even if the
          * PCIDs for them are also 0, because all moves to CR3 flush the TLB
          * with PCID=0.
          */

> +	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
> +		return;
> +
>   	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>   		if (kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd) == pcid)
>   			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);
> 


Can you confirm the above comments are accurate?

Paolo
Lai Jiangshan Oct. 22, 2021, 2:11 a.m. UTC | #2
On Fri, Oct 22, 2021 at 1:43 AM Paolo Bonzini <pbonzini@redhat.com> wrote:

>
>           * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
>           * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
>           * according to the SDM; however, stale prev_roots could be reused
>           * reused incorrectly by MOV to CR3 with NOFLUSH=1, so we free them
>           * all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
>           * is slow, but changing CR4.PCIDE is a rare case.
>

There is a double "reused" separated by "\".

>
>
> Can you confirm the above comments are accurate?
>

Yes, they are better and consistent with what I meant, only one redundant
"reused" in the comments.

thanks
Lai
diff mbox series

Patch

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 06169ed08db0..13df3ca88e09 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1022,10 +1022,29 @@  EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
 
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
+	/*
+	 * If any role bit is changed, the MMU needs to be reset.
+	 *
+	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the guest
+	 * TLB per SDM, but the virtualized TLB doesn't include prev_roots when
+	 * CR4.PCIDE is 0, so the prev_roots has to be freed to avoid later
+	 * resuing without explicit flushing.
+	 * If CR4.PCIDE is changed 1 -> 0, there is required to flush the guest
+	 * TLB and KVM_REQ_MMU_RELOAD is fit for the both cases.  Although
+	 * KVM_REQ_MMU_RELOAD is slow, changing CR4.PCIDE is a rare case.
+	 *
+	 * If CR4.PGE is changed, there is required to just flush the guest TLB.
+	 *
+	 * Note: reseting MMU covers KVM_REQ_MMU_RELOAD and KVM_REQ_MMU_RELOAD
+	 * covers KVM_REQ_TLB_FLUSH_GUEST, so "else if" is used here and the
+	 * check for later cases are skipped if the check for the preceding
+	 * case is matched.
+	 */
 	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
-	else if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
-		 (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
+	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
+	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
@@ -1093,6 +1112,15 @@  static void kvm_invalidate_pcid(struct kvm_vcpu *vcpu, unsigned long pcid)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
 	}
 
+	/*
+	 * If PCID is disabled, there is no need to free prev_roots even the
+	 * PCIDs for them are also 0.  The prev_roots are just not included
+	 * in the "clean" virtualized TLB and a resync will happen anyway
+	 * before switching to any other CR3.
+	 */
+	if (!kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
+		return;
+
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		if (kvm_get_pcid(vcpu, mmu->prev_roots[i].pgd) == pcid)
 			roots_to_free |= KVM_MMU_ROOT_PREVIOUS(i);