
[12/12] KVM: x86: do not unload MMU roots on all role changes

Message ID 20220209170020.1775368-13-pbonzini@redhat.com (mailing list archive)
State New, archived
Series KVM: MMU: do not unload MMU roots on all role changes

Commit Message

Paolo Bonzini Feb. 9, 2022, 5 p.m. UTC
kvm_mmu_reset_context is called on all role changes and right now it
calls kvm_mmu_unload.  With the legacy MMU this is a relatively cheap
operation; the previous PGDs remain in the hash table and are picked
up immediately on the next page fault.  With the TDP MMU, however, the
roots are thrown away for good and a full rebuild of the page tables is
necessary, which is many times more expensive.

Fortunately, throwing away the roots is not necessary except when
the manual says a TLB flush is required:

- changing CR0.PG from 1 to 0 (because it flushes the TLB according to
  the x86 architecture specification)

- changing CPUID (which changes the interpretation of page tables in
  ways not reflected by the role).

- changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)

Except for these cases, once the MMU has updated the CPU/MMU roles
and metadata it is enough to force-reload the current value of CR3.
KVM will look up the cached roots for an entry with the right role and
PGD, and only if the cache misses will a new root be created.
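
As an illustration of that lookup, here is a simplified sketch (the
struct and helper names are made up for the example; the real logic
lives in kvm_mmu_new_pgd() and its helpers):

struct cached_root {
	u64 pgd;
	u32 role;
	bool valid;
};

static struct cached_root *find_cached_root(struct cached_root *cache,
					    int nr, u64 pgd, u32 role)
{
	int i;

	for (i = 0; i < nr; i++) {
		/* A hit requires both the PGD and the role to match. */
		if (cache[i].valid && cache[i].pgd == pgd &&
		    cache[i].role == role)
			return &cache[i];
	}

	/* Cache miss: the caller must allocate and build a new root. */
	return NULL;
}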

Measuring with vmexit.flat from kvm-unit-tests shows the following
improvement:

             TDP         legacy       shadow
   before    46754       5096         5150
   after     4879        4875         5006

which is for very small page tables.  The impact is, however, much larger
when running as an L1 hypervisor, because the new page tables cause
extra work for L0 to shadow them.

Reported-by: Brad Spengler <spender@grsecurity.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c |  7 ++++---
 arch/x86/kvm/x86.c     | 27 ++++++++++++++++++---------
 2 files changed, 22 insertions(+), 12 deletions(-)

Comments

Nikunj A. Dadhania Feb. 11, 2022, 9:08 a.m. UTC | #1
On 2/9/2022 10:30 PM, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0d3646535cc5..97c4f5fc291f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  		kvm_async_pf_hash_reset(vcpu);
>  	}
>  
> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */

                                      ^^ CR0.PG here ?

> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> +			kvm_mmu_unload(vcpu);
>  		kvm_mmu_reset_context(vcpu);
> +	}

Regards
Nikunj
Sean Christopherson Feb. 11, 2022, 6:48 p.m. UTC | #2
On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> kvm_mmu_reset_context is called on all role changes and right now it
> calls kvm_mmu_unload.  With the legacy MMU this is a relatively cheap
> operation; the previous PGDs remains in the hash table and is picked
> up immediately on the next page fault.  With the TDP MMU, however, the
> roots are thrown away for good and a full rebuild of the page tables is
> necessary, which is many times more expensive.
> 
> Fortunately, throwing away the roots is not necessary except when
> the manual says a TLB flush is required:
> 
> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
>   the x86 architecture specification)
> 
> - changing CPUID (which changes the interpretation of page tables in
>   ways not reflected by the role).
> 
> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
> 
> Except for these cases, once the MMU has updated the CPU/MMU roles
> and metadata it is enough to force-reload the current value of CR3.
> KVM will look up the cached roots for an entry with the right role and
> PGD, and only if the cache misses a new root will be created.
> 
> Measuring with vmexit.flat from kvm-unit-tests shows the following
> improvement:
> 
>              TDP         legacy       shadow
>    before    46754       5096         5150
>    after     4879        4875         5006
> 
> which is for very small page tables.  The impact is however much larger
> when running as an L1 hypervisor, because the new page tables cause
> extra work for L0 to shadow them.
> 
> Reported-by: Brad Spengler <spender@grsecurity.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>  arch/x86/kvm/mmu/mmu.c |  7 ++++---
>  arch/x86/kvm/x86.c     | 27 ++++++++++++++++++---------
>  2 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 38b40ddcaad7..dbd4e98ba426 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  {
>  	/*
> -	 * Invalidate all MMU roles to force them to reinitialize as CPUID
> -	 * information is factored into reserved bit calculations.
> +	 * Invalidate all MMU roles and roots to force them to reinitialize,
> +	 * as CPUID information is factored into reserved bit calculations.
>  	 *
>  	 * Correctly handling multiple vCPU models with respect to paging and
>  	 * physical address properties) in a single VM would require tracking
> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>  	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>  	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
> +	kvm_mmu_unload(vcpu);
>  	kvm_mmu_reset_context(vcpu);
>  
>  	/*
> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  
>  void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>  {
> -	kvm_mmu_unload(vcpu);
>  	kvm_init_mmu(vcpu);
> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);

This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
affected, e.g. SMM transitions, KVM_SET_SREG, etc...

Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
The call to kvm_mmu_new_pgd() is also 

To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}().  In
the future we can/should work on avoiding unload in all paths, but again, future
problem.

>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 0d3646535cc5..97c4f5fc291f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -873,8 +873,12 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  		kvm_async_pf_hash_reset(vcpu);
>  	}
>  
> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> +			kvm_mmu_unload(vcpu);

Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
respect to the changelog.  Please elaborate :-)

>  		kvm_mmu_reset_context(vcpu);
> +	}
>  
>  	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
>  	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1067,15 +1071,18 @@ void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
>  	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
>  	 * is slow, but changing CR4.PCIDE is a rare case.
>  	 *
> -	 * If CR4.PGE is changed, the guest TLB must be flushed.
> +	 * Setting SMEP also needs to flush the TLB, in addition to resetting
> +	 * the MMU.
>  	 *
> -	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> -	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> -	 * the usage of "else if".
> +	 * If CR4.PGE is changed, the guest TLB must be flushed.  Because
> +	 * the shadow MMU ignores global pages, this bit is not part of
> +	 * KVM_MMU_CR4_ROLE_BITS.
>  	 */
> -	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> +	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) {
>  		kvm_mmu_reset_context(vcpu);
> -	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> +		if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
> +			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);

This mishandles CR4.PGE.  Per the comment above, the if-elif-elif sequence relies
on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.

For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
doesn't emulate global pages.

This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.


---
 arch/x86/kvm/mmu/mmu.c |  4 ++--
 arch/x86/kvm/x86.c     | 42 +++++++++++++++++++++++++++++-------------
 2 files changed, 31 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e41834748d52..c477c519c784 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	/*
-	 * Invalidate all MMU roles to force them to reinitialize as CPUID
-	 * information is factored into reserved bit calculations.
+	 * Invalidate all MMU roles and roots to force them to reinitialize,
+	 * as CPUID information is factored into reserved bit calculations.
 	 *
 	 * Correctly handling multiple vCPU models with respect to paging and
 	 * physical address properties) in a single VM would require tracking
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 782dc9cd31d8..b8dad04301ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
 }
 EXPORT_SYMBOL_GPL(load_pdptrs);

+static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
+{
+	kvm_mmu_init(vcpu);
+	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
+}
+
 void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
 {
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
 		kvm_clear_async_pf_completion_queue(vcpu);
 		kvm_async_pf_hash_reset(vcpu);
+
+		/*
+		 * Clearing CR0.PG is architecturally defined as flushing the
+		 * TLB from the guest's perspective.
+		 */
+		if (!(cr0 & X86_CR0_PG))
+			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 	}

 	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
+		kvm_post_set_cr_reinit_mmu(vcpu);

 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
@@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
 void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
 {
 	/*
-	 * If any role bit is changed, the MMU needs to be reset.
-	 *
 	 * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
 	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
 	 * according to the SDM; however, stale prev_roots could be reused
 	 * incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
 	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
 	 * is slow, but changing CR4.PCIDE is a rare case.
-	 *
-	 * If CR4.PGE is changed, the guest TLB must be flushed.
-	 *
-	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
-	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
-	 * the usage of "else if".
 	 */
-	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
-		kvm_mmu_reset_context(vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+	if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
 		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
+		return;
+	}
+
+	/* If any role bit is changed, the MMU needs to be reinitialized. */
+	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
+		kvm_post_set_cr_reinit_mmu(vcpu);
+
+	/*
+	 * Setting SMEP or toggling PGE is architecturally defined as flushing
+	 * the TLB from the guest's perspective.  Note, because the shadow MMU
+	 * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
+	 */
+	if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
+	    ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr4);

base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
--
Paolo Bonzini Feb. 14, 2022, 4:34 p.m. UTC | #3
On 2/11/22 19:48, Sean Christopherson wrote:
> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>> kvm_mmu_reset_context is called on all role changes and right now it
>> calls kvm_mmu_unload.  With the legacy MMU this is a relatively cheap
>> operation; the previous PGDs remains in the hash table and is picked
>> up immediately on the next page fault.  With the TDP MMU, however, the
>> roots are thrown away for good and a full rebuild of the page tables is
>> necessary, which is many times more expensive.
>>
>> Fortunately, throwing away the roots is not necessary except when
>> the manual says a TLB flush is required:
>>
>> - changing CR0.PG from 1 to 0 (because it flushes the TLB according to
>>    the x86 architecture specification)
>>
>> - changing CPUID (which changes the interpretation of page tables in
>>    ways not reflected by the role).
>>
>> - changing CR4.SMEP from 0 to 1 (not doing so actually breaks access.c!)
>>
>> Except for these cases, once the MMU has updated the CPU/MMU roles
>> and metadata it is enough to force-reload the current value of CR3.
>> KVM will look up the cached roots for an entry with the right role and
>> PGD, and only if the cache misses a new root will be created.
>>
>> Measuring with vmexit.flat from kvm-unit-tests shows the following
>> improvement:
>>
>>               TDP         legacy       shadow
>>     before    46754       5096         5150
>>     after     4879        4875         5006
>>
>> which is for very small page tables.  The impact is however much larger
>> when running as an L1 hypervisor, because the new page tables cause
>> extra work for L0 to shadow them.
>>
>> Reported-by: Brad Spengler <spender@grsecurity.com>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>   arch/x86/kvm/mmu/mmu.c |  7 ++++---
>>   arch/x86/kvm/x86.c     | 27 ++++++++++++++++++---------
>>   2 files changed, 22 insertions(+), 12 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 38b40ddcaad7..dbd4e98ba426 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -5020,8 +5020,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>>   void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   {
>>   	/*
>> -	 * Invalidate all MMU roles to force them to reinitialize as CPUID
>> -	 * information is factored into reserved bit calculations.
>> +	 * Invalidate all MMU roles and roots to force them to reinitialize,
>> +	 * as CPUID information is factored into reserved bit calculations.
>>   	 *
>>   	 * Correctly handling multiple vCPU models with respect to paging and
>>   	 * physical address properties) in a single VM would require tracking
>> @@ -5034,6 +5034,7 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   	vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
>>   	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
>>   	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
>> +	kvm_mmu_unload(vcpu);
>>   	kvm_mmu_reset_context(vcpu);
>>   
>>   	/*
>> @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>>   
>>   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
>>   {
>> -	kvm_mmu_unload(vcpu);
>>   	kvm_init_mmu(vcpu);
>> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> 
> This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> affected, e.g. SMM transitions, KVM_SET_SREG, etc...

SMM exit does flush the TLB because RSM clears CR0.PG (I did check this 
:)).  SMM re-entry then does not need to flush.  But I don't think SMM 
exit should flush the TLB *for non-SMM roles*.

For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree 
it is certainly safer to keep it that way.

> Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> The call to kvm_mmu_new_pgd() is also

*white noise*

> To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}().  In
> the future we can/should work on avoiding unload in all paths, but again, future
> problem.

I disagree on this.  There aren't many calls to kvm_mmu_reset_context.

>>   
>> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
>> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
>> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
>> +			kvm_mmu_unload(vcpu);
> 
> Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
> respect to the changelog.  Please elaborate :-)

Yep, will do (the CR0.PG=0 case is similar to the CR0.PCIDE=0 case 
below).  Using kvm_mmu_unload() avoids loading a cached root just to 
throw it away immediately after, but I can change this to a new 
KVM_REQ_MMU_UPDATE_ROOT flag that does

	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
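
A minimal sketch of how that request could be serviced on the next
vCPU entry (KVM_REQ_MMU_UPDATE_ROOT is hypothetical and does not exist
in KVM today):

	/* Reuse a cached root for the current CR3 if one matches. */
	if (kvm_check_request(KVM_REQ_MMU_UPDATE_ROOT, vcpu))
		kvm_mmu_new_pgd(vcpu, kvm_read_cr3(vcpu));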

By the way, I have a possibly stupid question.  In kvm_set_cr3 (called 
e.g. from emulator_set_cr()) there is

  	if (cr3 != kvm_read_cr3(vcpu))
		kvm_mmu_new_pgd(vcpu, cr3);

What makes this work if mmu_is_nested(vcpu)?  Should this also have an 
"if (... & !tdp_enabled)"?

>> -	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
>> +		if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
>> +			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> 
> This mishandles CR4.PGE.  Per the comment above, the if-elif-elif sequence relies
> on kvm_mmu_reset_context being a superset of KVM_REQ_TLB_FLUSH_GUEST.
> 
> For both CR0 and CR4, I think we should disassociate the TLB flush logic from the
> MMU role logic, e.g. CR4.PGE _could_ be part of the role, but it's not because KVM
> doesn't emulate global pages.

Makes sense, yes.  It also needs to handle flushing the current PCID 
when changing CR4.PAE (previously done for "free" by 
kvm_mmu_reset_context), but I agree with the idea.

Paolo

> This is what I'm thinking, assuming CR0.PG 1=>0 really only needs a flush.
> 
> 
> ---
>   arch/x86/kvm/mmu/mmu.c |  4 ++--
>   arch/x86/kvm/x86.c     | 42 +++++++++++++++++++++++++++++-------------
>   2 files changed, 31 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e41834748d52..c477c519c784 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5041,8 +5041,8 @@ EXPORT_SYMBOL_GPL(kvm_init_mmu);
>   void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
>   {
>   	/*
> -	 * Invalidate all MMU roles to force them to reinitialize as CPUID
> -	 * information is factored into reserved bit calculations.
> +	 * Invalidate all MMU roles and roots to force them to reinitialize,
> +	 * as CPUID information is factored into reserved bit calculations.
>   	 *
>   	 * Correctly handling multiple vCPU models with respect to paging and
>   	 * physical address properties) in a single VM would require tracking
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 782dc9cd31d8..b8dad04301ee 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -863,15 +863,28 @@ int load_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr3)
>   }
>   EXPORT_SYMBOL_GPL(load_pdptrs);
> 
> +static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
> +{
> +	kvm_mmu_init(vcpu);
> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> +}
> +
>   void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
>   {
>   	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
>   		kvm_clear_async_pf_completion_queue(vcpu);
>   		kvm_async_pf_hash_reset(vcpu);
> +
> +		/*
> +		 * Clearing CR0.PG is architecturally defined as flushing the
> +		 * TLB from the guest's perspective.
> +		 */
> +		if (!(cr0 & X86_CR0_PG))
> +			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>   	}
> 
>   	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> -		kvm_mmu_reset_context(vcpu);
> +		kvm_post_set_cr_reinit_mmu(vcpu);
> 
>   	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
>   	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> @@ -1055,26 +1068,29 @@ EXPORT_SYMBOL_GPL(kvm_is_valid_cr4);
>   void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned long cr4)
>   {
>   	/*
> -	 * If any role bit is changed, the MMU needs to be reset.
> -	 *
>   	 * If CR4.PCIDE is changed 1 -> 0, the guest TLB must be flushed.
>   	 * If CR4.PCIDE is changed 0 -> 1, there is no need to flush the TLB
>   	 * according to the SDM; however, stale prev_roots could be reused
>   	 * incorrectly in the future after a MOV to CR3 with NOFLUSH=1, so we
>   	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
>   	 * is slow, but changing CR4.PCIDE is a rare case.
> -	 *
> -	 * If CR4.PGE is changed, the guest TLB must be flushed.
> -	 *
> -	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
> -	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
> -	 * the usage of "else if".
>   	 */
> -	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> -		kvm_mmu_reset_context(vcpu);
> -	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
> +	if ((cr4 ^ old_cr4) & X86_CR4_PCIDE) {
>   		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
> -	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
> +		return;
> +	}
> +
> +	/* If any role bit is changed, the MMU needs to be reinitialized. */
> +	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
> +		kvm_post_set_cr_reinit_mmu(vcpu);
> +
> +	/*
> +	 * Setting SMEP or toggling PGE is architecturally defined as flushing
> +	 * the TLB from the guest's perspective.  Note, because the shadow MMU
> +	 * ignores global pages, CR4.PGE is not part of KVM_MMU_CR4_ROLE_BITS.
> +	 */
> +	if (((cr4 ^ old_cr4) & X86_CR4_PGE) ||
> +	    ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP)))
>   		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
>   }
>   EXPORT_SYMBOL_GPL(kvm_post_set_cr4);
> 
> base-commit: a8c36d04d70d0b15e696561e1a2134fcbdd3a3bd
> --
>
Sean Christopherson Feb. 14, 2022, 7:24 p.m. UTC | #4
On Mon, Feb 14, 2022, Paolo Bonzini wrote:
> On 2/11/22 19:48, Sean Christopherson wrote:
> > On Wed, Feb 09, 2022, Paolo Bonzini wrote:
> > > @@ -5045,8 +5046,8 @@ void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >   void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
> > >   {
> > > -	kvm_mmu_unload(vcpu);
> > >   	kvm_init_mmu(vcpu);
> > > +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> > 
> > This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
> > affected, e.g. SMM transitions, KVM_SET_SREG, etc...
> 
> SMM exit does flush the TLB because RSM clears CR0.PG (I did check this :)).
> SMM re-entry then does not need to flush.  But I don't think SMM exit should
> flush the TLB *for non-SMM roles*.

I'm not concerned about the TLB flush aspects so much as the addition of
kvm_mmu_new_pgd() in new paths.

> For KVM_SET_SREGS I'm not sure if it should flush the TLB, but I agree it is
> certainly safer to keep it that way.
> 
> > Given that kvm_post_set_cr{0,4}() and kvm_vcpu_reset() explicitly handle CR0.PG
> > and CR4.SMEP toggling, I highly doubt the other flows are correct in all instances.
> > The call to kvm_mmu_new_pgd() is also
> 
> *white noise*
> 
> > To minimize risk, we should leave kvm_mmu_reset_context() as is (rename it if
> > necessary) and instead add a new helper to handle kvm_post_set_cr{0,4}().  In
> > the future we can/should work on avoiding unload in all paths, but again, future
> > problem.
> 
> I disagree on this.  There aren't many calls to kvm_mmu_reset_context.

All the more reason to do things incrementally.  I have no objection to allowing
all flows to reuse a cached (or current) root, I'm objecting to converting them
all in a single patch.  

> > > -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> > > +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
> > > +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
> > > +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
> > > +			kvm_mmu_unload(vcpu);
> > 
> > Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
> > to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
> > respect to the changelog.  Please elaborate :-)
> 
> Yep, will do (the CR0.PG=0 case is similar to the CR0.PCIDE=0 case below).

Oh, you're freeing all roots to ensure a future MOV CR3 with NO_FLUSH and PCIDE=1
can't reuse a stale root.  That's necessary if and only if the MMU is shadowing
the guest, non-nested TDP MMUs just need to flush the guest's TLB.  The same is
true for the PCIDE case, i.e. we could optimize that too, though the main motivation
would be to clarify why all roots are unloaded.
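
Roughly, the distinction being described (a hypothetical helper, not
actual KVM code):

static void kvm_service_cr_tlb_flush(struct kvm_vcpu *vcpu)
{
	if (tdp_enabled && !mmu_is_nested(vcpu)) {
		/* Guest CR3 is not shadowed: flushing the guest TLB suffices. */
		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
	} else {
		/*
		 * Shadow paging: drop all roots so a later MOV CR3 with
		 * NO_FLUSH and PCIDE=1 cannot reuse a stale one.
		 */
		kvm_mmu_unload(vcpu);
	}
}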

> Using kvm_mmu_unload() avoids loading a cached root just to throw it away
> immediately after,

The shadow paging case will throw it away, but not the non-nested TDP MMU case?

> but I can change this to a new KVM_REQ_MMU_UPDATE_ROOT flag that does
> 
> 	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);

I don't think that's necessary, I was just confused by the discrepancy.

> By the way, I have a possibly stupid question.  In kvm_set_cr3 (called e.g.
> from emulator_set_cr()) there is
> 
>  	if (cr3 != kvm_read_cr3(vcpu))
> 		kvm_mmu_new_pgd(vcpu, cr3);
> 
> What makes this work if mmu_is_nested(vcpu)?

Hmm, nothing.  VMX is "broken" anyways because it will kick out to userspace with
X86EMUL_UNHANDLEABLE due to the CR3 intercept check.  SVM is also broken in that
it doesn't check INTERCEPT_CR3_WRITE, e.g. will do the wrong thing even if L1 wants
to intercept CR3 accesses.

> Should this also have an "if (... & !tdp_enabled)"?

Yes?  That should avoid the nested mess.  This patch also needs to handle CR0 and
CR4 modifications if L2 is active, e.g. if L1 chooses not to intercept CR0/CR4.
kvm_post_set_cr_reinit_mmu() would be a lovely landing spot for that check :-D
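
One possible shape for that check, building on the hypothetical
kvm_post_set_cr_reinit_mmu() helper from the sketch above (illustrative
only, not a tested patch):

static void kvm_post_set_cr_reinit_mmu(struct kvm_vcpu *vcpu)
{
	/*
	 * If L2 is active, the CR0/CR4 change affects the nested MMU
	 * context; fall back to a full reset instead of reloading a root
	 * for the L1 CR3.
	 */
	if (is_guest_mode(vcpu)) {
		kvm_mmu_reset_context(vcpu);
		return;
	}

	kvm_init_mmu(vcpu);
	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
}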
Paolo Bonzini Feb. 15, 2022, 8:17 a.m. UTC | #5
On 2/14/22 20:24, Sean Christopherson wrote:
> On Mon, Feb 14, 2022, Paolo Bonzini wrote:
>> On 2/11/22 19:48, Sean Christopherson wrote:
>>> On Wed, Feb 09, 2022, Paolo Bonzini wrote:
>>>> -	kvm_mmu_unload(vcpu);
>>>>    	kvm_init_mmu(vcpu);
>>>> +	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
>>>
>>> This is too risky IMO, there are far more flows than just MOV CR0/CR4 that are
>>> affected, e.g. SMM transitions, KVM_SET_SREG, etc...
>
> I'm not concerned about the TLB flush aspects so much as the addition of
> kvm_mmu_new_pgd() in new paths.

Okay, yeah those are more complex and the existing ones are broken too.

>>>> -	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>>>> +	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
>>>> +		/* Flush the TLB if CR0 is changed 1 -> 0.  */
>>>> +		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
>>>> +			kvm_mmu_unload(vcpu);
>>>
>>> Calling kvm_mmu_unload() instead of requesting a flush isn't coherent with respect
>>> to the comment, or with SMEP handling.  And the SMEP handling isn't coherent with
>>> respect to the changelog.  Please elaborate :-)
>>
>> Yep, will do (the CR0.PG=0 case is similar to the CR0.PCIDE=0 case below).
> 
> Oh, you're freeing all roots to ensure a future MOV CR3 with NO_FLUSH and PCIDE=1
> can't reuse a stale root.  That's necessary if and only if the MMU is shadowing
> the guest, non-nested TDP MMUs just need to flush the guest's TLB.  The same is
> true for the PCIDE case, i.e. we could optimize that too, though the main motivation
> would be to clarify why all roots are unloaded.

Yes.  Clarifying all this should be done before the big change to 
kvm_mmu_reset_context().

>> Using kvm_mmu_unload() avoids loading a cached root just to throw it away
>> immediately after,
> 
> The shadow paging case will throw it away, but not the non-nested TDP MMU case?

Yes, the TDP case is okay since the role is the same.  kvm_init_mmu() is 
enough.

>> but I can change this to a new KVM_REQ_MMU_UPDATE_ROOT flag that does
>>
>> 	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
> 
> I don't think that's necessary, I was just confused by the discrepancy.

It may not be necessary but it is clearer IMO.  Let me post a new patch.

Paolo

Patch

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 38b40ddcaad7..dbd4e98ba426 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5020,8 +5020,8 @@  EXPORT_SYMBOL_GPL(kvm_init_mmu);
 void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	/*
-	 * Invalidate all MMU roles to force them to reinitialize as CPUID
-	 * information is factored into reserved bit calculations.
+	 * Invalidate all MMU roles and roots to force them to reinitialize,
+	 * as CPUID information is factored into reserved bit calculations.
 	 *
 	 * Correctly handling multiple vCPU models with respect to paging and
 	 * physical address properties) in a single VM would require tracking
@@ -5034,6 +5034,7 @@  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.root_mmu.mmu_role.ext.valid = 0;
 	vcpu->arch.guest_mmu.mmu_role.ext.valid = 0;
 	vcpu->arch.nested_mmu.mmu_role.ext.valid = 0;
+	kvm_mmu_unload(vcpu);
 	kvm_mmu_reset_context(vcpu);
 
 	/*
@@ -5045,8 +5046,8 @@  void kvm_mmu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 void kvm_mmu_reset_context(struct kvm_vcpu *vcpu)
 {
-	kvm_mmu_unload(vcpu);
 	kvm_init_mmu(vcpu);
+	kvm_mmu_new_pgd(vcpu, vcpu->arch.cr3);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_reset_context);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0d3646535cc5..97c4f5fc291f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -873,8 +873,12 @@  void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 		kvm_async_pf_hash_reset(vcpu);
 	}
 
-	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
+	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS) {
+		/* Flush the TLB if CR0 is changed 1 -> 0.  */
+		if ((old_cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PG))
+			kvm_mmu_unload(vcpu);
 		kvm_mmu_reset_context(vcpu);
+	}
 
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
@@ -1067,15 +1071,18 @@  void kvm_post_set_cr4(struct kvm_vcpu *vcpu, unsigned long old_cr4, unsigned lon
 	 * free them all.  KVM_REQ_MMU_RELOAD is fit for the both cases; it
 	 * is slow, but changing CR4.PCIDE is a rare case.
 	 *
-	 * If CR4.PGE is changed, the guest TLB must be flushed.
+	 * Setting SMEP also needs to flush the TLB, in addition to resetting
+	 * the MMU.
 	 *
-	 * Note: resetting MMU is a superset of KVM_REQ_MMU_RELOAD and
-	 * KVM_REQ_MMU_RELOAD is a superset of KVM_REQ_TLB_FLUSH_GUEST, hence
-	 * the usage of "else if".
+	 * If CR4.PGE is changed, the guest TLB must be flushed.  Because
+	 * the shadow MMU ignores global pages, this bit is not part of
+	 * KVM_MMU_CR4_ROLE_BITS.
 	 */
-	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS)
+	if ((cr4 ^ old_cr4) & KVM_MMU_CR4_ROLE_BITS) {
 		kvm_mmu_reset_context(vcpu);
-	else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
+		if ((cr4 & X86_CR4_SMEP) && !(old_cr4 & X86_CR4_SMEP))
+			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+	} else if ((cr4 ^ old_cr4) & X86_CR4_PCIDE)
 		kvm_make_request(KVM_REQ_MMU_RELOAD, vcpu);
 	else if ((cr4 ^ old_cr4) & X86_CR4_PGE)
 		kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
@@ -11329,8 +11336,10 @@  void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	 * paging related bits are ignored if paging is disabled, i.e. CR0.WP,
 	 * CR4, and EFER changes are all irrelevant if CR0.PG was '0'.
 	 */
-	if (old_cr0 & X86_CR0_PG)
-		kvm_mmu_reset_context(vcpu);
+	if (old_cr0 & X86_CR0_PG) {
+		kvm_mmu_unload(vcpu);
+		kvm_init_mmu(vcpu);
+	}
 
 	/*
 	 * Intel's SDM states that all TLB entries are flushed on INIT.  AMD's