
[v3] KVM: x86: use wrpkru directly in kvm_load_{guest|host}_xsave_state

Message ID 20210511170508.40034-1-jon@nutanix.com (mailing list archive)
State New, archived
Series [v3] KVM: x86: use wrpkru directly in kvm_load_{guest|host}_xsave_state

Commit Message

Jon Kohler May 11, 2021, 5:05 p.m. UTC
kvm_load_host_xsave_state handles xsave on vm exit, part of which is
managing memory protection key state. The latest arch.pkru is updated
with rdpkru, and if that doesn't match the base host_pkru (which
happens about 70% of the time), we issue a __write_pkru.

__write_pkru issues another rdpkru internally to try to avoid the
wrpkru, so we read the same value back to back even though at this
point we already know the values differ and must go directly to
wrpkru. That internal comparison is a guaranteed branch miss and
extra work that can be skipped.
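
For reference, a condensed sketch of the pre-patch vm-exit flow,
pieced together from the code this patch removes, showing the
back-to-back reads:

	/* kvm_load_host_xsave_state, pre-patch (condensed): */
	vcpu->arch.pkru = rdpkru();		/* first read */
	if (vcpu->arch.pkru != vcpu->arch.host_pkru)
		__write_pkru(vcpu->arch.host_pkru);

	/* __write_pkru, pre-patch: */
	static inline void __write_pkru(u32 pkru)
	{
		/* second read; the values are already known to differ */
		if (pkru == rdpkru())
			return;
		wrpkru(pkru);
	}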

To improve performance, use wrpkru directly in KVM code and simplify
the uses of __write_pkru such that it can be removed completely.

While we're in this section of code, optimize the ordering of the if
conditions prior to wrpkru in both kvm_load_{guest|host}_xsave_state.

For both functions, flip the ordering of the || condition so that
arch.xcr0 & XFEATURE_MASK_PKRU is checked first, which, when
instrumented in our environment, appeared to be always true and is
less overall work than kvm_read_cr4_bits.

For kvm_load_guest_xsave_state, hoist the arch.pkru != host_pkru check
ahead one position. When instrumented, I saw it evaluate true roughly
70% of the time, versus the other conditions, which were almost always
true. With this change, we avoid the third condition check ~30% of the
time.
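
The resulting kvm_load_guest_xsave_state check, annotated with the
rates observed above:

	if (static_cpu_has(X86_FEATURE_PKU) &&		/* static branch */
	    vcpu->arch.pkru != vcpu->arch.host_pkru &&	/* true ~70% of the time */
	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||	/* almost always true */
	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))	/* most expensive, last */
		wrpkru(vcpu->arch.pkru);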

Cc: Babu Moger <babu.moger@amd.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
v1 -> v2:
 - Addressed comments on approach from Paolo and Dave.
v2 -> v3:
 - fix return-too-soon in switch_fpu_finish, which would
   have sometimes accidentally skipped update_pasid().
 - make sure write_pkru() users do not regress by adding the
   pkru == rdpkru() optimization prior to the wrpkru() call.

 arch/x86/include/asm/fpu/internal.h  |  8 +++++++-
 arch/x86/include/asm/pgtable.h       |  9 ++++++++-
 arch/x86/include/asm/special_insns.h | 20 +++++---------------
 arch/x86/kvm/x86.c                   | 14 +++++++-------
 4 files changed, 27 insertions(+), 24 deletions(-)

--
2.30.1 (Apple Git-130)

Comments

Dave Hansen May 11, 2021, 7:08 p.m. UTC | #1
On 5/11/21 10:05 AM, Jon Kohler wrote:
> To improve performance, use wrpkru directly in KVM code and simplify
> the uses of __write_pkru such that it can be removed completely.

I don't love the repeated:

+	if (pkru_val != rdpkru())
+		wrpkru(pkru_val);

But, all of the w*pkru() variants were getting confusing, even to me. :)

Being explicit about the "wrpkru suppression" in a limited number of
places looks fine.

Acked-by: Dave Hansen <dave.hansen@intel.com>
Peter Zijlstra May 12, 2021, 7:41 a.m. UTC | #2
On Tue, May 11, 2021 at 01:05:02PM -0400, Jon Kohler wrote:
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 8d33ad80704f..5bc4df3a4c27 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -583,7 +583,13 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>  		if (pk)
>  			pkru_val = pk->pkru;
>  	}
> -	__write_pkru(pkru_val);
> +
> +	/*
> +	 * WRPKRU is relatively expensive compared to RDPKRU.
> +	 * Avoid WRPKRU when it would not change the value.
> +	 */
> +	if (pkru_val != rdpkru())
> +		wrpkru(pkru_val);

Just wondering; why aren't we having that in a per-cpu variable? The
usual per-cpu MSR shadow approach avoids issuing any 'special' ops
entirely.
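
For context, the usual per-cpu MSR shadow pattern being suggested
would look roughly like the sketch below; pkru_shadow and
wrpkru_cached() are illustrative names, not existing kernel symbols:

	#include <linux/percpu.h>

	/* Hypothetical per-cpu shadow of the PKRU register. */
	static DEFINE_PER_CPU(u32, pkru_shadow);

	static inline void wrpkru_cached(u32 pkru)
	{
		/* Compare against the cached copy instead of issuing RDPKRU. */
		if (this_cpu_read(pkru_shadow) == pkru)
			return;
		this_cpu_write(pkru_shadow, pkru);
		wrpkru(pkru);
	}

The trade-off, raised later in the thread, is that a shadow can fall
out of sync with the real register (e.g. if the guest writes PKRU),
whereas RDPKRU always reflects the true value.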
Dave Hansen May 12, 2021, 6:33 p.m. UTC | #3
On 5/12/21 12:41 AM, Peter Zijlstra wrote:
> On Tue, May 11, 2021 at 01:05:02PM -0400, Jon Kohler wrote:
>> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
>> index 8d33ad80704f..5bc4df3a4c27 100644
>> --- a/arch/x86/include/asm/fpu/internal.h
>> +++ b/arch/x86/include/asm/fpu/internal.h
>> @@ -583,7 +583,13 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>>  		if (pk)
>>  			pkru_val = pk->pkru;
>>  	}
>> -	__write_pkru(pkru_val);
>> +
>> +	/*
>> +	 * WRPKRU is relatively expensive compared to RDPKRU.
>> +	 * Avoid WRPKRU when it would not change the value.
>> +	 */
>> +	if (pkru_val != rdpkru())
>> +		wrpkru(pkru_val);
> Just wondering; why aren't we having that in a per-cpu variable? The
> usual per-cpu MSR shadow approach avoids issuing any 'special' ops
> entirely.

It could be a per-cpu variable.  When I wrote this originally I figured
that a rdpkru would be cheaper than a load from memory (even per-cpu
memory).

But, now that I think about it, assuming that 'pkru_val' is in %rdi, doing:

	cmp	%gs:0x1234, %rdi

might end up being cheaper than clobbering a *pair* of GPRs with rdpkru:

	xor    %ecx,%ecx
	rdpkru
	cmp	%rax, %rdi

I'm too lazy to go figure out what would be faster in practice,
though. Does anyone care?
Jon Kohler May 17, 2021, 2:58 a.m. UTC | #4
> On May 14, 2021, at 12:46 AM, Andy Lutomirski <luto@kernel.org> wrote:
> 
> 
> 
> On Wed, May 12, 2021, at 11:33 AM, Dave Hansen wrote:
>> On 5/12/21 12:41 AM, Peter Zijlstra wrote:
>> > On Tue, May 11, 2021 at 01:05:02PM -0400, Jon Kohler wrote:
>> >> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
>> >> index 8d33ad80704f..5bc4df3a4c27 100644
>> >> --- a/arch/x86/include/asm/fpu/internal.h
>> >> +++ b/arch/x86/include/asm/fpu/internal.h
>> >> @@ -583,7 +583,13 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>> >>  		if (pk)
>> >>  			pkru_val = pk->pkru;
>> >>  	}
>> >> -	__write_pkru(pkru_val);
>> >> +
>> >> +	/*
>> >> +	 * WRPKRU is relatively expensive compared to RDPKRU.
>> >> +	 * Avoid WRPKRU when it would not change the value.
>> >> +	 */
>> >> +	if (pkru_val != rdpkru())
>> >> +		wrpkru(pkru_val);
>> > Just wondering; why aren't we having that in a per-cpu variable? The
>> > usual per-cpu MSR shadow approach avoids issuing any 'special' ops
>> > entirely.
>> 
>> It could be a per-cpu variable.  When I wrote this originally I figured
>> that a rdpkru would be cheaper than a load from memory (even per-cpu
>> memory).
>> 
>> But, now that I think about it, assuming that 'pkru_val' is in %rdi, doing:
>> 
>> 	cmp	%gs:0x1234, %rdi
>> 
>> might end up being cheaper than clobbering a *pair* of GPRs with rdpkru:
>> 
>> 	xor	%ecx,%ecx
>> 	rdpkru
>> 	cmp	%rax, %rdi
>> 
>> I'm too lazy to go figure out what would be faster in practice, though.
>> Does anyone care?

Strictly from a profiling perspective, my observation is that the
rdpkru is pretty quick; it's the wrpkru that seems heavier under the
covers, so any speedup in rdpkru would likely go unnoticed by
comparison. That said, if this per-cpu variable would somehow get rid
of the underlying instruction and just emulate the whole thing, that
might be interesting.

From an incremental change perspective, though, this patch puts us in
a better spot, and I'm happy to take a look at future work if y'all
have some tips on top of this.

> 
> RDPKRU gets bonus points for being impossible to get out of sync.

Patch

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 8d33ad80704f..5bc4df3a4c27 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -583,7 +583,13 @@  static inline void switch_fpu_finish(struct fpu *new_fpu)
 		if (pk)
 			pkru_val = pk->pkru;
 	}
-	__write_pkru(pkru_val);
+
+	/*
+	 * WRPKRU is relatively expensive compared to RDPKRU.
+	 * Avoid WRPKRU when it would not change the value.
+	 */
+	if (pkru_val != rdpkru())
+		wrpkru(pkru_val);

 	/*
 	 * Expensive PASID MSR write will be avoided in update_pasid() because
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b1099f2d9800..0bf9da90baaf 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -151,7 +151,14 @@  static inline void write_pkru(u32 pkru)
 	fpregs_lock();
 	if (pk)
 		pk->pkru = pkru;
-	__write_pkru(pkru);
+
+	/*
+	 * WRPKRU is relatively expensive compared to RDPKRU.
+	 * Avoid WRPKRU when it would not change the value.
+	 */
+	if (pkru != rdpkru())
+		wrpkru(pkru);
+
 	fpregs_unlock();
 }

diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 2acd6cb62328..3c361b5cbed5 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -99,32 +99,22 @@  static inline void wrpkru(u32 pkru)
 	/*
 	 * "wrpkru" instruction.  Loads contents in EAX to PKRU,
 	 * requires that ecx = edx = 0.
+	 * WRPKRU is relatively expensive compared to RDPKRU, callers
+	 * should try to compare pkru == rdpkru() and avoid the call
+	 * when it will not change the value; however, there are no
+	 * correctness issues if a caller WRPKRU's for the same value.
 	 */
 	asm volatile(".byte 0x0f,0x01,0xef\n\t"
 		     : : "a" (pkru), "c"(ecx), "d"(edx));
 }

-static inline void __write_pkru(u32 pkru)
-{
-	/*
-	 * WRPKRU is relatively expensive compared to RDPKRU.
-	 * Avoid WRPKRU when it would not change the value.
-	 */
-	if (pkru == rdpkru())
-		return;
-
-	wrpkru(pkru);
-}
-
 #else
 static inline u32 rdpkru(void)
 {
 	return 0;
 }

-static inline void __write_pkru(u32 pkru)
-{
-}
+static inline void wrpkru(u32 pkru) {}
 #endif

 static inline void native_wbinvd(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cebdaa1e3cf5..3222c7f60f31 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -912,10 +912,10 @@  void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
 	}

 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
-	    vcpu->arch.pkru != vcpu->arch.host_pkru)
-		__write_pkru(vcpu->arch.pkru);
+	    vcpu->arch.pkru != vcpu->arch.host_pkru &&
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE)))
+		wrpkru(vcpu->arch.pkru);
 }
 EXPORT_SYMBOL_GPL(kvm_load_guest_xsave_state);

@@ -925,11 +925,11 @@  void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 		return;

 	if (static_cpu_has(X86_FEATURE_PKU) &&
-	    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
-	     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU))) {
+	    ((vcpu->arch.xcr0 & XFEATURE_MASK_PKRU) ||
+	     kvm_read_cr4_bits(vcpu, X86_CR4_PKE))) {
 		vcpu->arch.pkru = rdpkru();
 		if (vcpu->arch.pkru != vcpu->arch.host_pkru)
-			__write_pkru(vcpu->arch.host_pkru);
+			wrpkru(vcpu->arch.host_pkru);
 	}

 	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE)) {