diff mbox series

x86/entry/64: Disallow RDPID in paranoid entry if KVM is enabled

Message ID 20200821025050.32573-1-sean.j.christopherson@intel.com (mailing list archive)
State New, archived
Series x86/entry/64: Disallow RDPID in paranoid entry if KVM is enabled

Commit Message

Sean Christopherson Aug. 21, 2020, 2:50 a.m. UTC
Don't use RDPID in the paranoid entry flow if KVM is enabled as doing so
can consume a KVM guest's MSR_TSC_AUX value if an NMI arrives in KVM's
run loop.

As a performance optimization, KVM loads the guest's TSC_AUX when a CPU
first enters its run loop, and on AMD's SVM doesn't restore the host's
value until the CPU exits the run loop.  VMX is even more aggressive and
defers restoring the host's value until the CPU returns to userspace.
This optimization obviously relies on the kernel not consuming TSC_AUX,
which falls apart if an NMI arrives in the run loop.

Removing KVM's optimization would be painful, as both SVM and VMX would
need to context switch the MSR on every VM-Enter (2x WRMSR + 1x RDMSR),
whereas using LSL instead of RDPID is a minor blip.

Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Chang Seok Bae <chang.seok.bae@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Debugged-by: Tom Lendacky <thomas.lendacky@amd.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---

Andy, I know you said "unconditionally", but it felt weird adding a
comment way down in GET_PERCPU_BASE without plumbing a param in to help
provide context.  But, paranoid_entry is the only user so adding a param
that is unconditional also felt weird.  That being said, I definitely
don't have a strong opinion one way or the other.

 arch/x86/entry/calling.h  | 10 +++++++---
 arch/x86/entry/entry_64.S |  7 ++++++-
 2 files changed, 13 insertions(+), 4 deletions(-)

Comments

Peter Zijlstra Aug. 21, 2020, 7:24 a.m. UTC | #1
On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 70dea93378162..fd915c46297c5 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -842,8 +842,13 @@ SYM_CODE_START_LOCAL(paranoid_entry)
>  	 *
>  	 * The MSR write ensures that no subsequent load is based on a
>  	 * mispredicted GSBASE. No extra FENCE required.
> +	 *
> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
> +	 * VM-Enter and may not restore the host's value until the CPU returns
> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>  	 */
> -	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
> +	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx no_rdpid=IS_ENABLED(CONFIG_KVM)
>  	ret

With distro configs that's going to be a guaranteed no_rdpid. Also with
a grand total of 0 performance numbers that RDPID is even worth it, I'd
suggest to just unconditionally remove that thing. Simpler code
all-around.
Borislav Petkov Aug. 21, 2020, 7:44 a.m. UTC | #2
On Fri, Aug 21, 2020 at 09:24:14AM +0200, peterz@infradead.org wrote:
> With distro configs that's going to be a guaranteed no_rdpid. Also with
> a grand total of 0 performance numbers that RDPID is even worth it, I'd
> suggest to just unconditionally remove that thing. Simpler code
> all-around.

I was just about to say the same thing - all this dancing around just to
keep RDPID better be backed by numbers, otherwise just remove the damn
thing in that path and use LSL and be done with it.
Borislav Petkov Aug. 21, 2020, 7:47 a.m. UTC | #3
On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
> +	 * VM-Enter and may not restore the host's value until the CPU returns
> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>  	 */

And frankly, this is really unfair. The kernel should be able to use any
MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
MSRs so one more MSR is not a big deal.
Paolo Bonzini Aug. 21, 2020, 8:09 a.m. UTC | #4
On 21/08/20 09:47, Borislav Petkov wrote:
> On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
>> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
>> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
>> +	 * VM-Enter and may not restore the host's value until the CPU returns
>> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>>  	 */
> And frankly, this is really unfair. The kernel should be able to use any
> MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
> MSRs so one more MSR is not a big deal.

The only MSRs that KVM needs to context-switch manually are XSS and
SPEC_CTRL.  They tend to be the same on host and guest, in which case
they can be optimized away.

All the other MSRs (EFER and PAT are those that come to mind) are
handled by the microcode and thus they don't have the slowness of
RDMSR/WRMSR.

One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.

Paolo
Borislav Petkov Aug. 21, 2020, 8:16 a.m. UTC | #5
On Fri, Aug 21, 2020 at 10:09:01AM +0200, Paolo Bonzini wrote:
> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.

The kernel uses TSC_AUX so we can't reserve it to KVM either.
Paolo Bonzini Aug. 21, 2020, 9:05 a.m. UTC | #6
On 21/08/20 10:16, Borislav Petkov wrote:
> On Fri, Aug 21, 2020 at 10:09:01AM +0200, Paolo Bonzini wrote:
>> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
>> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.
> 
> The kernel uses TSC_AUX so we can't reserve it to KVM either.

KVM only uses TSC_AUX while in kernel space, because the kernel hadn't
used it until now.  That's for a good reason:

* if possible, __this_cpu_read(cpu_number) is always faster.

* The kernel can just block preemption at will and has no need for
the atomic rdtsc+vgetcpu provided by RDTSCP.

So far, the kernel had always used LSL instead of RDPID when
__this_cpu_read was not available.  In one place, RDTSCP is used as an
ordered rdtsc but it discards the TSC_AUX value.  RDPID is also used in
the vDSO but it isn't kernel space.

Hence the assumption that KVM makes (and has made ever since TSC_AUX was
introduced).  What is the difference in speed between LSL and RDPID?  I
don't have a machine that has RDPID to test it, unfortunately.

Paolo
Borislav Petkov Aug. 21, 2020, 9:22 a.m. UTC | #7
On Fri, Aug 21, 2020 at 11:05:24AM +0200, Paolo Bonzini wrote:
> ... RDTSCP is used as an ordered rdtsc but it discards the TSC_AUX
> value.

... and now because KVM is using it, the kernel can forget using
TSC_AUX. Are you kidding me?!

> Hence the assumption that KVM makes (and has made ever since TSC_AUX was
> introduced.

And *this* is the problem. KVM can't just be grabbing MSRs and claiming
them because it started using them first. You guys need to stop this. If
it is a shared resource, there better be an agreement about sharing it.
Thomas Gleixner Aug. 21, 2020, 9:28 a.m. UTC | #8
On Fri, Aug 21 2020 at 10:09, Paolo Bonzini wrote:
> On 21/08/20 09:47, Borislav Petkov wrote:
>> On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
>>> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
>>> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
>>> +	 * VM-Enter and may not restore the host's value until the CPU returns
>>> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>>>  	 */
>> And frankly, this is really unfair. The kernel should be able to use any
>> MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
>> MSRs so one more MSR is not a big deal.
>
> The only MSR that KVM needs to context-switch manually are XSS and
> SPEC_CTRL.  They tend to be the same on host and guest in which case
> they can be optimized away.
>
> All the other MSRs (EFER and PAT are those that come to mind) are
> handled by the microcode and thus they don't have the slowness of
> RDMSR/WRMSR
>
> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.

We all know that MSRs are slow, but as a general rule I have to make it
entirely clear that the kernel has precedence over KVM.

If the kernel wants to use an MSR for its own purposes then KVM has to
deal with that and not the other way round. Preventing the kernel from
using a facility freely is not an option ever.

The insanities of KVM performance optimizations have bitten us more than
once.

For this particular case at hand I don't care much and we should just
rip the whole RDPID thing out unconditionally. We still have zero
numbers about the performance difference vs. LSL.

Thanks,

        tglx
Paolo Bonzini Aug. 21, 2020, 9:37 a.m. UTC | #9
On 21/08/20 11:28, Thomas Gleixner wrote:
> We all know that MSRs are slow, but as a general rule I have to make it
> entirely clear that the kernel has precedence over KVM.

I totally agree.  I just don't think that it matters _in this case_,
because the kernel hardly has any reason to use TSC_AUX while in ring0.

Paolo

> If the kernel wants to use an MSR for it's own purposes then KVM has to
> deal with that and not the other way round. Preventing the kernel from
> using a facility freely is not an option ever.
> 
> The insanities of KVM performance optimizations have bitten us more than
> once.
> 
> For this particular case at hand I don't care much and we should just
> rip the whole RDPID thing out unconditionally. We still have zero
> numbers about the performance difference vs. LSL.
Paolo Bonzini Aug. 21, 2020, 9:44 a.m. UTC | #10
On 21/08/20 11:22, Borislav Petkov wrote:
>> Hence the assumption that KVM makes (and has made ever since TSC_AUX was
>> introduced.
> And *this* is the problem. KVM can't just be grabbing MSRs and claiming
> them because it started using them first. You guys need to stop this. If
> it is a shared resource, there better be an agreement about sharing it.

It's not like we grab MSRs every day.  The user-return notifier restores
6 MSRs (7 on very old processors).  The last two that were added were
MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.

Paolo
Borislav Petkov Aug. 21, 2020, 9:48 a.m. UTC | #11
On Fri, Aug 21, 2020 at 11:44:33AM +0200, Paolo Bonzini wrote:
> It's not like we grab MSRs every day.  The user-return notifier restores
> 6 MSRs (7 on very old processors).  The last two that were added were
> MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.

What about "If it is a shared resource, there better be an agreement
about sharing it." is not clear?

It doesn't matter how many or which resources - there needs to be a
contract for shared use so that shared use is possible. It is that
simple.
Paolo Bonzini Aug. 21, 2020, 10:07 a.m. UTC | #12
On 21/08/20 11:48, Borislav Petkov wrote:
>> It's not like we grab MSRs every day.  The user-return notifier restores
>> 6 MSRs (7 on very old processors).  The last two that were added were
>> MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.
> What about "If it is a shared resource, there better be an agreement
> about sharing it." is not clear?
> 
> It doesn't matter how many or which resources - there needs to be a
> contract for shared use so that shared use is possible. It is that
> simple.

Sure, and I'll make sure to have that discussion the next time we add a
shared MSR in 2029.

In the meanwhile:

* for the syscall MSRs, patches to share them were reviewed by hpa and
peterz so I guess there's no problem.

* for MSR_TSC_AUX, which is the one that is causing problems, everybody
seems to agree with just using LSL (in the lack of specific numbers on
performance improvements from RDPID).

* for MSR_IA32_TSX_CTRL.RTM_DISABLE, which is the only one that was
added in the last 10 years, I'm pretty sure there are no plans for using
the Trusty Sidechannel eXtension in the kernel.  So that should be fine
too.  (The CPUID_CLEAR bit of the MSR is not shared).

Thanks,

Paolo
H. Peter Anvin Aug. 21, 2020, 7:55 p.m. UTC | #13
On August 21, 2020 2:28:32 AM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Fri, Aug 21 2020 at 10:09, Paolo Bonzini wrote:
>> On 21/08/20 09:47, Borislav Petkov wrote:
>>> On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
>>>> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
>>>> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
>>>> +	 * VM-Enter and may not restore the host's value until the CPU returns
>>>> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>>>>  	 */
>>> And frankly, this is really unfair. The kernel should be able to use any
>>> MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
>>> MSRs so one more MSR is not a big deal.
>>
>> The only MSR that KVM needs to context-switch manually are XSS and
>> SPEC_CTRL.  They tend to be the same on host and guest in which case
>> they can be optimized away.
>>
>> All the other MSRs (EFER and PAT are those that come to mind) are
>> handled by the microcode and thus they don't have the slowness of
>> RDMSR/WRMSR
>>
>> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
>> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.
>
>We all know that MSRs are slow, but as a general rule I have to make it
>entirely clear that the kernel has precedence over KVM.
>
>If the kernel wants to use an MSR for it's own purposes then KVM has to
>deal with that and not the other way round. Preventing the kernel from
>using a facility freely is not an option ever.
>
>The insanities of KVM performance optimizations have bitten us more than
>once.
>
>For this particular case at hand I don't care much and we should just
>rip the whole RDPID thing out unconditionally. We still have zero
>numbers about the performance difference vs. LSL.
>
>Thanks,
>
>        tglx

It is hardly going to be a performance difference for paranoid entry, which is hopefully rare enough that it falls into the noise.
Peter Zijlstra Aug. 21, 2020, 8:02 p.m. UTC | #14
On Fri, Aug 21, 2020 at 12:55:53PM -0700, hpa@zytor.com wrote:

> It is hardly going to be a performance difference for paranoid entry,
> which is hopefully rare enough that it falls into the noise.

Try perf some day ;-)

But yeah, given the utter trainwreck that NMIs are anyway, this is all
not going to matter much.
Andy Lutomirski Aug. 22, 2020, 4:42 p.m. UTC | #15
On Fri, Aug 21, 2020 at 3:07 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 21/08/20 11:48, Borislav Petkov wrote:
> >> It's not like we grab MSRs every day.  The user-return notifier restores
> >> 6 MSRs (7 on very old processors).  The last two that were added were
> >> MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.
> > What about "If it is a shared resource, there better be an agreement
> > about sharing it." is not clear?
> >
> > It doesn't matter how many or which resources - there needs to be a
> > contract for shared use so that shared use is possible. It is that
> > simple.
>
> Sure, and I'll make sure to have that discussion the next time we add a
> shared MSR in 2029.
>
> In the meanwhile:
>
> * for the syscall MSRs, patches to share them were reviewed by hpa and
> peterz so I guess there's no problem.
>
> * for MSR_TSC_AUX, which is the one that is causing problems, everybody
> seems to agree with just using LSL (in the lack specific numbers on
> performance improvements from RDPID).
>
> * for MSR_IA32_TSX_CTRL.RTM_DISABLE, which is the only one that was
> added in the last 10 years, I'm pretty sure there are no plans for using
> the Trusty Sidechannel eXtension in the kernel.  So that should be fine
> too.  (The CPUID_CLEAR bit of the MSR is not shared).
>

Regardless of how anyone feels about who owns what in arch/x86 vs
arch/x86/kvm, the x86 architecture is a mess that comes from Intel and
AMD, and we have to deal with it.  On VMX, when a VM exits, the VM's
value of MSR_TSC_AUX is live, and we can take an NMI, MCE, or
abominable new #SX, #VE, #VC, etc on the next instruction boundary.
And unless we use the atomic MSR switch mechanism, the result is that
we're going through the entry path with guest-controlled MSRs.

So I think we can chalk this one up to obnoxious hardware.
Paolo Bonzini Sept. 16, 2020, 4:54 p.m. UTC | #16
On 22/08/20 18:42, Andy Lutomirski wrote:
> On VMX, when a VM exits, the VM's
> value of MSR_TSC_AUX is live, and we can take an NMI, MCE, or
> abominable new #SX, #VE, #VC, etc on the next instruction boundary.
> And unless we use the atomic MSR switch mechanism, the result is that
> we're going through the entry path with guest-controlled MSRs.

If anything of that is a problem, we can and will use the atomic MSR
switching; it's not worth doing complicated stuff if you're going to pay
the price of rdmsr/wrmsr anyway.

The remaining cases are MSRs that are really meant for usermode (such as
the syscall MSRs) and especially the edge cases of these two MSRs that
the kernel doesn't mind too much about.  But they are really really
rare, I don't expect any new one coming soon and if they are ever needed
(by SGX perhaps?!?) I'll certainly loop you guys in.

Paolo

Patch

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 98e4d8886f11c..a925c0cf89c1a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -342,9 +342,9 @@  For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
-.macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req
+.macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req no_rdpid=0
 	rdgsbase \save_reg
-	GET_PERCPU_BASE \scratch_reg
+	GET_PERCPU_BASE \scratch_reg \no_rdpid
 	wrgsbase \scratch_reg
 .endm
 
@@ -375,11 +375,15 @@  For 32-bit we have the following conventions - kernel is built with
  * We normally use %gs for accessing per-CPU data, but we are setting up
  * %gs here and obviously can not use %gs itself to access per-CPU data.
  */
-.macro GET_PERCPU_BASE reg:req
+.macro GET_PERCPU_BASE reg:req no_rdpid=0
+	.if \no_rdpid
+	LOAD_CPU_AND_NODE_SEG_LIMIT \reg
+	.else
 	ALTERNATIVE \
 		"LOAD_CPU_AND_NODE_SEG_LIMIT \reg", \
 		"RDPID	\reg", \
 		X86_FEATURE_RDPID
+	.endif
 	andq	$VDSO_CPUNODE_MASK, \reg
 	movq	__per_cpu_offset(, \reg, 8), \reg
 .endm
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 70dea93378162..fd915c46297c5 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -842,8 +842,13 @@  SYM_CODE_START_LOCAL(paranoid_entry)
 	 *
 	 * The MSR write ensures that no subsequent load is based on a
 	 * mispredicted GSBASE. No extra FENCE required.
+	 *
+	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
+	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
+	 * VM-Enter and may not restore the host's value until the CPU returns
+	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
 	 */
-	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
+	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx no_rdpid=IS_ENABLED(CONFIG_KVM)
 	ret
 
 .Lparanoid_entry_checkgs: