Message ID: 20200821025050.32573-1-sean.j.christopherson@intel.com (mailing list archive)
State: New, archived
Series: x86/entry/64: Disallow RDPID in paranoid entry if KVM is enabled
On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 70dea93378162..fd915c46297c5 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -842,8 +842,13 @@ SYM_CODE_START_LOCAL(paranoid_entry)
>  	 *
>  	 * The MSR write ensures that no subsequent load is based on a
>  	 * mispredicted GSBASE. No extra FENCE required.
> +	 *
> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
> +	 * VM-Enter and may not restore the host's value until the CPU returns
> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>  	 */
> -	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
> +	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx no_rdpid=IS_ENABLED(CONFIG_KVM)
>  	ret

With distro configs that's going to be a guaranteed no_rdpid. Also with
a grand total of 0 performance numbers that RDPID is even worth it, I'd
suggest to just unconditionally remove that thing. Simpler code
all-around.
On Fri, Aug 21, 2020 at 09:24:14AM +0200, peterz@infradead.org wrote:
> With distro configs that's going to be a guaranteed no_rdpid. Also with
> a grand total of 0 performance numbers that RDPID is even worth it, I'd
> suggest to just unconditionally remove that thing. Simpler code
> all-around.

I was just about to say the same thing - all this dancing around just to
keep RDPID better be backed by numbers, otherwise just remove the damn
thing in that path, use LSL, and be done with it.
On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
> +	 * VM-Enter and may not restore the host's value until the CPU returns
> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>  	 */

And frankly, this is really unfair. The kernel should be able to use any
MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
MSRs so one more MSR is not a big deal.
On 21/08/20 09:47, Borislav Petkov wrote:
> On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
>> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
>> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
>> +	 * VM-Enter and may not restore the host's value until the CPU returns
>> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>>  	 */
> And frankly, this is really unfair. The kernel should be able to use any
> MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
> MSRs so one more MSR is not a big deal.

The only MSRs that KVM needs to context-switch manually are XSS and
SPEC_CTRL. They tend to be the same on host and guest, in which case the
switch can be optimized away.

All the other MSRs (EFER and PAT are those that come to mind) are
handled by the microcode and thus don't have the slowness of
RDMSR/WRMSR.

One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
cycles, so adding 100 clock cycles for 2 WRMSRs is a 10% increase.

Paolo
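Paolo's 10% figure is straightforward to sanity-check. A tiny C sketch of the back-of-envelope math (the ~50-cycles-per-WRMSR cost is an assumption implied by his numbers, not measured here):

```c
#include <assert.h>

/*
 * Overhead of adding extra serializing instructions to a fixed-cost
 * round trip, as a percentage.  Illustrative only; real WRMSR latency
 * varies by microarchitecture.
 */
static unsigned int percent_overhead(unsigned int base_cycles,
				     unsigned int extra_cycles)
{
	return extra_cycles * 100 / base_cycles;
}
```

With a ~1000-cycle vmentry+vmexit and two WRMSRs at ~50 cycles each, `percent_overhead(1000, 100)` gives the 10% Paolo cites.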
On Fri, Aug 21, 2020 at 10:09:01AM +0200, Paolo Bonzini wrote:
> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.

The kernel uses TSC_AUX so we can't reserve it to KVM either.
On 21/08/20 10:16, Borislav Petkov wrote:
> On Fri, Aug 21, 2020 at 10:09:01AM +0200, Paolo Bonzini wrote:
>> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
>> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.
>
> The kernel uses TSC_AUX so we can't reserve it to KVM either.

KVM only uses TSC_AUX while in kernel space, because the kernel hadn't
used it until now. That's for a good reason:

* if possible, __this_cpu_read(cpu_number) is always faster;

* the kernel can just block preemption at will and has no need for the
atomic rdtsc+vgetcpu provided by RDTSCP.

So far, the kernel had always used LSL instead of RDPID when
__this_cpu_read was not available. In one place, RDTSCP is used as an
ordered rdtsc, but it discards the TSC_AUX value. RDPID is also used in
the vDSO, but that isn't kernel space. Hence the assumption that KVM
makes (and has made ever since TSC_AUX was introduced).

What is the difference in speed between LSL and RDPID? I don't have a
machine that has RDPID to test it, unfortunately.

Paolo
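For context on what LSL and RDPID both hand back: each yields the packed CPU/node value (from the CPUNODE segment limit and TSC_AUX respectively), which the entry macro then masks with VDSO_CPUNODE_MASK. A minimal C sketch of that encoding, assuming the 12-bit CPU field used by the kernel's vDSO getcpu convention at the time of this thread (see arch/x86/include/asm/segment.h):

```c
#include <assert.h>

/* Assumed constants, mirroring arch/x86/include/asm/segment.h. */
#define VDSO_CPUNODE_BITS 12
#define VDSO_CPUNODE_MASK 0xfff

/* What the kernel programs into the segment limit / TSC_AUX per CPU. */
static unsigned long cpunode_pack(unsigned int cpu, unsigned int node)
{
	return ((unsigned long)node << VDSO_CPUNODE_BITS) | cpu;
}

/* Same operation as "andq $VDSO_CPUNODE_MASK, %reg" in GET_PERCPU_BASE. */
static unsigned int cpunode_to_cpu(unsigned long val)
{
	return val & VDSO_CPUNODE_MASK;
}

static unsigned int cpunode_to_node(unsigned long val)
{
	return val >> VDSO_CPUNODE_BITS;
}
```

This is why a guest-controlled TSC_AUX is dangerous in paranoid entry: whatever RDPID returns gets masked and used as an index into __per_cpu_offset.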
On Fri, Aug 21, 2020 at 11:05:24AM +0200, Paolo Bonzini wrote:
> ... RDTSCP is used as an ordered rdtsc but it discards the TSC_AUX
> value.

... and now, because KVM is using it, the kernel can forget about using
TSC_AUX. Are you kidding me?!

> Hence the assumption that KVM makes (and has made ever since TSC_AUX was
> introduced).

And *this* is the problem. KVM can't just be grabbing MSRs and claiming
them because it started using them first. You guys need to stop this. If
it is a shared resource, there better be an agreement about sharing it.
On Fri, Aug 21 2020 at 10:09, Paolo Bonzini wrote:
> On 21/08/20 09:47, Borislav Petkov wrote:
>> On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
>>> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
>>> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
>>> +	 * VM-Enter and may not restore the host's value until the CPU returns
>>> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>>>  	 */
>> And frankly, this is really unfair. The kernel should be able to use any
>> MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
>> MSRs so one more MSR is not a big deal.
>
> The only MSRs that KVM needs to context-switch manually are XSS and
> SPEC_CTRL. They tend to be the same on host and guest, in which case
> the switch can be optimized away.
>
> All the other MSRs (EFER and PAT are those that come to mind) are
> handled by the microcode and thus don't have the slowness of
> RDMSR/WRMSR.
>
> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.

We all know that MSRs are slow, but as a general rule I have to make it
entirely clear that the kernel has precedence over KVM.

If the kernel wants to use an MSR for its own purposes then KVM has to
deal with that and not the other way round. Preventing the kernel from
using a facility freely is not an option, ever.

The insanities of KVM performance optimizations have bitten us more than
once.

For this particular case at hand I don't care much and we should just
rip the whole RDPID thing out unconditionally. We still have zero
numbers about the performance difference vs. LSL.

Thanks,

        tglx
On 21/08/20 11:28, Thomas Gleixner wrote:
> We all know that MSRs are slow, but as a general rule I have to make it
> entirely clear that the kernel has precedence over KVM.

I totally agree. I just don't think that it matters _in this case_,
because the kernel hardly has any reason to use TSC_AUX while in ring 0.

Paolo

> If the kernel wants to use an MSR for its own purposes then KVM has to
> deal with that and not the other way round. Preventing the kernel from
> using a facility freely is not an option, ever.
>
> The insanities of KVM performance optimizations have bitten us more than
> once.
>
> For this particular case at hand I don't care much and we should just
> rip the whole RDPID thing out unconditionally. We still have zero
> numbers about the performance difference vs. LSL.
On 21/08/20 11:22, Borislav Petkov wrote:
>> Hence the assumption that KVM makes (and has made ever since TSC_AUX was
>> introduced).
> And *this* is the problem. KVM can't just be grabbing MSRs and claiming
> them because it started using them first. You guys need to stop this. If
> it is a shared resource, there better be an agreement about sharing it.

It's not like we grab MSRs every day. The user-return notifier restores
6 MSRs (7 on very old processors). The last two that were added were
MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.

Paolo
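The user-return notifier scheme Paolo mentions is the mechanism at the heart of this bug: the host value is saved when KVM first clobbers the MSR and only restored on return to userspace, not on every VM-Exit. A toy C model of the idea, with the MSR simulated as a plain variable and all names illustrative rather than KVM's actual API:

```c
#include <assert.h>
#include <stdbool.h>

static unsigned long long msr_tsc_aux;   /* simulated hardware MSR */

struct user_return_msr {
	unsigned long long host_value;
	bool dirty;                      /* needs restore before userspace */
};

static struct user_return_msr tsc_aux_state;

/* Entering the run loop: save the host value once, make guest's live. */
static void kvm_load_guest_tsc_aux(unsigned long long guest_val)
{
	if (!tsc_aux_state.dirty) {
		tsc_aux_state.host_value = msr_tsc_aux;
		tsc_aux_state.dirty = true;
	}
	msr_tsc_aux = guest_val;
}

/* Called from the return-to-userspace notifier, not on every VM-Exit. */
static void user_return_msr_restore(void)
{
	if (tsc_aux_state.dirty) {
		msr_tsc_aux = tsc_aux_state.host_value;
		tsc_aux_state.dirty = false;
	}
}

/* The window this thread is about: between load and restore, an NMI
 * doing RDPID would read the guest's value (42), not the host's (3). */
static int demo_guest_value_is_live(void)
{
	msr_tsc_aux = 3;
	kvm_load_guest_tsc_aux(42);
	unsigned long long seen_by_nmi = msr_tsc_aux;
	user_return_msr_restore();
	return seen_by_nmi == 42 && msr_tsc_aux == 3;
}
```

The point of the model: correctness depends entirely on nothing in the kernel consuming the MSR inside the window, which is exactly the assumption the paranoid-entry RDPID broke.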
On Fri, Aug 21, 2020 at 11:44:33AM +0200, Paolo Bonzini wrote:
> It's not like we grab MSRs every day. The user-return notifier restores
> 6 MSRs (7 on very old processors). The last two that were added were
> MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.

What about "If it is a shared resource, there better be an agreement
about sharing it." is not clear?

It doesn't matter how many or which resources - there needs to be a
contract for shared use so that shared use is possible. It is that
simple.
On 21/08/20 11:48, Borislav Petkov wrote:
>> It's not like we grab MSRs every day. The user-return notifier restores
>> 6 MSRs (7 on very old processors). The last two that were added were
>> MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.
> What about "If it is a shared resource, there better be an agreement
> about sharing it." is not clear?
>
> It doesn't matter how many or which resources - there needs to be a
> contract for shared use so that shared use is possible. It is that
> simple.

Sure, and I'll make sure to have that discussion the next time we add a
shared MSR in 2029.

In the meanwhile:

* for the syscall MSRs, patches to share them were reviewed by hpa and
peterz, so I guess there's no problem;

* for MSR_TSC_AUX, which is the one that is causing problems, everybody
seems to agree with just using LSL (in the lack of specific numbers on
performance improvements from RDPID);

* for MSR_IA32_TSX_CTRL.RTM_DISABLE, which is the only one that was
added in the last 10 years, I'm pretty sure there are no plans for using
the Trusty Sidechannel eXtension in the kernel, so that should be fine
too. (The CPUID_CLEAR bit of the MSR is not shared.)

Thanks,

Paolo
On August 21, 2020 2:28:32 AM PDT, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, Aug 21 2020 at 10:09, Paolo Bonzini wrote:
>> On 21/08/20 09:47, Borislav Petkov wrote:
>>> On Thu, Aug 20, 2020 at 07:50:50PM -0700, Sean Christopherson wrote:
>>>> +	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
>>>> +	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
>>>> +	 * VM-Enter and may not restore the host's value until the CPU returns
>>>> +	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
>>>>  	 */
>>> And frankly, this is really unfair. The kernel should be able to use any
>>> MSR. IOW, KVM needs to be fixed here. I'm sure it context-switches other
>>> MSRs so one more MSR is not a big deal.
>>
>> [...]
>>
>> One more MSR *is* a big deal: KVM's vmentry+vmexit cost is around 1000
>> cycles, adding 100 clock cycles for 2 WRMSRs is a 10% increase.
>
> We all know that MSRs are slow, but as a general rule I have to make it
> entirely clear that the kernel has precedence over KVM.
>
> If the kernel wants to use an MSR for its own purposes then KVM has to
> deal with that and not the other way round. Preventing the kernel from
> using a facility freely is not an option, ever.
>
> The insanities of KVM performance optimizations have bitten us more than
> once.
>
> For this particular case at hand I don't care much and we should just
> rip the whole RDPID thing out unconditionally. We still have zero
> numbers about the performance difference vs. LSL.
>
> Thanks,
>
>         tglx

It is hardly going to be a performance difference for paranoid entry,
which is hopefully rare enough that it falls into the noise.
On Fri, Aug 21, 2020 at 12:55:53PM -0700, hpa@zytor.com wrote:
> It is hardly going to be a performance difference for paranoid entry,
> which is hopefully rare enough that it falls into the noise.

Try perf some day ;-)

But yeah, given the utter trainwreck that NMIs are anyway, this is all
not going to matter much.
On Fri, Aug 21, 2020 at 3:07 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 21/08/20 11:48, Borislav Petkov wrote:
> >> It's not like we grab MSRs every day. The user-return notifier restores
> >> 6 MSRs (7 on very old processors). The last two that were added were
> >> MSR_TSC_AUX itself in 2009 (!) and MSR_IA32_TSX_CTRL last year.
> > What about "If it is a shared resource, there better be an agreement
> > about sharing it." is not clear?
> >
> > It doesn't matter how many or which resources - there needs to be a
> > contract for shared use so that shared use is possible. It is that
> > simple.
>
> Sure, and I'll make sure to have that discussion the next time we add a
> shared MSR in 2029.
>
> In the meanwhile:
>
> * for the syscall MSRs, patches to share them were reviewed by hpa and
> peterz so I guess there's no problem.
>
> * for MSR_TSC_AUX, which is the one that is causing problems, everybody
> seems to agree with just using LSL (in the lack of specific numbers on
> performance improvements from RDPID).
>
> * for MSR_IA32_TSX_CTRL.RTM_DISABLE, which is the only one that was
> added in the last 10 years, I'm pretty sure there are no plans for using
> the Trusty Sidechannel eXtension in the kernel. So that should be fine
> too. (The CPUID_CLEAR bit of the MSR is not shared).

Regardless of how anyone feels about who owns what in arch/x86 vs
arch/x86/kvm, the x86 architecture is a mess that comes from Intel and
AMD, and we have to deal with it. On VMX, when a VM exits, the VM's
value of MSR_TSC_AUX is live, and we can take an NMI, MCE, or abominable
new #SX, #VE, #VC, etc. on the next instruction boundary. And unless we
use the atomic MSR switch mechanism, the result is that we're going
through the entry path with guest-controlled MSRs.

So I think we can chalk this one up to obnoxious hardware.
On 22/08/20 18:42, Andy Lutomirski wrote:
> On VMX, when a VM exits, the VM's value of MSR_TSC_AUX is live, and we
> can take an NMI, MCE, or abominable new #SX, #VE, #VC, etc. on the next
> instruction boundary. And unless we use the atomic MSR switch
> mechanism, the result is that we're going through the entry path with
> guest-controlled MSRs.

If anything of that is a problem, we can and will use atomic MSR
switching; it's not worth doing complicated stuff if you're going to pay
the price of rdmsr/wrmsr anyway.

The remaining cases are MSRs that are really meant for usermode (such as
the syscall MSRs), and especially the edge cases of these two MSRs that
the kernel doesn't mind too much about. But they are really, really
rare; I don't expect any new one coming soon, and if they are ever
needed (by SGX perhaps?!?) I'll certainly loop you guys in.

Paolo
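The "atomic MSR switch mechanism" both Andy and Paolo refer to is the VMX VM-entry/VM-exit MSR-load area: a list of 16-byte entries the CPU walks in microcode, so the swap happens before any NMI can be delivered. A sketch of the entry layout (which matches the SDM) and a simulation of what the CPU does with an exit-time load list; the apply function and fake MSR array are illustrative, not VMX itself:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* 16-byte MSR-load-area entry, per the Intel SDM's VMX chapters. */
struct vmx_msr_entry {
	uint32_t index;      /* MSR number */
	uint32_t reserved;   /* must be zero */
	uint64_t value;      /* value loaded on VM-entry / VM-exit */
};

#define MSR_TSC_AUX 0xc0000103u

/* Tiny simulated MSR file, indexed by the low 16 bits of the MSR number. */
static uint64_t fake_msrs[0x10000];

/* Models the microcode walking the VM-exit MSR-load list. */
static void apply_msr_load_list(const struct vmx_msr_entry *list, size_t n)
{
	for (size_t i = 0; i < n; i++)
		fake_msrs[list[i].index & 0xffff] = list[i].value;
}

/* With TSC_AUX on the exit-load list, the host value is back before
 * the first post-exit instruction boundary where an NMI could land. */
static uint64_t demo_restore_host_tsc_aux(uint64_t host_val)
{
	struct vmx_msr_entry exit_load[1] = {
		{ .index = MSR_TSC_AUX, .reserved = 0, .value = host_val }
	};
	apply_msr_load_list(exit_load, 1);
	return fake_msrs[MSR_TSC_AUX & 0xffff];
}
```

The trade-off in the thread follows directly: the list closes the NMI window, but pays the microcoded MSR-load cost on every VM-Exit instead of once per return to userspace.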
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 98e4d8886f11c..a925c0cf89c1a 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -342,9 +342,9 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm

-.macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req
+.macro SAVE_AND_SET_GSBASE scratch_reg:req save_reg:req no_rdpid=0
 	rdgsbase \save_reg
-	GET_PERCPU_BASE \scratch_reg
+	GET_PERCPU_BASE \scratch_reg \no_rdpid
 	wrgsbase \scratch_reg
 .endm

@@ -375,11 +375,15 @@ For 32-bit we have the following conventions - kernel is built with
  * We normally use %gs for accessing per-CPU data, but we are setting up
  * %gs here and obviously can not use %gs itself to access per-CPU data.
  */
-.macro GET_PERCPU_BASE reg:req
+.macro GET_PERCPU_BASE reg:req no_rdpid=0
+	.if \no_rdpid
+	LOAD_CPU_AND_NODE_SEG_LIMIT \reg
+	.else
 	ALTERNATIVE \
 		"LOAD_CPU_AND_NODE_SEG_LIMIT \reg", \
 		"RDPID	\reg", \
 		X86_FEATURE_RDPID
+	.endif
 	andq	$VDSO_CPUNODE_MASK, \reg
 	movq	__per_cpu_offset(, \reg, 8), \reg
 .endm
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 70dea93378162..fd915c46297c5 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -842,8 +842,13 @@ SYM_CODE_START_LOCAL(paranoid_entry)
 	 *
 	 * The MSR write ensures that no subsequent load is based on a
 	 * mispredicted GSBASE. No extra FENCE required.
+	 *
+	 * Disallow RDPID if KVM is enabled as it may consume a guest's TSC_AUX
+	 * if an NMI arrives in KVM's run loop.  KVM loads guest's TSC_AUX on
+	 * VM-Enter and may not restore the host's value until the CPU returns
+	 * to userspace, i.e. KVM depends on the kernel not using TSC_AUX.
 	 */
-	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx
+	SAVE_AND_SET_GSBASE scratch_reg=%rax save_reg=%rbx no_rdpid=IS_ENABLED(CONFIG_KVM)
 	ret

 .Lparanoid_entry_checkgs:
Don't use RDPID in the paranoid entry flow if KVM is enabled, as doing
so can consume a KVM guest's MSR_TSC_AUX value if an NMI arrives in
KVM's run loop.

As a performance optimization, KVM loads the guest's TSC_AUX when a CPU
first enters its run loop; on AMD's SVM, KVM doesn't restore the host's
value until the CPU exits the run loop. VMX is even more aggressive and
defers restoring the host's value until the CPU returns to userspace.
This optimization obviously relies on the kernel not consuming TSC_AUX,
which falls apart if an NMI arrives in the run loop.

Removing KVM's optimization would be painful, as both SVM and VMX would
need to context switch the MSR on every VM-Enter (2x WRMSR + 1x RDMSR),
whereas using LSL instead of RDPID is a minor blip.

Fixes: eaad981291ee3 ("x86/entry/64: Introduce the FIND_PERCPU_BASE macro")
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Chang Seok Bae <chang.seok.bae@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: kvm@vger.kernel.org
Reported-by: Tom Lendacky <thomas.lendacky@amd.com>
Debugged-by: Tom Lendacky <thomas.lendacky@amd.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
Andy, I know you said "unconditionally", but it felt weird adding a
comment way down in GET_PERCPU_BASE without plumbing a param in to help
provide context. But paranoid_entry is the only user, so adding a param
that is unconditional also felt weird. That being said, I definitely
don't have a strong opinion one way or the other.

 arch/x86/entry/calling.h  | 10 +++++++---
 arch/x86/entry/entry_64.S |  7 ++++++-
 2 files changed, 13 insertions(+), 4 deletions(-)