diff mbox

RFC: Getting rid of LTR in VMX

Message ID 38d613dc-5b46-1f4f-04d1-53c01932e6d6@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Paolo Bonzini Feb. 20, 2017, 11:05 a.m. UTC
On 18/02/2017 04:29, Andy Lutomirski wrote:
> There's no code here because the patch is trivial, but I want to run
> the idea by you all first to see if there are any issues.
> 
> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
> wastes lots of cycles bumping it back up to accomodate the io bitmap.

Actually looked at the code now...

reload_tss is only invoked for userspace exits, so it is a nice-to-have
but it wouldn't show on most workloads.  Still it does save 150-200
clock cycles to remove it (I just commented out reload_tss() from
__vmx_load_host_state to test).

Another 100-150 could be saved if we could just use rdgsbase/wrgsbase, 
instead of rdmsr/wrmsr, to read and write the kernel GS.  Super hacky
patch after sig.

vmx_save_host_state is really slow...  It would be nice if Intel
defined an XSAVES state to do it (just like AMD's vmload/vmsave).

Paolo


> I propose that we rework this.  Add a percpu variable that indicates
> whether the TSS limit needs to be refreshed.  On task switch, if the
> new task has TIF_IO_BITMAP set, then check that flag and, if set,
> refresh TR and clear the flag.  On VMX exit, set the flag.
> 
> The TSS limit is (phew!) invisible to userspace, so we don't have ABI
> issues to worry about here.  We also shouldn't have security issues
> because a too-low TSS limit just results in unprivileged IO operations
> generating #GP, which is exactly what we want.
> 
> What do you all think?  I expect a speedup of a couple hundred cycles
> on each VM exit.

Comments

Andy Lutomirski Feb. 20, 2017, 4:46 p.m. UTC | #1
On Mon, Feb 20, 2017 at 3:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 18/02/2017 04:29, Andy Lutomirski wrote:
>> There's no code here because the patch is trivial, but I want to run
>> the idea by you all first to see if there are any issues.
>>
>> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
>> wastes lots of cycles bumping it back up to accomodate the io bitmap.
>
> Actually looked at the code now...
>
> reload_tss is only invoked for userspace exits, so it is a nice-to-have
> but it wouldn't show on most workloads.  Still it does save 150-200
> clock cycles to remove it (I just commented out reload_tss() from
> __vmx_load_host_state to test).

That's for anything involving userspace or preemption, right?

>
> Another 100-150 could be saved if we could just use rdgsbase/wrgsbase,
> instead of rdmsr/wrmsr, to read and write the kernel GS.  Super hacky
> patch after sig.

I have a Real Patch Series (tm) to do that, but it has a couple of
unresolved corner cases so far.

That being said, vmx_save_host_state() is, um, poorly optimized.  I'll
try to find some time to fix the obvious things.  Meanwhile, I'll send
real patches for TR.

> +       cr4_set_bits(X86_CR4_FSGSBASE);

Nice root hole :-p

--Andy
Paolo Bonzini Feb. 20, 2017, 4:51 p.m. UTC | #2
On 20/02/2017 17:46, Andy Lutomirski wrote:
> On Mon, Feb 20, 2017 at 3:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>>
>> On 18/02/2017 04:29, Andy Lutomirski wrote:
>>> There's no code here because the patch is trivial, but I want to run
>>> the idea by you all first to see if there are any issues.
>>>
>>> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
>>> wastes lots of cycles bumping it back up to accomodate the io bitmap.
>>
>> Actually looked at the code now...
>>
>> reload_tss is only invoked for userspace exits, so it is a nice-to-have
>> but it wouldn't show on most workloads.  Still it does save 150-200
>> clock cycles to remove it (I just commented out reload_tss() from
>> __vmx_load_host_state to test).
> 
> That's for anything involving userspace or preemption, right?

Yes.  But 150-200 clock cycles are nothing compared to the cache misses
you get from preemption, so I'd ignore that.  Saving 300 clock cycles on
userspace exits from TR+GSBASE would be about 5% on my Haswell.

> That being said, vmx_save_host_state() is, um, poorly optimized.

Yeah, but again it doesn't run that often in practical cases.

Paolo
Andy Lutomirski Feb. 20, 2017, 4:55 p.m. UTC | #3
On Mon, Feb 20, 2017 at 8:51 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 20/02/2017 17:46, Andy Lutomirski wrote:
>> On Mon, Feb 20, 2017 at 3:05 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>>
>>>
>>> On 18/02/2017 04:29, Andy Lutomirski wrote:
>>>> There's no code here because the patch is trivial, but I want to run
>>>> the idea by you all first to see if there are any issues.
>>>>
>>>> VMX is silly and forces the TSS limit to the minimum on VM exits.  KVM
>>>> wastes lots of cycles bumping it back up to accomodate the io bitmap.
>>>
>>> Actually looked at the code now...
>>>
>>> reload_tss is only invoked for userspace exits, so it is a nice-to-have
>>> but it wouldn't show on most workloads.  Still it does save 150-200
>>> clock cycles to remove it (I just commented out reload_tss() from
>>> __vmx_load_host_state to test).
>>
>> That's for anything involving userspace or preemption, right?
>
> Yes.  But 150-200 clock cycles are nothing compared to the cache misses
> you get from preemption, so I'd ignore that.  Saving 300 clock cycles on
> userspace exits from TR+GSBASE would be about 5% on my Haswell.

That's still 5% :)

>
>> That being said, vmx_save_host_state() is, um, poorly optimized.
>
> Yeah, but again it doesn't run that often in practical cases.

But it will if someone ever tries to upstream Google's approach of not
emulating anything in the kernel. :)
Paolo Bonzini Feb. 20, 2017, 10:02 p.m. UTC | #4
> > Yes.  But 150-200 clock cycles are nothing compared to the cache misses
> > you get from preemption, so I'd ignore that.  Saving 300 clock cycles on
> > userspace exits from TR+GSBASE would be about 5% on my Haswell.
> 
> That's still 5% :)

Yes, 5% on userspace exits is good (though they're rare).

OTOH, 300 clock cycles on preemption are nothing to write home about.

Paolo
H. Peter Anvin Feb. 20, 2017, 10:23 p.m. UTC | #5
On February 20, 2017 2:02:53 PM PST, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>> > Yes.  But 150-200 clock cycles are nothing compared to the cache
>misses
>> > you get from preemption, so I'd ignore that.  Saving 300 clock
>cycles on
>> > userspace exits from TR+GSBASE would be about 5% on my Haswell.
>> 
>> That's still 5% :)
>
>Yes, 5% on userspace exits is good (though they're rare).
>
>OTOH, 300 clock cycles on preemption are nothing to write home about.
>
>Paolo

Oh, I would say it is... especially when it's basically free.
diff mbox

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9856b73a21ad..e76bfec463bc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2138,9 +2138,12 @@  static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 #endif
 
 #ifdef CONFIG_X86_64
-	rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
+	local_irq_disable();
+	asm volatile("swapgs; rdgsbase %0" : "=r" (vmx->msr_host_kernel_gs_base));
 	if (is_long_mode(&vmx->vcpu))
-		wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+		asm volatile("wrgsbase %0" : : "r" (vmx->msr_guest_kernel_gs_base));
+	asm volatile("swapgs");
+	local_irq_enable();
 #endif
 	if (boot_cpu_has(X86_FEATURE_MPX))
 		rdmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
@@ -2152,14 +2155,20 @@  static void vmx_save_host_state(struct kvm_vcpu *vcpu)
 
 static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 {
+	unsigned long flags;
+
 	if (!vmx->host_state.loaded)
 		return;
 
 	++vmx->vcpu.stat.host_state_reload;
 	vmx->host_state.loaded = 0;
 #ifdef CONFIG_X86_64
+	local_irq_save(flags);
+	asm("swapgs");
 	if (is_long_mode(&vmx->vcpu))
-		rdmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+		asm volatile("rdgsbase %0" : "=r" (vmx->msr_guest_kernel_gs_base));
+	asm volatile("wrgsbase %0; swapgs" : : "r" (vmx->msr_host_kernel_gs_base));
+	local_irq_restore(flags);
 #endif
 	if (vmx->host_state.gs_ldt_reload_needed) {
 		kvm_load_ldt(vmx->host_state.ldt_sel);
@@ -2177,10 +2186,7 @@  static void __vmx_load_host_state(struct vcpu_vmx *vmx)
 		loadsegment(es, vmx->host_state.es_sel);
 	}
 #endif
-	reload_tss();
-#ifdef CONFIG_X86_64
-	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
-#endif
+	//reload_tss();
 	if (vmx->host_state.msr_host_bndcfgs)
 		wrmsrl(MSR_IA32_BNDCFGS, vmx->host_state.msr_host_bndcfgs);
 	load_gdt(this_cpu_ptr(&host_gdt));
@@ -3469,6 +3475,7 @@  static int hardware_enable(void)
 		wrmsrl(MSR_IA32_FEATURE_CONTROL, old | test_bits);
 	}
 	cr4_set_bits(X86_CR4_VMXE);
+	cr4_set_bits(X86_CR4_FSGSBASE);
 
 	if (vmm_exclusive) {
 		kvm_cpu_vmxon(phys_addr);