diff mbox

KVM: APIC: avoid instruction emulation for EOI writes

Message ID 625BA99ED14B2D499DC4E29D8138F15063045B0C0C@shsmsx502.ccr.corp.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Tian, Kevin Aug. 29, 2011, 6:09 a.m. UTC
Hi, Avi,

Here comes the patch:

    KVM: APIC: avoid instruction emulation for EOI writes
    
    Instruction emulation for EOI writes can be skipped, since sane
    guest simply uses MOV instead of string operations. This is a nice
    improvement when guest doesn't support x2apic or hyper-V EOI
    support.
    
    a single VM bandwidth is observed with ~8% bandwidth improvement
    (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.
    
    Signed-off-by: Kevin Tian <kevin.tian@intel.com>
    <Based on earlier work from>:
    Signed-off-by: Eddie Dong <eddie.dong@intel.com>
    Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Thanks
Kevin

> -----Original Message-----
> From: Avi Kivity [mailto:avi@redhat.com]
> Sent: Thursday, August 25, 2011 12:39 PM
> To: Tian, Kevin
> Cc: kvm@vger.kernel.org; Nakajima, Jun
> Subject: Re: about vEOI optimization
> 
> On 08/25/2011 05:24 AM, Tian, Kevin wrote:
> > >
> > >  Another option is the hyper-V EOI support, which can also eliminate the
> > >  EOI exit when no additional interrupt is pending.  This can improve EOI
> > >  performance even more.
> > >
> >
> > yes, and this is an orthogonal option.
> >
> > So if you agree, I'll send out an updated patch atop their work.
> >
> >
> 
> Thanks.
> 
> --
> I have a truly marvellous patch that fixes the bug which this
> signature is too narrow to contain.

Comments

Avi Kivity Aug. 29, 2011, 7:23 a.m. UTC | #1
On 08/29/2011 09:09 AM, Tian, Kevin wrote:
>
>   static int handle_apic_access(struct kvm_vcpu *vcpu)
>   {
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int access_type, offset;
> +
> +	access_type = (exit_qualification>>  12)&  0xf;
> +	offset = exit_qualification&  0xfff;
> +	/*
> +	 * Sane guest uses MOV instead of string operations to
> +	 * write EOI, with written value not cared. So make a
> +	 * short-circuit here by avoiding heavy instruction
> +	 * emulation.
> +	 */
> +	if ((access_type == 1)&&  (offset == APIC_EOI)) {

Please replace this 1 with a #define.

> +		kvm_lapic_set_eoi(vcpu);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
>   	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>   }
>
>

Looks good otherwise.
Jan Kiszka Aug. 29, 2011, 10:24 a.m. UTC | #2
On 2011-08-29 08:09, Tian, Kevin wrote:
> Hi, Avi,
> 
> Here comes the patch:
> 
>     KVM: APIC: avoid instruction emulation for EOI writes
>     
>     Instruction emulation for EOI writes can be skipped, since sane
>     guest simply uses MOV instead of string operations. This is a nice
>     improvement when guest doesn't support x2apic or hyper-V EOI
>     support.
>     
>     a single VM bandwidth is observed with ~8% bandwidth improvement
>     (7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.
>     
>     Signed-off-by: Kevin Tian <kevin.tian@intel.com>
>     <Based on earlier work from>:
>     Signed-off-by: Eddie Dong <eddie.dong@intel.com>
>     Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 57dcbd4..933187e 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
>  	return 0;
>  }
>  
> +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	if (apic)
> +		apic_set_eoi(vcpu->arch.apic);
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
> +
>  void kvm_free_lapic(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu->arch.apic)
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 52c9e6b..8287243 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
>  void kvm_lapic_reset(struct kvm_vcpu *vcpu);
>  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5e8d411..35e4af7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
>  
>  static int handle_apic_access(struct kvm_vcpu *vcpu)
>  {
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int access_type, offset;
> +
> +	access_type = (exit_qualification >> 12) & 0xf;
> +	offset = exit_qualification & 0xfff;
> +	/*
> +	 * Sane guest uses MOV instead of string operations to
> +	 * write EOI, with written value not cared. So make a
> +	 * short-circuit here by avoiding heavy instruction 
> +	 * emulation.
> +	 */

Is there no cheap way to validate this assumption and fall back to the
slow path in case it doesn't apply? E.g. reading the first instruction
byte and matching it against a whitelist? Even if the ignored scenarios
are highly unlikely, I think we so far tried hard to provide both fast
and accurate results to the guest in all cases.

> +	if ((access_type == 1) && (offset == APIC_EOI)) {
> +		kvm_lapic_set_eoi(vcpu);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>  }
> 
> Thanks
> Kevin

Nice optimization otherwise!

Jan
Avi Kivity Aug. 29, 2011, 10:59 a.m. UTC | #3
On 08/29/2011 01:24 PM, Jan Kiszka wrote:
> >
> >   static int handle_apic_access(struct kvm_vcpu *vcpu)
> >   {
> >  +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> >  +	int access_type, offset;
> >  +
> >  +	access_type = (exit_qualification>>  12)&  0xf;
> >  +	offset = exit_qualification&  0xfff;
> >  +	/*
> >  +	 * Sane guest uses MOV instead of string operations to
> >  +	 * write EOI, with written value not cared. So make a
> >  +	 * short-circuit here by avoiding heavy instruction
> >  +	 * emulation.
> >  +	 */
>
> Is there no cheap way to validate this assumption and fall back to the
> slow path in case it doesn't apply? E.g. reading the first instruction
> byte and matching it against a whitelist? Even if the ignored scenarios
> are highly unlikely, I think we so far tried hard to provide both fast
> and accurate results to the guest in all cases.
>

Just reading the first byte requires a guest page table walk.  This is 
probably the highest cost in emulation (which also requires a walk for 
the data access).
Jan Kiszka Aug. 29, 2011, 11:03 a.m. UTC | #4
On 2011-08-29 12:59, Avi Kivity wrote:
> On 08/29/2011 01:24 PM, Jan Kiszka wrote:
>>>
>>>   static int handle_apic_access(struct kvm_vcpu *vcpu)
>>>   {
>>>  +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>>  +	int access_type, offset;
>>>  +
>>>  +	access_type = (exit_qualification>>  12)&  0xf;
>>>  +	offset = exit_qualification&  0xfff;
>>>  +	/*
>>>  +	 * Sane guest uses MOV instead of string operations to
>>>  +	 * write EOI, with written value not cared. So make a
>>>  +	 * short-circuit here by avoiding heavy instruction
>>>  +	 * emulation.
>>>  +	 */
>>
>> Is there no cheap way to validate this assumption and fall back to the
>> slow path in case it doesn't apply? E.g. reading the first instruction
>> byte and matching it against a whitelist? Even if the ignored scenarios
>> are highly unlikely, I think we so far tried hard to provide both fast
>> and accurate results to the guest in all cases.
>>
> 
> Just reading the first byte requires a guest page table walk.  This is 
> probably the highest cost in emulation (which also requires a walk for 
> the data access).

And what about caching the result of the first walk? Usually, a "sane
guest" won't have many code pages that issue the EIO.

Jan
Avi Kivity Aug. 29, 2011, 11:11 a.m. UTC | #5
On 08/29/2011 02:03 PM, Jan Kiszka wrote:
> >
> >  Just reading the first byte requires a guest page table walk.  This is
> >  probably the highest cost in emulation (which also requires a walk for
> >  the data access).
>
> And what about caching the result of the first walk? Usually, a "sane
> guest" won't have many code pages that issue the EIO.
>

There's no way to know when to invalidate the cache.

We could go a bit further, and cache the the whole thing.  On the first 
exit, do the entire emulation, and remember %rip.  On the second exit, 
if %rip matches, skip directly to kvm_lapic_eoi().

But I don't think it's worth it.  This also has failure modes, and 
really, no guest will ever write to EOI with stosl.
Jan Kiszka Aug. 29, 2011, 1:55 p.m. UTC | #6
On 2011-08-29 13:11, Avi Kivity wrote:
> On 08/29/2011 02:03 PM, Jan Kiszka wrote:
>>>
>>>  Just reading the first byte requires a guest page table walk.  This is
>>>  probably the highest cost in emulation (which also requires a walk for
>>>  the data access).
>>
>> And what about caching the result of the first walk? Usually, a "sane
>> guest" won't have many code pages that issue the EIO.
>>
> 
> There's no way to know when to invalidate the cache.

Set the affected code page read-only?

> 
> We could go a bit further, and cache the the whole thing.  On the first 
> exit, do the entire emulation, and remember %rip.  On the second exit, 
> if %rip matches, skip directly to kvm_lapic_eoi().
> 
> But I don't think it's worth it.  This also has failure modes, and 
> really, no guest will ever write to EOI with stosl.

...or add/sub/and/or etc. Well, we've done other crazy things in the
past just to keep even the unlikely case correct. I was just wondering
if that policy changed.

However, I just realized that user space is able to avoid this
inaccuracy for potentially insane guests by not using in-kernel
irqchips. So we have at least a knob.

Jan
Avi Kivity Aug. 29, 2011, 2:14 p.m. UTC | #7
On 08/29/2011 04:55 PM, Jan Kiszka wrote:
> On 2011-08-29 13:11, Avi Kivity wrote:
> >  On 08/29/2011 02:03 PM, Jan Kiszka wrote:
> >>>
> >>>   Just reading the first byte requires a guest page table walk.  This is
> >>>   probably the highest cost in emulation (which also requires a walk for
> >>>   the data access).
> >>
> >>  And what about caching the result of the first walk? Usually, a "sane
> >>  guest" won't have many code pages that issue the EIO.
> >>
> >
> >  There's no way to know when to invalidate the cache.
>
> Set the affected code page read-only?

The virt-phys mapping could change too.  And please, don't think of new 
reasons to write protect pages, they break up my lovely 2M maps.

> >
> >  We could go a bit further, and cache the the whole thing.  On the first
> >  exit, do the entire emulation, and remember %rip.  On the second exit,
> >  if %rip matches, skip directly to kvm_lapic_eoi().
> >
> >  But I don't think it's worth it.  This also has failure modes, and
> >  really, no guest will ever write to EOI with stosl.
>
> ...or add/sub/and/or etc.

Argh, yes, flags can be updated.

Actually, this might work - if we get a read access first as part of the 
RMW, we'll emulate the instruction.  No idea what the hardware does in 
this case.

>   Well, we've done other crazy things in the
> past just to keep even the unlikely case correct. I was just wondering
> if that policy changed.

I can't answer yes to that question.  But I see no way to make it work 
both fast and correct.

>
> However, I just realized that user space is able to avoid this
> inaccuracy for potentially insane guests by not using in-kernel
> irqchips. So we have at least a knob.

Could/should have a flag to disable this in the kernel as well.
diff mbox

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@  static int apic_mmio_write(struct kvm_io_device *this,
 	return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@  int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..35e4af7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4540,6 +4540,22 @@  static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int access_type, offset;
+
+	access_type = (exit_qualification >> 12) & 0xf;
+	offset = exit_qualification & 0xfff;
+	/*
+	 * Sane guest uses MOV instead of string operations to
+	 * write EOI, with written value not cared. So make a
+	 * short-circuit here by avoiding heavy instruction 
+	 * emulation.
+	 */
+	if ((access_type == 1) && (offset == APIC_EOI)) {
+		kvm_lapic_set_eoi(vcpu);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }