Message ID: 625BA99ED14B2D499DC4E29D8138F15063045B0C0C@shsmsx502.ccr.corp.intel.com (mailing list archive)
State: New, archived
On 08/29/2011 09:09 AM, Tian, Kevin wrote:
>
>  static int handle_apic_access(struct kvm_vcpu *vcpu)
>  {
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int access_type, offset;
> +
> +	access_type = (exit_qualification >> 12) & 0xf;
> +	offset = exit_qualification & 0xfff;
> +	/*
> +	 * Sane guest uses MOV instead of string operations to
> +	 * write EOI, with written value not cared. So make a
> +	 * short-circuit here by avoiding heavy instruction
> +	 * emulation.
> +	 */
> +	if ((access_type == 1) && (offset == APIC_EOI)) {

Please replace this 1 with a #define.

> +		kvm_lapic_set_eoi(vcpu);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>  }

Looks good otherwise.
On 2011-08-29 08:09, Tian, Kevin wrote:
> Hi, Avi,
>
> Here comes the patch:
>
> KVM: APIC: avoid instruction emulation for EOI writes
>
> Instruction emulation for EOI writes can be skipped, since a sane
> guest simply uses MOV instead of string operations. This is a nice
> improvement when the guest doesn't support x2apic or Hyper-V EOI
> support.
>
> A single-VM bandwidth test shows ~8% improvement (7.4Gbps->8Gbps),
> by saving ~5% of cycles from EOI emulation.
>
> Signed-off-by: Kevin Tian <kevin.tian@intel.com>
> <Based on earlier work from>:
> Signed-off-by: Eddie Dong <eddie.dong@intel.com>
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
>
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 57dcbd4..933187e 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
>  	return 0;
>  }
>
> +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_lapic *apic = vcpu->arch.apic;
> +
> +	if (apic)
> +		apic_set_eoi(vcpu->arch.apic);
> +}
> +EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
> +
>  void kvm_free_lapic(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu->arch.apic)
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 52c9e6b..8287243 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
>  void kvm_lapic_reset(struct kvm_vcpu *vcpu);
>  u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> +void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5e8d411..35e4af7 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
>
>  static int handle_apic_access(struct kvm_vcpu *vcpu)
>  {
> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> +	int access_type, offset;
> +
> +	access_type = (exit_qualification >> 12) & 0xf;
> +	offset = exit_qualification & 0xfff;
> +	/*
> +	 * Sane guest uses MOV instead of string operations to
> +	 * write EOI, with written value not cared. So make a
> +	 * short-circuit here by avoiding heavy instruction
> +	 * emulation.
> +	 */

Is there no cheap way to validate this assumption and fall back to the
slow path in case it doesn't apply? E.g. reading the first instruction
byte and matching it against a whitelist? Even if the ignored scenarios
are highly unlikely, I think we so far tried hard to provide both fast
and accurate results to the guest in all cases.

> +	if ((access_type == 1) && (offset == APIC_EOI)) {
> +		kvm_lapic_set_eoi(vcpu);
> +		skip_emulated_instruction(vcpu);
> +		return 1;
> +	}
>  	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
>  }
>
> Thanks
> Kevin

Nice optimization otherwise!

Jan
On 08/29/2011 01:24 PM, Jan Kiszka wrote:
> >
> >  static int handle_apic_access(struct kvm_vcpu *vcpu)
> >  {
> > +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> > +	int access_type, offset;
> > +
> > +	access_type = (exit_qualification >> 12) & 0xf;
> > +	offset = exit_qualification & 0xfff;
> > +	/*
> > +	 * Sane guest uses MOV instead of string operations to
> > +	 * write EOI, with written value not cared. So make a
> > +	 * short-circuit here by avoiding heavy instruction
> > +	 * emulation.
> > +	 */
>
> Is there no cheap way to validate this assumption and fall back to the
> slow path in case it doesn't apply? E.g. reading the first instruction
> byte and matching it against a whitelist? Even if the ignored scenarios
> are highly unlikely, I think we so far tried hard to provide both fast
> and accurate results to the guest in all cases.
>

Just reading the first byte requires a guest page table walk. This is
probably the highest cost in emulation (which also requires a walk for
the data access).
On 2011-08-29 12:59, Avi Kivity wrote:
> On 08/29/2011 01:24 PM, Jan Kiszka wrote:
>>>
>>>  static int handle_apic_access(struct kvm_vcpu *vcpu)
>>>  {
>>> +	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>> +	int access_type, offset;
>>> +
>>> +	access_type = (exit_qualification >> 12) & 0xf;
>>> +	offset = exit_qualification & 0xfff;
>>> +	/*
>>> +	 * Sane guest uses MOV instead of string operations to
>>> +	 * write EOI, with written value not cared. So make a
>>> +	 * short-circuit here by avoiding heavy instruction
>>> +	 * emulation.
>>> +	 */
>>
>> Is there no cheap way to validate this assumption and fall back to the
>> slow path in case it doesn't apply? E.g. reading the first instruction
>> byte and matching it against a whitelist? Even if the ignored scenarios
>> are highly unlikely, I think we so far tried hard to provide both fast
>> and accurate results to the guest in all cases.
>>
>
> Just reading the first byte requires a guest page table walk. This is
> probably the highest cost in emulation (which also requires a walk for
> the data access).

And what about caching the result of the first walk? Usually, a "sane
guest" won't have many code pages that issue the EOI.

Jan
On 08/29/2011 02:03 PM, Jan Kiszka wrote:
> >
> > Just reading the first byte requires a guest page table walk. This is
> > probably the highest cost in emulation (which also requires a walk for
> > the data access).
>
> And what about caching the result of the first walk? Usually, a "sane
> guest" won't have many code pages that issue the EOI.
>

There's no way to know when to invalidate the cache.

We could go a bit further, and cache the whole thing. On the first
exit, do the entire emulation, and remember %rip. On the second exit,
if %rip matches, skip directly to kvm_lapic_set_eoi().

But I don't think it's worth it. This also has failure modes, and
really, no guest will ever write to EOI with stosl.
On 2011-08-29 13:11, Avi Kivity wrote:
> On 08/29/2011 02:03 PM, Jan Kiszka wrote:
>>>
>>> Just reading the first byte requires a guest page table walk. This is
>>> probably the highest cost in emulation (which also requires a walk for
>>> the data access).
>>
>> And what about caching the result of the first walk? Usually, a "sane
>> guest" won't have many code pages that issue the EOI.
>>
>
> There's no way to know when to invalidate the cache.

Set the affected code page read-only?

>
> We could go a bit further, and cache the whole thing. On the first
> exit, do the entire emulation, and remember %rip. On the second exit,
> if %rip matches, skip directly to kvm_lapic_set_eoi().
>
> But I don't think it's worth it. This also has failure modes, and
> really, no guest will ever write to EOI with stosl.

...or add/sub/and/or etc. Well, we've done other crazy things in the
past just to keep even the unlikely case correct. I was just wondering
if that policy changed.

However, I just realized that user space is able to avoid this
inaccuracy for potentially insane guests by not using in-kernel
irqchips. So we have at least a knob.

Jan
On 08/29/2011 04:55 PM, Jan Kiszka wrote:
> On 2011-08-29 13:11, Avi Kivity wrote:
> > On 08/29/2011 02:03 PM, Jan Kiszka wrote:
> >>>
> >>> Just reading the first byte requires a guest page table walk. This is
> >>> probably the highest cost in emulation (which also requires a walk for
> >>> the data access).
> >>
> >> And what about caching the result of the first walk? Usually, a "sane
> >> guest" won't have many code pages that issue the EOI.
> >>
> >
> > There's no way to know when to invalidate the cache.
>
> Set the affected code page read-only?

The virt-phys mapping could change too. And please, don't think of new
reasons to write protect pages, they break up my lovely 2M maps.

>
> > We could go a bit further, and cache the whole thing. On the first
> > exit, do the entire emulation, and remember %rip. On the second exit,
> > if %rip matches, skip directly to kvm_lapic_set_eoi().
> >
> > But I don't think it's worth it. This also has failure modes, and
> > really, no guest will ever write to EOI with stosl.
>
> ...or add/sub/and/or etc.

Argh, yes, flags can be updated.

Actually, this might work - if we get a read access first as part of
the RMW, we'll emulate the instruction. No idea what the hardware does
in this case.

> Well, we've done other crazy things in the
> past just to keep even the unlikely case correct. I was just wondering
> if that policy changed.

I can't answer yes to that question. But I see no way to make it work
both fast and correct.

>
> However, I just realized that user space is able to avoid this
> inaccuracy for potentially insane guests by not using in-kernel
> irqchips. So we have at least a knob.

Could/should have a flag to disable this in the kernel as well.
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 57dcbd4..933187e 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -864,6 +864,15 @@ static int apic_mmio_write(struct kvm_io_device *this,
 	return 0;
 }
 
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	if (apic)
+		apic_set_eoi(vcpu->arch.apic);
+}
+EXPORT_SYMBOL_GPL(kvm_lapic_set_eoi);
+
 void kvm_free_lapic(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 52c9e6b..8287243 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -26,6 +26,7 @@ int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu);
 void kvm_lapic_reset(struct kvm_vcpu *vcpu);
 u64 kvm_lapic_get_cr8(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
+void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
 void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
 u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5e8d411..35e4af7 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4540,6 +4540,22 @@ static int handle_xsetbv(struct kvm_vcpu *vcpu)
 
 static int handle_apic_access(struct kvm_vcpu *vcpu)
 {
+	unsigned long exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
+	int access_type, offset;
+
+	access_type = (exit_qualification >> 12) & 0xf;
+	offset = exit_qualification & 0xfff;
+	/*
+	 * Sane guest uses MOV instead of string operations to
+	 * write EOI, with written value not cared. So make a
+	 * short-circuit here by avoiding heavy instruction
+	 * emulation.
+	 */
+	if ((access_type == 1) && (offset == APIC_EOI)) {
+		kvm_lapic_set_eoi(vcpu);
+		skip_emulated_instruction(vcpu);
+		return 1;
+	}
 	return emulate_instruction(vcpu, 0) == EMULATE_DONE;
 }