Message ID | 20200219174354.84726-8-roger.pau@citrix.com (mailing list archive)
---|---
State | Superseded
Series | x86: improve assisted tlb flush and use it in guest mode
On 19.02.2020 18:43, Roger Pau Monne wrote:
> Use Xen's L0 HVMOP_flush_tlbs hypercall in order to perform flushes.
> This greatly increases the performance of TLB flushes when running
> with a high amount of vCPUs as a Xen guest, and is specially important
> when running in shim mode.
>
> The following figures are from a PV guest running `make -j32 xen` in
> shim mode with 32 vCPUs and HAP.
>
> Using x2APIC and ALLBUT shorthand:
> real    4m35.973s
> user    4m35.110s
> sys     36m24.117s
>
> Using L0 assisted flush:
> real    1m2.596s
> user    4m34.818s
> sys     5m16.374s
>
> The implementation adds a new hook to hypervisor_ops so other
> enlightenments can also implement such assisted flush just by filling
> the hook. Note that the Xen implementation completely ignores the
> dirty CPU mask and the linear address passed in, and always performs a
> global TLB flush on all vCPUs.

This isn't because of an implementation choice of yours, but because
of how HVMOP_flush_tlbs works. I think the statement should somehow
express this. I also think it wants clarifying that using the
hypercall is indeed faster even in the case of single-page, single-
CPU flush (which I suspect may not be the case especially as vCPU
count grows). The stats above prove a positive overall effect, but
they don't say whether the effect could be even bigger by being at
least a little selective.

> @@ -73,6 +74,15 @@ void __init hypervisor_e820_fixup(struct e820map *e820)
>          ops.e820_fixup(e820);
>  }
>
> +int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
> +                         unsigned int order)
> +{
> +    if ( ops.flush_tlb )
> +        return alternative_call(ops.flush_tlb, mask, va, order);
> +
> +    return -ENOSYS;
> +}

Please no new -ENOSYS anywhere (except in new ports' top level
hypercall handlers).

> @@ -256,6 +257,16 @@ void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags)
>      if ( (flags & ~FLUSH_ORDER_MASK) &&
>           !cpumask_subset(mask, cpumask_of(cpu)) )
>      {
> +        if ( cpu_has_hypervisor &&
> +             !(flags & ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_VA_VALID |
> +                         FLUSH_ORDER_MASK)) &&
> +             !hypervisor_flush_tlb(mask, va, (flags - 1) & FLUSH_ORDER_MASK) )
> +        {
> +            if ( tlb_clk_enabled )
> +                tlb_clk_enabled = false;

Why does this need doing here? Couldn't Xen guest setup code
clear the flag?

Jan
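For context, the hook the HVMOP_flush_tlbs remark refers to is the
Xen-specific implementation in the xen.c hunk further down in this
patch: it discards the mask, address and order entirely and issues one
global flush, because the hypercall offers no way to narrow the scope:

static int flush_tlb(const cpumask_t *mask, const void *va, unsigned int order)
{
    /* mask/va/order are ignored; L0 flushes all vCPUs' TLBs globally. */
    return xen_hypercall_hvm_op(HVMOP_flush_tlbs, NULL);
}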
On Fri, Feb 28, 2020 at 06:00:44PM +0100, Jan Beulich wrote:
> On 19.02.2020 18:43, Roger Pau Monne wrote:
> > Use Xen's L0 HVMOP_flush_tlbs hypercall in order to perform flushes.
> > This greatly increases the performance of TLB flushes when running
> > with a high amount of vCPUs as a Xen guest, and is specially important
> > when running in shim mode.
> >
> > The following figures are from a PV guest running `make -j32 xen` in
> > shim mode with 32 vCPUs and HAP.
> >
> > Using x2APIC and ALLBUT shorthand:
> > real    4m35.973s
> > user    4m35.110s
> > sys     36m24.117s
> >
> > Using L0 assisted flush:
> > real    1m2.596s
> > user    4m34.818s
> > sys     5m16.374s
> >
> > The implementation adds a new hook to hypervisor_ops so other
> > enlightenments can also implement such assisted flush just by filling
> > the hook. Note that the Xen implementation completely ignores the
> > dirty CPU mask and the linear address passed in, and always performs a
> > global TLB flush on all vCPUs.
>
> This isn't because of an implementation choice of yours, but because
> of how HVMOP_flush_tlbs works. I think the statement should somehow
> express this. I also think it wants clarifying that using the
> hypercall is indeed faster even in the case of single-page, single-
> CPU flush

Are you sure about this? I think taking a vmexit is going to be more
costly than executing a local invlpg?

> (which I suspect may not be the case especially as vCPU
> count grows). The stats above prove a positive overall effect, but
> they don't say whether the effect could be even bigger by being at
> least a little selective.

I assume that being able to provide a bitmap with the vCPUs on which
the TLB flush should be performed would give us some more performance,
but I haven't looked into it yet.

> > @@ -73,6 +74,15 @@ void __init hypervisor_e820_fixup(struct e820map *e820)
> >          ops.e820_fixup(e820);
> >  }
> >
> > +int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
> > +                         unsigned int order)
> > +{
> > +    if ( ops.flush_tlb )
> > +        return alternative_call(ops.flush_tlb, mask, va, order);
> > +
> > +    return -ENOSYS;
> > +}
>
> Please no new -ENOSYS anywhere (except in new ports' top level
> hypercall handlers).

Ack. Is EOPNOTSUPP OK?

> > @@ -256,6 +257,16 @@ void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags)
> >      if ( (flags & ~FLUSH_ORDER_MASK) &&
> >           !cpumask_subset(mask, cpumask_of(cpu)) )
> >      {
> > +        if ( cpu_has_hypervisor &&
> > +             !(flags & ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_VA_VALID |
> > +                         FLUSH_ORDER_MASK)) &&
> > +             !hypervisor_flush_tlb(mask, va, (flags - 1) & FLUSH_ORDER_MASK) )
> > +        {
> > +            if ( tlb_clk_enabled )
> > +                tlb_clk_enabled = false;
>
> Why does this need doing here? Couldn't Xen guest setup code
> clear the flag?

Because it's possible that the hypercall fails, and hence the tlb
clock should be kept enabled. There's no reason to disable it until
Xen knows the assisted flush is indeed available.

I don't mind moving it to Xen guest setup code, but I'm not sure I see
why it would be any better than doing it here. The only reason I guess
is to avoid checking tlb_clk_enabled on every successful assisted
flush?

Thanks, Roger.
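For reference, a sketch of what the wrapper would look like with the
errno swapped as discussed — only the return value differs from the
hunk quoted above. Any non-zero return makes flush_area_mask() fall
back to the IPI-based flush, so the exact value only needs to be a
negative errno:

int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
                         unsigned int order)
{
    if ( ops.flush_tlb )
        return alternative_call(ops.flush_tlb, mask, va, order);

    /* No enlightenment registered an assisted flush hook. */
    return -EOPNOTSUPP;
}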
On 28.02.2020 18:23, Roger Pau Monné wrote:
> On Fri, Feb 28, 2020 at 06:00:44PM +0100, Jan Beulich wrote:
>> On 19.02.2020 18:43, Roger Pau Monne wrote:
>>> Use Xen's L0 HVMOP_flush_tlbs hypercall in order to perform flushes.
>>> This greatly increases the performance of TLB flushes when running
>>> with a high amount of vCPUs as a Xen guest, and is specially important
>>> when running in shim mode.
>>>
>>> The following figures are from a PV guest running `make -j32 xen` in
>>> shim mode with 32 vCPUs and HAP.
>>>
>>> Using x2APIC and ALLBUT shorthand:
>>> real    4m35.973s
>>> user    4m35.110s
>>> sys     36m24.117s
>>>
>>> Using L0 assisted flush:
>>> real    1m2.596s
>>> user    4m34.818s
>>> sys     5m16.374s
>>>
>>> The implementation adds a new hook to hypervisor_ops so other
>>> enlightenments can also implement such assisted flush just by filling
>>> the hook. Note that the Xen implementation completely ignores the
>>> dirty CPU mask and the linear address passed in, and always performs a
>>> global TLB flush on all vCPUs.
>>
>> This isn't because of an implementation choice of yours, but because
>> of how HVMOP_flush_tlbs works. I think the statement should somehow
>> express this. I also think it wants clarifying that using the
>> hypercall is indeed faster even in the case of single-page, single-
>> CPU flush
>
> Are you sure about this? I think taking a vmexit is going to be more
> costly than executing a local invlpg?

The answer to this was already ...

>> (which I suspect may not be the case especially as vCPU
>> count grows).

... here (or at least it was meant to address questions back like this
one).

>> The stats above prove a positive overall effect, but
>> they don't say whether the effect could be even bigger by being at
>> least a little selective.
>
> I assume that being able to provide a bitmap with the vCPUs on which
> the TLB flush should be performed would give us some more performance,
> but I haven't looked into it yet.
>
>>> @@ -73,6 +74,15 @@ void __init hypervisor_e820_fixup(struct e820map *e820)
>>>          ops.e820_fixup(e820);
>>>  }
>>>
>>> +int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
>>> +                         unsigned int order)
>>> +{
>>> +    if ( ops.flush_tlb )
>>> +        return alternative_call(ops.flush_tlb, mask, va, order);
>>> +
>>> +    return -ENOSYS;
>>> +}
>>
>> Please no new -ENOSYS anywhere (except in new ports' top level
>> hypercall handlers).
>
> Ack. Is EOPNOTSUPP OK?

Sure.

>>> @@ -256,6 +257,16 @@ void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags)
>>>      if ( (flags & ~FLUSH_ORDER_MASK) &&
>>>           !cpumask_subset(mask, cpumask_of(cpu)) )
>>>      {
>>> +        if ( cpu_has_hypervisor &&
>>> +             !(flags & ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_VA_VALID |
>>> +                         FLUSH_ORDER_MASK)) &&
>>> +             !hypervisor_flush_tlb(mask, va, (flags - 1) & FLUSH_ORDER_MASK) )
>>> +        {
>>> +            if ( tlb_clk_enabled )
>>> +                tlb_clk_enabled = false;
>>
>> Why does this need doing here? Couldn't Xen guest setup code
>> clear the flag?
>
> Because it's possible that the hypercall fails, and hence the tlb
> clock should be kept enabled. There's no reason to disable it until
> Xen knows the assisted flush is indeed available.
>
> I don't mind moving it to Xen guest setup code, but I'm not sure I see
> why it would be any better than doing it here. The only reason I guess
> is to avoid checking tlb_clk_enabled on every successful assisted
> flush?

Well, iirc there had already been a question on why the if() here is
needed. I second the reason, but the whole construct looks misplaced,
i.e. is prone to cause further questions down the road. I think if it
was to stay here, a comment would be needed to address any such
possible questions. But I still think it should live in the init code.
This isn't a feature that simply lacks a CPUID bit (or alike) and hence
needs probing (unless of course we expect people to want to put a
modern Xen [shim] on top of a pre-3.<whatever> Xen; and even then you
could probe the underlying hypercall once at boot time).

Jan
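A minimal sketch of the shape being suggested here: probe
HVMOP_flush_tlbs once from the Xen guest init path and clear the TLB
clock there, leaving the runtime hook to just report availability.
flush_tlb(), xen_hypercall_hvm_op() and tlb_clk_enabled are names taken
from the patch; the probe helper, the assisted_flush_available flag and
the assumption that tlb_clk_enabled is reachable from xen.c (e.g. via
asm/flushtlb.h) are illustrative only, not part of this version of the
series:

static bool __read_mostly assisted_flush_available;

/* Would be called once from setup() during Xen guest initialisation. */
static void __init probe_assisted_flush(void)
{
    /* A single probe: NULL requests a global flush of all vCPUs at L0. */
    if ( xen_hypercall_hvm_op(HVMOP_flush_tlbs, NULL) )
        return;

    assisted_flush_available = true;
    /* With assisted flushes in use the TLB clock serves no purpose. */
    tlb_clk_enabled = false;
}

static int flush_tlb(const cpumask_t *mask, const void *va, unsigned int order)
{
    if ( !assisted_flush_available )
        return -EOPNOTSUPP;

    /* The hypercall ignores mask/va/order and flushes all vCPUs. */
    return xen_hypercall_hvm_op(HVMOP_flush_tlbs, NULL);
}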
diff --git a/xen/arch/x86/guest/hypervisor.c b/xen/arch/x86/guest/hypervisor.c
index 647cdb1367..47e938e287 100644
--- a/xen/arch/x86/guest/hypervisor.c
+++ b/xen/arch/x86/guest/hypervisor.c
@@ -18,6 +18,7 @@
  *
  * Copyright (c) 2019 Microsoft.
  */
+#include <xen/cpumask.h>
 #include <xen/init.h>
 #include <xen/types.h>
 
@@ -73,6 +74,15 @@ void __init hypervisor_e820_fixup(struct e820map *e820)
         ops.e820_fixup(e820);
 }
 
+int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
+                         unsigned int order)
+{
+    if ( ops.flush_tlb )
+        return alternative_call(ops.flush_tlb, mask, va, order);
+
+    return -ENOSYS;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/xen/arch/x86/guest/xen/xen.c b/xen/arch/x86/guest/xen/xen.c
index f151b07548..5d3427a713 100644
--- a/xen/arch/x86/guest/xen/xen.c
+++ b/xen/arch/x86/guest/xen/xen.c
@@ -324,12 +324,18 @@ static void __init e820_fixup(struct e820map *e820)
         pv_shim_fixup_e820(e820);
 }
 
+static int flush_tlb(const cpumask_t *mask, const void *va, unsigned int order)
+{
+    return xen_hypercall_hvm_op(HVMOP_flush_tlbs, NULL);
+}
+
 static const struct hypervisor_ops __initdata ops = {
     .name = "Xen",
     .setup = setup,
     .ap_setup = ap_setup,
     .resume = resume,
     .e820_fixup = e820_fixup,
+    .flush_tlb = flush_tlb,
 };
 
 const struct hypervisor_ops *__init xg_probe(void)
diff --git a/xen/arch/x86/smp.c b/xen/arch/x86/smp.c
index fac295fa6f..55d08c9d52 100644
--- a/xen/arch/x86/smp.c
+++ b/xen/arch/x86/smp.c
@@ -15,6 +15,7 @@
 #include <xen/perfc.h>
 #include <xen/spinlock.h>
 #include <asm/current.h>
+#include <asm/guest.h>
 #include <asm/smp.h>
 #include <asm/mc146818rtc.h>
 #include <asm/flushtlb.h>
@@ -256,6 +257,16 @@ void flush_area_mask(const cpumask_t *mask, const void *va, unsigned int flags)
     if ( (flags & ~FLUSH_ORDER_MASK) &&
          !cpumask_subset(mask, cpumask_of(cpu)) )
     {
+        if ( cpu_has_hypervisor &&
+             !(flags & ~(FLUSH_TLB | FLUSH_TLB_GLOBAL | FLUSH_VA_VALID |
+                         FLUSH_ORDER_MASK)) &&
+             !hypervisor_flush_tlb(mask, va, (flags - 1) & FLUSH_ORDER_MASK) )
+        {
+            if ( tlb_clk_enabled )
+                tlb_clk_enabled = false;
+            return;
+        }
+
         spin_lock(&flush_lock);
         cpumask_and(&flush_cpumask, mask, &cpu_online_map);
         cpumask_clear_cpu(cpu, &flush_cpumask);
diff --git a/xen/include/asm-x86/guest/hypervisor.h b/xen/include/asm-x86/guest/hypervisor.h
index ade10e74ea..432e57c2a0 100644
--- a/xen/include/asm-x86/guest/hypervisor.h
+++ b/xen/include/asm-x86/guest/hypervisor.h
@@ -19,6 +19,8 @@
 #ifndef __X86_HYPERVISOR_H__
 #define __X86_HYPERVISOR_H__
 
+#include <xen/cpumask.h>
+
 #include <asm/e820.h>
 
 struct hypervisor_ops {
@@ -32,6 +34,8 @@ struct hypervisor_ops {
     void (*resume)(void);
     /* Fix up e820 map */
     void (*e820_fixup)(struct e820map *e820);
+    /* L0 assisted TLB flush */
+    int (*flush_tlb)(const cpumask_t *mask, const void *va, unsigned int order);
 };
 
 #ifdef CONFIG_GUEST
@@ -41,6 +45,14 @@ void hypervisor_setup(void);
 int hypervisor_ap_setup(void);
 void hypervisor_resume(void);
 void hypervisor_e820_fixup(struct e820map *e820);
+/*
+ * L0 assisted TLB flush.
+ * mask: cpumask of the dirty vCPUs that should be flushed.
+ * va: linear address to flush, or NULL for global flushes.
+ * order: order of the linear address pointed by va.
+ */
+int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
+                         unsigned int order);
 
 #else
 
@@ -52,6 +64,11 @@ static inline void hypervisor_setup(void) { ASSERT_UNREACHABLE(); }
 static inline int hypervisor_ap_setup(void) { return 0; }
 static inline void hypervisor_resume(void) { ASSERT_UNREACHABLE(); }
 static inline void hypervisor_e820_fixup(struct e820map *e820) {}
+static inline int hypervisor_flush_tlb(const cpumask_t *mask, const void *va,
+                                       unsigned int order)
+{
+    return -ENOSYS;
+}
 
 #endif /* CONFIG_GUEST */
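As a side note on the flush_area_mask() hunk: the order handed to
hypervisor_flush_tlb() is recovered from the flags with
(flags - 1) & FLUSH_ORDER_MASK. A tiny standalone illustration of that
encoding follows; the FLUSH_* values are assumed for the example and
mirror asm/flushtlb.h, where FLUSH_ORDER(x) stores x + 1 so that even
an order-0 request leaves a non-zero value in the low bits:

#include <stdio.h>

#define FLUSH_ORDER_MASK 0xff
#define FLUSH_ORDER(x)   ((x) + 1)
#define FLUSH_TLB        0x100   /* bit position assumed for illustration */

int main(void)
{
    /* A flush of an order-2 (4-page) region, as a caller would request it. */
    unsigned int flags = FLUSH_TLB | FLUSH_ORDER(2);
    /* What the patch passes to hypervisor_flush_tlb(): the decoded order. */
    unsigned int order = (flags - 1) & FLUSH_ORDER_MASK;

    printf("flags = %#x, order = %u\n", flags, order); /* prints order = 2 */
    return 0;
}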