diff mbox series

[39/60] x86: optimize loading of GDT at context switch

Message ID 20190528103313.1343-40-jgross@suse.com (mailing list archive)
State New, archived
Headers show
Series xen: add core scheduling support | expand

Commit Message

Jürgen Groß May 28, 2019, 10:32 a.m. UTC
Instead of dynamically decide whether the previous vcpu was using full
or default GDT just add a percpu variable for that purpose. This at
once removes the need for testing vcpu_ids to differ twice.

Cache the need_full_gdt(nd) value in a local variable.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
RFC V2: new patch (split from previous one)
V1: init percpu flag at cpu startup
    rename variable (Jan Beulich)
---
 xen/arch/x86/cpu/common.c  |  3 +++
 xen/arch/x86/domain.c      | 16 +++++++++++-----
 xen/include/asm-x86/desc.h |  1 +
 3 files changed, 15 insertions(+), 5 deletions(-)

Comments

Andrew Cooper July 2, 2019, 4:09 p.m. UTC | #1
On 28/05/2019 11:32, Juergen Gross wrote:
> Instead of dynamically decide whether the previous vcpu was using full

"deciding"

> or default GDT just add a percpu variable for that purpose. This at

"was using a full or default GDT, just add"

> once removes the need for testing vcpu_ids to differ twice.
>
> Cache the need_full_gdt(nd) value in a local variable.

What's the point of doing this?  I know the logic is rather complicated
in __context_switch(), but at least it is visually consistent.  After
this change, it is asymmetric and harder to follow.

>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC V2: new patch (split from previous one)
> V1: init percpu flag at cpu startup
>     rename variable (Jan Beulich)
> ---
>  xen/arch/x86/cpu/common.c  |  3 +++
>  xen/arch/x86/domain.c      | 16 +++++++++++-----
>  xen/include/asm-x86/desc.h |  1 +
>  3 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
> index 33f5d32557..8b90356fe5 100644
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -49,6 +49,8 @@ unsigned int vaddr_bits __read_mostly = VADDR_BITS;
>  static unsigned int cleared_caps[NCAPINTS];
>  static unsigned int forced_caps[NCAPINTS];
>  
> +DEFINE_PER_CPU(bool, full_gdt_loaded);
> +
>  void __init setup_clear_cpu_cap(unsigned int cap)
>  {
>  	const uint32_t *dfs;
> @@ -745,6 +747,7 @@ void load_system_tables(void)
>  		offsetof(struct tss_struct, __cacheline_filler) - 1,
>  		SYS_DESC_tss_busy);
>  
> +        per_cpu(full_gdt_loaded, cpu) = false;

Indentation.  (Although I've got half a mind to do a blanket convert of
files like this to Xen style.  They are almost completely diverged from
their Linux heritage.)

~Andrew
Jürgen Groß July 3, 2019, 6:30 a.m. UTC | #2
On 02.07.19 18:09, Andrew Cooper wrote:
> On 28/05/2019 11:32, Juergen Gross wrote:
>> Instead of dynamically decide whether the previous vcpu was using full
> 
> "deciding"
> 
>> or default GDT just add a percpu variable for that purpose. This at
> 
> "was using a full or default GDT, just add"
> 
>> once removes the need for testing vcpu_ids to differ twice.
>>
>> Cache the need_full_gdt(nd) value in a local variable.
> 
> What's the point of doing this?  I know the logic is rather complicated
> in __context_switch(), but at least it is visually consistent.  After
> this change, it is asymmetric and harder to follow.

This is a hot path. need_full_gdt() needs two compares, of which one is
using evaluate_nospec().


Juergen
Andrew Cooper July 3, 2019, 12:21 p.m. UTC | #3
On 03/07/2019 07:30, Juergen Gross wrote:
> On 02.07.19 18:09, Andrew Cooper wrote:
>> On 28/05/2019 11:32, Juergen Gross wrote:
>>> Instead of dynamically decide whether the previous vcpu was using full
>>
>> "deciding"
>>
>>> or default GDT just add a percpu variable for that purpose. This at
>>
>> "was using a full or default GDT, just add"
>>
>>> once removes the need for testing vcpu_ids to differ twice.
>>>
>>> Cache the need_full_gdt(nd) value in a local variable.
>>
>> What's the point of doing this?  I know the logic is rather complicated
>> in __context_switch(), but at least it is visually consistent.  After
>> this change, it is asymmetric and harder to follow.
>
> This is a hot path. need_full_gdt() needs two compares, of which one is
> using evaluate_nospec().

Urgh.  So evalute_nospec() is already broken here because
need_full_gdt() isn't always_inline, but surely this isn't the only
example impacted in __context_switch()?  The choice of 'gdt' is
similarly impacted by the looks of things.

I'd recommend not worrying about evalute_nospec() for now.  There are
several fundamental problems atm, and Xen 4.13 cannot ship with it in
this state.

~Andrew
Jürgen Groß July 5, 2019, 7:30 a.m. UTC | #4
On 03.07.19 14:21, Andrew Cooper wrote:
> On 03/07/2019 07:30, Juergen Gross wrote:
>> On 02.07.19 18:09, Andrew Cooper wrote:
>>> On 28/05/2019 11:32, Juergen Gross wrote:
>>>> Instead of dynamically decide whether the previous vcpu was using full
>>>
>>> "deciding"
>>>
>>>> or default GDT just add a percpu variable for that purpose. This at
>>>
>>> "was using a full or default GDT, just add"
>>>
>>>> once removes the need for testing vcpu_ids to differ twice.
>>>>
>>>> Cache the need_full_gdt(nd) value in a local variable.
>>>
>>> What's the point of doing this?  I know the logic is rather complicated
>>> in __context_switch(), but at least it is visually consistent.  After
>>> this change, it is asymmetric and harder to follow.
>>
>> This is a hot path. need_full_gdt() needs two compares, of which one is
>> using evaluate_nospec().
> 
> Urgh.  So evalute_nospec() is already broken here because
> need_full_gdt() isn't always_inline, but surely this isn't the only
> example impacted in __context_switch()?  The choice of 'gdt' is
> similarly impacted by the looks of things.
> 
> I'd recommend not worrying about evalute_nospec() for now.  There are
> several fundamental problems atm, and Xen 4.13 cannot ship with it in
> this state.

I did a small performance test with this patch and then removed latching
of need_full_gdt(nd) in the local variable:

On a 8 cpu system I started 2 mini-os domains (1 vcpu each) doing a busy
loop sending events to dom0. On dom0 I did a build of the hypervisor via
"make -j 8" and measured the time for that build, then took the average
of 5 such builds (doing a make clean in between).

            elapsed  user   system
Unpatched:  66.51  232.93  109.21
latched:    64.82  232.33  109.18
unlatched:  63.39  231.81  107.49

As there is a small advantage for not latching I'll remove the full_gdt
local variable.


Juergen
diff mbox series

Patch

diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 33f5d32557..8b90356fe5 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -49,6 +49,8 @@  unsigned int vaddr_bits __read_mostly = VADDR_BITS;
 static unsigned int cleared_caps[NCAPINTS];
 static unsigned int forced_caps[NCAPINTS];
 
+DEFINE_PER_CPU(bool, full_gdt_loaded);
+
 void __init setup_clear_cpu_cap(unsigned int cap)
 {
 	const uint32_t *dfs;
@@ -745,6 +747,7 @@  void load_system_tables(void)
 		offsetof(struct tss_struct, __cacheline_filler) - 1,
 		SYS_DESC_tss_busy);
 
+        per_cpu(full_gdt_loaded, cpu) = false;
 	lgdt(&gdtr);
 	lidt(&idtr);
 	ltr(TSS_ENTRY << 3);
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index adc06154ee..98d2939daf 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1645,6 +1645,8 @@  static inline void load_full_gdt(const struct vcpu *v, unsigned int cpu)
     };
 
     lgdt(&gdt_desc);
+
+    per_cpu(full_gdt_loaded, cpu) = true;
 }
 
 static inline void load_default_gdt(const seg_desc_t *gdt, unsigned int cpu)
@@ -1655,6 +1657,8 @@  static inline void load_default_gdt(const seg_desc_t *gdt, unsigned int cpu)
     };
 
     lgdt(&gdt_desc);
+
+    per_cpu(full_gdt_loaded, cpu) = false;
 }
 
 static void __context_switch(void)
@@ -1665,6 +1669,7 @@  static void __context_switch(void)
     struct vcpu          *n = current;
     struct domain        *pd = p->domain, *nd = n->domain;
     seg_desc_t           *gdt;
+    bool                  full_gdt;
 
     ASSERT(p != n);
     ASSERT(!vcpu_cpu_dirty(n));
@@ -1707,11 +1712,13 @@  static void __context_switch(void)
     gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
                                     per_cpu(compat_gdt_table, cpu);
 
-    if ( need_full_gdt(nd) )
+    full_gdt = need_full_gdt(nd);
+
+    if ( full_gdt )
         write_full_gdt_ptes(gdt, n);
 
-    if ( need_full_gdt(pd) &&
-         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
+    if ( per_cpu(full_gdt_loaded, cpu) &&
+         ((p->vcpu_id != n->vcpu_id) || !full_gdt) )
         load_default_gdt(gdt, cpu);
 
     write_ptbase(n);
@@ -1723,8 +1730,7 @@  static void __context_switch(void)
         svm_load_segs(0, 0, 0, 0, 0, 0, 0);
 #endif
 
-    if ( need_full_gdt(nd) &&
-         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
+    if ( full_gdt && !per_cpu(full_gdt_loaded, cpu) )
         load_full_gdt(n, cpu);
 
     if ( pd != nd )
diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
index 85e83bcefb..ff9ac5f15d 100644
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -208,6 +208,7 @@  extern seg_desc_t boot_cpu_gdt_table[];
 DECLARE_PER_CPU(seg_desc_t *, gdt_table);
 extern seg_desc_t boot_cpu_compat_gdt_table[];
 DECLARE_PER_CPU(seg_desc_t *, compat_gdt_table);
+DECLARE_PER_CPU(bool, full_gdt_loaded);
 
 extern void load_TR(void);