[39/60] x86: optimize loading of GDT at context switch


Message ID

20190528103313.1343-40-jgross@suse.com (mailing list archive)

State

New, archived

Headers

From: Juergen Gross <jgross@suse.com>
To: xen-devel@lists.xenproject.org
Date: Tue, 28 May 2019 12:32:52 +0200
Message-Id: <20190528103313.1343-40-jgross@suse.com>
In-Reply-To: <20190528103313.1343-1-jgross@suse.com>
References: <20190528103313.1343-1-jgross@suse.com>
Subject: [Xen-devel] [PATCH 39/60] x86: optimize loading of GDT at context
 switch
Precedence: list
Cc: Juergen Gross <jgross@suse.com>,
 Andrew Cooper <andrew.cooper3@citrix.com>, Wei Liu <wl@xen.org>,
 Jan Beulich <jbeulich@suse.com>,
 =?utf-8?q?Roger_Pau_Monn=C3=A9?= <roger.pau@citrix.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: xen-devel-bounces@lists.xenproject.org
Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org>

Series

xen: add core scheduling support

Commit Message

Jürgen Groß May 28, 2019, 10:32 a.m. UTC

Instead of dynamically decide whether the previous vcpu was using full
or default GDT just add a percpu variable for that purpose. This at
once removes the need for testing vcpu_ids to differ twice.

Cache the need_full_gdt(nd) value in a local variable.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
RFC V2: new patch (split from previous one)
V1: init percpu flag at cpu startup
    rename variable (Jan Beulich)
---
 xen/arch/x86/cpu/common.c  |  3 +++
 xen/arch/x86/domain.c      | 16 +++++++++++-----
 xen/include/asm-x86/desc.h |  1 +
 3 files changed, 15 insertions(+), 5 deletions(-)

Comments

Andrew Cooper July 2, 2019, 4:09 p.m. UTC | #1

On 28/05/2019 11:32, Juergen Gross wrote:
> Instead of dynamically decide whether the previous vcpu was using full

"deciding"

> or default GDT just add a percpu variable for that purpose. This at

"was using a full or default GDT, just add"

> once removes the need for testing vcpu_ids to differ twice.
>
> Cache the need_full_gdt(nd) value in a local variable.

What's the point of doing this?  I know the logic is rather complicated
in __context_switch(), but at least it is visually consistent.  After
this change, it is asymmetric and harder to follow.

>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
> ---
> RFC V2: new patch (split from previous one)
> V1: init percpu flag at cpu startup
>     rename variable (Jan Beulich)
> ---
>  xen/arch/x86/cpu/common.c  |  3 +++
>  xen/arch/x86/domain.c      | 16 +++++++++++-----
>  xen/include/asm-x86/desc.h |  1 +
>  3 files changed, 15 insertions(+), 5 deletions(-)
>
> diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
> index 33f5d32557..8b90356fe5 100644
> --- a/xen/arch/x86/cpu/common.c
> +++ b/xen/arch/x86/cpu/common.c
> @@ -49,6 +49,8 @@ unsigned int vaddr_bits __read_mostly = VADDR_BITS;
>  static unsigned int cleared_caps[NCAPINTS];
>  static unsigned int forced_caps[NCAPINTS];
>  
> +DEFINE_PER_CPU(bool, full_gdt_loaded);
> +
>  void __init setup_clear_cpu_cap(unsigned int cap)
>  {
>  	const uint32_t *dfs;
> @@ -745,6 +747,7 @@ void load_system_tables(void)
>  		offsetof(struct tss_struct, __cacheline_filler) - 1,
>  		SYS_DESC_tss_busy);
>  
> +        per_cpu(full_gdt_loaded, cpu) = false;

Indentation.  (Although I've got half a mind to do a blanket convert of
files like this to Xen style.  They are almost completely diverged from
their Linux heritage.)

~Andrew

Jürgen Groß July 3, 2019, 6:30 a.m. UTC | #2

On 02.07.19 18:09, Andrew Cooper wrote:
> On 28/05/2019 11:32, Juergen Gross wrote:
>> Instead of dynamically decide whether the previous vcpu was using full
> 
> "deciding"
> 
>> or default GDT just add a percpu variable for that purpose. This at
> 
> "was using a full or default GDT, just add"
> 
>> once removes the need for testing vcpu_ids to differ twice.
>>
>> Cache the need_full_gdt(nd) value in a local variable.
> 
> What's the point of doing this?  I know the logic is rather complicated
> in __context_switch(), but at least it is visually consistent.  After
> this change, it is asymmetric and harder to follow.

This is a hot path. need_full_gdt() needs two compares, of which one is
using evaluate_nospec().


Juergen

Andrew Cooper July 3, 2019, 12:21 p.m. UTC | #3

On 03/07/2019 07:30, Juergen Gross wrote:
> On 02.07.19 18:09, Andrew Cooper wrote:
>> On 28/05/2019 11:32, Juergen Gross wrote:
>>> Instead of dynamically decide whether the previous vcpu was using full
>>
>> "deciding"
>>
>>> or default GDT just add a percpu variable for that purpose. This at
>>
>> "was using a full or default GDT, just add"
>>
>>> once removes the need for testing vcpu_ids to differ twice.
>>>
>>> Cache the need_full_gdt(nd) value in a local variable.
>>
>> What's the point of doing this?  I know the logic is rather complicated
>> in __context_switch(), but at least it is visually consistent.  After
>> this change, it is asymmetric and harder to follow.
>
> This is a hot path. need_full_gdt() needs two compares, of which one is
> using evaluate_nospec().

Urgh.  So evaluate_nospec() is already broken here because
need_full_gdt() isn't always_inline, but surely this isn't the only
example impacted in __context_switch()?  The choice of 'gdt' is
similarly impacted by the looks of things.

I'd recommend not worrying about evaluate_nospec() for now.  There are
several fundamental problems atm, and Xen 4.13 cannot ship with it in
this state.

~Andrew

Jürgen Groß July 5, 2019, 7:30 a.m. UTC | #4

On 03.07.19 14:21, Andrew Cooper wrote:
> On 03/07/2019 07:30, Juergen Gross wrote:
>> On 02.07.19 18:09, Andrew Cooper wrote:
>>> On 28/05/2019 11:32, Juergen Gross wrote:
>>>> Instead of dynamically decide whether the previous vcpu was using full
>>>
>>> "deciding"
>>>
>>>> or default GDT just add a percpu variable for that purpose. This at
>>>
>>> "was using a full or default GDT, just add"
>>>
>>>> once removes the need for testing vcpu_ids to differ twice.
>>>>
>>>> Cache the need_full_gdt(nd) value in a local variable.
>>>
>>> What's the point of doing this?  I know the logic is rather complicated
>>> in __context_switch(), but at least it is visually consistent.  After
>>> this change, it is asymmetric and harder to follow.
>>
>> This is a hot path. need_full_gdt() needs two compares, of which one is
>> using evaluate_nospec().
> 
> Urgh.  So evaluate_nospec() is already broken here because
> need_full_gdt() isn't always_inline, but surely this isn't the only
> example impacted in __context_switch()?  The choice of 'gdt' is
> similarly impacted by the looks of things.
> 
> I'd recommend not worrying about evaluate_nospec() for now.  There are
> several fundamental problems atm, and Xen 4.13 cannot ship with it in
> this state.

I did a small performance test with this patch and then removed latching
of need_full_gdt(nd) in the local variable:

On an 8 cpu system I started 2 mini-os domains (1 vcpu each) doing a busy
loop sending events to dom0. On dom0 I did a build of the hypervisor via
"make -j 8" and measured the time for that build, then took the average
of 5 such builds (doing a make clean in between).

            elapsed    user  system
Unpatched:    66.51  232.93  109.21
latched:      64.82  232.33  109.18
unlatched:    63.39  231.81  107.49

As there is a small advantage for not latching I'll remove the full_gdt
local variable.


Juergen
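
The differences in the table above are easier to judge as relative
numbers. The following is only a sketch for doing that arithmetic; the
helper names reduction_percent() and print_row() are made up for
illustration, and the figures are taken as given (the thread does not
state their unit).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Relative reduction of `value` against `baseline`, in percent.
 * A positive result means `value` is smaller (faster) than `baseline`.
 */
static double reduction_percent(double baseline, double value)
{
    assert(baseline > 0.0);
    return (baseline - value) / baseline * 100.0;
}

/* Print one table row together with its reduction against the baseline. */
static void print_row(const char *label, double baseline, double value)
{
    printf("%-10s %7.2f  (%.2f%% below unpatched)\n",
           label, value, reduction_percent(baseline, value));
}
```

Calling print_row("latched:", 66.51, 64.82) and
print_row("unlatched:", 66.51, 63.39) gives roughly 2.5% and 4.7% for the
elapsed time. Whether such differences exceed run-to-run noise cannot be
derived from five-build averages alone.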

diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 33f5d32557..8b90356fe5 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -49,6 +49,8 @@  unsigned int vaddr_bits __read_mostly = VADDR_BITS;
 static unsigned int cleared_caps[NCAPINTS];
 static unsigned int forced_caps[NCAPINTS];
 
+DEFINE_PER_CPU(bool, full_gdt_loaded);
+
 void __init setup_clear_cpu_cap(unsigned int cap)
 {
 	const uint32_t *dfs;
@@ -745,6 +747,7 @@  void load_system_tables(void)
 		offsetof(struct tss_struct, __cacheline_filler) - 1,
 		SYS_DESC_tss_busy);
 
+        per_cpu(full_gdt_loaded, cpu) = false;
 	lgdt(&gdtr);
 	lidt(&idtr);
 	ltr(TSS_ENTRY << 3);
diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index adc06154ee..98d2939daf 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1645,6 +1645,8 @@  static inline void load_full_gdt(const struct vcpu *v, unsigned int cpu)
     };
 
     lgdt(&gdt_desc);
+
+    per_cpu(full_gdt_loaded, cpu) = true;
 }
 
 static inline void load_default_gdt(const seg_desc_t *gdt, unsigned int cpu)
@@ -1655,6 +1657,8 @@  static inline void load_default_gdt(const seg_desc_t *gdt, unsigned int cpu)
     };
 
     lgdt(&gdt_desc);
+
+    per_cpu(full_gdt_loaded, cpu) = false;
 }
 
 static void __context_switch(void)
@@ -1665,6 +1669,7 @@  static void __context_switch(void)
     struct vcpu          *n = current;
     struct domain        *pd = p->domain, *nd = n->domain;
     seg_desc_t           *gdt;
+    bool                  full_gdt;
 
     ASSERT(p != n);
     ASSERT(!vcpu_cpu_dirty(n));
@@ -1707,11 +1712,13 @@  static void __context_switch(void)
     gdt = !is_pv_32bit_domain(nd) ? per_cpu(gdt_table, cpu) :
                                     per_cpu(compat_gdt_table, cpu);
 
-    if ( need_full_gdt(nd) )
+    full_gdt = need_full_gdt(nd);
+
+    if ( full_gdt )
         write_full_gdt_ptes(gdt, n);
 
-    if ( need_full_gdt(pd) &&
-         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(nd)) )
+    if ( per_cpu(full_gdt_loaded, cpu) &&
+         ((p->vcpu_id != n->vcpu_id) || !full_gdt) )
         load_default_gdt(gdt, cpu);
 
     write_ptbase(n);
@@ -1723,8 +1730,7 @@  static void __context_switch(void)
         svm_load_segs(0, 0, 0, 0, 0, 0, 0);
 #endif
 
-    if ( need_full_gdt(nd) &&
-         ((p->vcpu_id != n->vcpu_id) || !need_full_gdt(pd)) )
+    if ( full_gdt && !per_cpu(full_gdt_loaded, cpu) )
         load_full_gdt(n, cpu);
 
     if ( pd != nd )
diff --git a/xen/include/asm-x86/desc.h b/xen/include/asm-x86/desc.h
index 85e83bcefb..ff9ac5f15d 100644
--- a/xen/include/asm-x86/desc.h
+++ b/xen/include/asm-x86/desc.h
@@ -208,6 +208,7 @@  extern seg_desc_t boot_cpu_gdt_table[];
 DECLARE_PER_CPU(seg_desc_t *, gdt_table);
 extern seg_desc_t boot_cpu_compat_gdt_table[];
 DECLARE_PER_CPU(seg_desc_t *, compat_gdt_table);
+DECLARE_PER_CPU(bool, full_gdt_loaded);
 
 extern void load_TR(void);
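
The rewritten conditions in __context_switch() can be compared with the
old ones in a small standalone model. This is only a sketch, not Xen
code: the array full_gdt_loaded[] stands in for the per-CPU variable,
the booleans prev_full and next_full stand in for need_full_gdt(pd) and
need_full_gdt(nd), and the names NR_CPUS_SKETCH, struct gdt_loads,
switch_old(), switch_new() and count_mismatches() are made up for
illustration. It assumes the flag equals need_full_gdt(pd) on entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per_cpu(full_gdt_loaded, cpu). */
#define NR_CPUS_SKETCH 1
static bool full_gdt_loaded[NR_CPUS_SKETCH];

/* Which GDT loads one modelled context switch would issue. */
struct gdt_loads {
    bool loaded_default;
    bool loaded_full;
};

/* Old logic: both decisions re-evaluate the state of both domains. */
static struct gdt_loads switch_old(bool prev_full, bool next_full,
                                   bool same_vcpu_id)
{
    struct gdt_loads r = { false, false };

    if ( prev_full && (!same_vcpu_id || !next_full) )
        r.loaded_default = true;

    if ( next_full && (!same_vcpu_id || !prev_full) )
        r.loaded_full = true;

    return r;
}

/* New logic: consult the flag, and update it on every modelled load. */
static struct gdt_loads switch_new(unsigned int cpu, bool next_full,
                                   bool same_vcpu_id)
{
    struct gdt_loads r = { false, false };

    assert(cpu < NR_CPUS_SKETCH);

    if ( full_gdt_loaded[cpu] && (!same_vcpu_id || !next_full) )
    {
        r.loaded_default = true;
        full_gdt_loaded[cpu] = false;  /* mirrors load_default_gdt() */
    }

    if ( next_full && !full_gdt_loaded[cpu] )
    {
        r.loaded_full = true;
        full_gdt_loaded[cpu] = true;   /* mirrors load_full_gdt() */
    }

    return r;
}

/* Try all 8 input combinations; return how many differ or leave a stale flag. */
static int count_mismatches(void)
{
    int mismatches = 0;

    for ( int i = 0; i < 8; i++ )
    {
        bool prev_full = i & 1, next_full = i & 2, same_vcpu_id = i & 4;
        struct gdt_loads o, n;

        /* Assumption: on entry the flag reflects the previous vcpu. */
        full_gdt_loaded[0] = prev_full;

        o = switch_old(prev_full, next_full, same_vcpu_id);
        n = switch_new(0, next_full, same_vcpu_id);

        if ( o.loaded_default != n.loaded_default ||
             o.loaded_full != n.loaded_full )
            mismatches++;

        /* Afterwards the flag should describe the next vcpu. */
        if ( full_gdt_loaded[0] != next_full )
            mismatches++;
    }

    return mismatches;
}
```

A return value of 0 from count_mismatches() would indicate that, under
the stated assumption, both variants issue the same loads in every
combination and that the flag ends up describing the incoming vcpu. The
model says nothing about the cost of the per-CPU access itself.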
