diff mbox series

[4/5] x86/pv: Optimise to the segment context switching paths

Message ID 20200909095920.25495-5-andrew.cooper3@citrix.com
State New
Headers show
Series x86/pv: Minor perf improvements in segment handling | expand

Commit Message

Andrew Cooper Sept. 9, 2020, 9:59 a.m. UTC
Save the segment selectors with explicit asm, rather than with read_sreg().
This permits the use of MOV's m16 encoding, which avoids indirecting the
selector value through a register.

For {save,load}_segments(), opencode the fs/gs helpers, as the optimiser is
unable to rearrange the logic down to a single X86_CR4_FSGSBASE check.  This
removes several jumps and creates bigger basic blocks.

In load_segments(), optimise GS base handling substantially.  The call to
svm_load_segs() already needs gsb/gss the correct way around, so hoist the
logic for the later path to use it as well.  Swapping the inputs in GPRs is
far more efficient than using SWAPGS.

Previously, there was optionally one SWAPGS from the user/kernel mode check,
two SWAPGS's in write_gs_shadow() and two WRGSBASE's.  Updates to GS (4 or 5
here) in quick succession stall all contemporary pipelines repeatedly.  (Intel
Core/Xeon pipelines have segment register renaming[1], so can continue to
speculatively execute with one GS update in flight.  Other pipelines cannot
have two updates in flight concurrently, and must stall dispatch of the second
until the first has retired.)

Rewrite the logic to have exactly two WRGSBASEs and one SWAPGS, which removes
two stalles all contemporary processors.

Although modest, the resulting delta is:

  add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-106 (-106)
  Function                                     old     new   delta
  paravirt_ctxt_switch_from                    235     198     -37
  context_switch                              3582    3513     -69

in a common path.

[1] https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-speculative-behavior-swapgs-and-segment-registers

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/arch/x86/domain.c | 70 +++++++++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 21 deletions(-)

Comments

Jan Beulich Sept. 11, 2020, 9:49 a.m. UTC | #1
On 09.09.2020 11:59, Andrew Cooper wrote:
> Save the segment selectors with explicit asm, rather than with read_sreg().
> This permits the use of MOV's m16 encoding, which avoids indirecting the
> selector value through a register.

Instead of this, how about making read_sreg() look like

#define read_sreg(val, name) ({                                  \
    if ( sizeof(val) == 2 )                                      \
        asm volatile ( "mov %%" STR(name) ",%0" : "=m" (val) );  \
    else                                                         \
        asm volatile ( "mov %%" STR(name) ",%k0" : "=r" (val) ); \
})

which then also covers read_registers()? I have a full patch
ready to send, if you agree.

> For {save,load}_segments(), opencode the fs/gs helpers, as the optimiser is
> unable to rearrange the logic down to a single X86_CR4_FSGSBASE check.  This
> removes several jumps and creates bigger basic blocks.
> 
> In load_segments(), optimise GS base handling substantially.  The call to
> svm_load_segs() already needs gsb/gss the correct way around, so hoist the
> logic for the later path to use it as well.  Swapping the inputs in GPRs is
> far more efficient than using SWAPGS.
> 
> Previously, there was optionally one SWAPGS from the user/kernel mode check,
> two SWAPGS's in write_gs_shadow() and two WRGSBASE's.  Updates to GS (4 or 5
> here) in quick succession stall all contemporary pipelines repeatedly.  (Intel
> Core/Xeon pipelines have segment register renaming[1], so can continue to
> speculatively execute with one GS update in flight.  Other pipelines cannot
> have two updates in flight concurrently, and must stall dispatch of the second
> until the first has retired.)
> 
> Rewrite the logic to have exactly two WRGSBASEs and one SWAPGS, which removes
> two stalles all contemporary processors.
> 
> Although modest, the resulting delta is:
> 
>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-106 (-106)
>   Function                                     old     new   delta
>   paravirt_ctxt_switch_from                    235     198     -37
>   context_switch                              3582    3513     -69
> 
> in a common path.
> 
> [1] https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-speculative-behavior-swapgs-and-segment-registers
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

Preferably re-based onto said change, and maybe with another adjustment
(see below)
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> @@ -1556,18 +1557,24 @@ static void load_segments(struct vcpu *n)
>                     : [ok] "+r" (all_segs_okay)          \
>                     : [_val] "rm" (val) )
>  
> -#ifdef CONFIG_HVM
> -    if ( cpu_has_svm && !compat && (uregs->fs | uregs->gs) <= 3 )
> +    if ( !compat )
>      {
> -        unsigned long gsb = n->arch.flags & TF_kernel_mode
> -            ? n->arch.pv.gs_base_kernel : n->arch.pv.gs_base_user;
> -        unsigned long gss = n->arch.flags & TF_kernel_mode
> -            ? n->arch.pv.gs_base_user : n->arch.pv.gs_base_kernel;
> +        gsb = n->arch.pv.gs_base_kernel;
> +        gss = n->arch.pv.gs_base_user;
> +
> +        /*
> +         * Figure out which way around gsb/gss want to be.  gsb needs to be
> +         * the active context, and gss needs to be the inactive context.
> +         */
> +        if ( !(n->arch.flags & TF_kernel_mode) )
> +            SWAP(gsb, gss);
>  
> -        fs_gs_done = svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
> -                                   n->arch.pv.fs_base, gsb, gss);
> +        if ( IS_ENABLED(CONFIG_HVM) && cpu_has_svm &&

The change from #ifdef to IS_ENABLED() wants mirroring to the
prefetching site imo. I wonder though whether the adjustment is a
good idea: The declaration lives in svm.h, and I would view it as
quite reasonable for that header to not get included in !HVM builds
(there may be a lot of disentangling to do to get there, but still).

Jan
Andrew Cooper Sept. 11, 2020, 12:53 p.m. UTC | #2
On 11/09/2020 10:49, Jan Beulich wrote:
> On 09.09.2020 11:59, Andrew Cooper wrote:
>> Save the segment selectors with explicit asm, rather than with read_sreg().
>> This permits the use of MOV's m16 encoding, which avoids indirecting the
>> selector value through a register.
> Instead of this, how about making read_sreg() look like
>
> #define read_sreg(val, name) ({                                  \
>     if ( sizeof(val) == 2 )                                      \
>         asm volatile ( "mov %%" STR(name) ",%0" : "=m" (val) );  \
>     else                                                         \
>         asm volatile ( "mov %%" STR(name) ",%k0" : "=r" (val) ); \
> })
>
> which then also covers read_registers()? I have a full patch
> ready to send, if you agree.

That will go wrong for

uint16_t ds; read_sreg(ds, ds);

as it will force the variable to be spilled onto the stack.

I don't think this is a clever move.


Furthermore, it is bad enough that read_sreg() already takes one magic
parameter which doesn't follow normal C rules - renaming to READ_SREG()
would be an improvement - but this is now adding a second which is a
capture by name.

I'm afraid that is a firm no from me.


There is one other place where we read all segment registers at once. 
Maybe having a static inline save_sregs(struct cpu_user_regs *) hiding
the asm block, but I'm not necessarily convinced of this plan either. 
At least it would cleanly separate the "I've obviously got a memory
operand" and "I almost certainly want it in a register anyway" logic.

>
>> For {save,load}_segments(), opencode the fs/gs helpers, as the optimiser is
>> unable to rearrange the logic down to a single X86_CR4_FSGSBASE check.  This
>> removes several jumps and creates bigger basic blocks.
>>
>> In load_segments(), optimise GS base handling substantially.  The call to
>> svm_load_segs() already needs gsb/gss the correct way around, so hoist the
>> logic for the later path to use it as well.  Swapping the inputs in GPRs is
>> far more efficient than using SWAPGS.
>>
>> Previously, there was optionally one SWAPGS from the user/kernel mode check,
>> two SWAPGS's in write_gs_shadow() and two WRGSBASE's.  Updates to GS (4 or 5
>> here) in quick succession stall all contemporary pipelines repeatedly.  (Intel
>> Core/Xeon pipelines have segment register renaming[1], so can continue to
>> speculatively execute with one GS update in flight.  Other pipelines cannot
>> have two updates in flight concurrently, and must stall dispatch of the second
>> until the first has retired.)
>>
>> Rewrite the logic to have exactly two WRGSBASEs and one SWAPGS, which removes
>> two stalles all contemporary processors.
>>
>> Although modest, the resulting delta is:
>>
>>   add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-106 (-106)
>>   Function                                     old     new   delta
>>   paravirt_ctxt_switch_from                    235     198     -37
>>   context_switch                              3582    3513     -69
>>
>> in a common path.
>>
>> [1] https://software.intel.com/security-software-guidance/insights/deep-dive-intel-analysis-speculative-behavior-swapgs-and-segment-registers
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Preferably re-based onto said change, and maybe with another adjustment
> (see below)
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
>> @@ -1556,18 +1557,24 @@ static void load_segments(struct vcpu *n)
>>                     : [ok] "+r" (all_segs_okay)          \
>>                     : [_val] "rm" (val) )
>>  
>> -#ifdef CONFIG_HVM
>> -    if ( cpu_has_svm && !compat && (uregs->fs | uregs->gs) <= 3 )
>> +    if ( !compat )
>>      {
>> -        unsigned long gsb = n->arch.flags & TF_kernel_mode
>> -            ? n->arch.pv.gs_base_kernel : n->arch.pv.gs_base_user;
>> -        unsigned long gss = n->arch.flags & TF_kernel_mode
>> -            ? n->arch.pv.gs_base_user : n->arch.pv.gs_base_kernel;
>> +        gsb = n->arch.pv.gs_base_kernel;
>> +        gss = n->arch.pv.gs_base_user;
>> +
>> +        /*
>> +         * Figure out which way around gsb/gss want to be.  gsb needs to be
>> +         * the active context, and gss needs to be the inactive context.
>> +         */
>> +        if ( !(n->arch.flags & TF_kernel_mode) )
>> +            SWAP(gsb, gss);
>>  
>> -        fs_gs_done = svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
>> -                                   n->arch.pv.fs_base, gsb, gss);
>> +        if ( IS_ENABLED(CONFIG_HVM) && cpu_has_svm &&
> The change from #ifdef to IS_ENABLED() wants mirroring to the
> prefetching site imo. I wonder though whether the adjustment is a
> good idea: The declaration lives in svm.h, and I would view it as
> quite reasonable for that header to not get included in !HVM builds
> (there may be a lot of disentangling to do to get there, but still).

I'm not overly fussed, but there will absolutely have to be HVM stubs
for normal code.  I don't see why we should special case svm_load_segs()
to not have a stub.

~Andrew
Jan Beulich Sept. 11, 2020, 1:28 p.m. UTC | #3
On 11.09.2020 14:53, Andrew Cooper wrote:
> On 11/09/2020 10:49, Jan Beulich wrote:
>> On 09.09.2020 11:59, Andrew Cooper wrote:
>>> Save the segment selectors with explicit asm, rather than with read_sreg().
>>> This permits the use of MOV's m16 encoding, which avoids indirecting the
>>> selector value through a register.
>> Instead of this, how about making read_sreg() look like
>>
>> #define read_sreg(val, name) ({                                  \
>>     if ( sizeof(val) == 2 )                                      \
>>         asm volatile ( "mov %%" STR(name) ",%0" : "=m" (val) );  \
>>     else                                                         \
>>         asm volatile ( "mov %%" STR(name) ",%k0" : "=r" (val) ); \
>> })
>>
>> which then also covers read_registers()? I have a full patch
>> ready to send, if you agree.
> 
> That will go wrong for
> 
> uint16_t ds; read_sreg(ds, ds);
> 
> as it will force the variable to be spilled onto the stack.


Let me quote the main part of the description of the patch then:

"Under the assumption that standalone variables preferably wouldn't be
 of uint16_t (or unsigned short) type, build in a heuristic to do a
 store to memory when the output expression is two bytes wide. In the
 register variant, add a 'k' modifier to avoid assemblers possibly
 generating operand size of REX prefixes."

A local variable has no reason to be uint16_t; nowadays even
./CODING_STYLE says so.

> I don't think this is a clever move.
> 
> 
> Furthermore, it is bad enough that read_sreg() already takes one magic
> parameter which doesn't follow normal C rules - renaming to READ_SREG()
> would be an improvement - but this is now adding a second which is a
> capture by name.

I was expecting this to be a possible objection from you. I wouldn't
mind upper-casing the name, but ...

> I'm afraid that is a firm no from me.

... looks like you prefer the open-coding, while I'd like to avoid it
whenever reasonably possible.

> There is one other place where we read all segment registers at once. 
> Maybe having a static inline save_sregs(struct cpu_user_regs *) hiding
> the asm block, but I'm not necessarily convinced of this plan either. 
> At least it would cleanly separate the "I've obviously got a memory
> operand" and "I almost certainly want it in a register anyway" logic.

I could live with this as a compromise.

>>> @@ -1556,18 +1557,24 @@ static void load_segments(struct vcpu *n)
>>>                     : [ok] "+r" (all_segs_okay)          \
>>>                     : [_val] "rm" (val) )
>>>  
>>> -#ifdef CONFIG_HVM
>>> -    if ( cpu_has_svm && !compat && (uregs->fs | uregs->gs) <= 3 )
>>> +    if ( !compat )
>>>      {
>>> -        unsigned long gsb = n->arch.flags & TF_kernel_mode
>>> -            ? n->arch.pv.gs_base_kernel : n->arch.pv.gs_base_user;
>>> -        unsigned long gss = n->arch.flags & TF_kernel_mode
>>> -            ? n->arch.pv.gs_base_user : n->arch.pv.gs_base_kernel;
>>> +        gsb = n->arch.pv.gs_base_kernel;
>>> +        gss = n->arch.pv.gs_base_user;
>>> +
>>> +        /*
>>> +         * Figure out which way around gsb/gss want to be.  gsb needs to be
>>> +         * the active context, and gss needs to be the inactive context.
>>> +         */
>>> +        if ( !(n->arch.flags & TF_kernel_mode) )
>>> +            SWAP(gsb, gss);
>>>  
>>> -        fs_gs_done = svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
>>> -                                   n->arch.pv.fs_base, gsb, gss);
>>> +        if ( IS_ENABLED(CONFIG_HVM) && cpu_has_svm &&
>> The change from #ifdef to IS_ENABLED() wants mirroring to the
>> prefetching site imo. I wonder though whether the adjustment is a
>> good idea: The declaration lives in svm.h, and I would view it as
>> quite reasonable for that header to not get included in !HVM builds
>> (there may be a lot of disentangling to do to get there, but still).
> 
> I'm not overly fussed, but there will absolutely have to be HVM stubs
> for normal code.  I don't see why we should special case svm_load_segs()
> to not have a stub.

I don't see why they "absolutely" have to exist. With our relying on DCE
I don't think there'll need to be ones consistently for _every_ HVM
function used from more generic code. I also don't view this as "special
casing" - there are already various functions not having stubs, but
merely declarations. The special thing here is that, by their placement,
these declarations may not be in scope long term. Which is because we're
deliberately breaking proper layering here by using a HVM feature for PV.

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 0b0e3f8294..ab71b9f79c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1537,6 +1537,7 @@  arch_do_vcpu_op(
 static void load_segments(struct vcpu *n)
 {
     struct cpu_user_regs *uregs = &n->arch.user_regs;
+    unsigned long gsb = 0, gss = 0;
     bool compat = is_pv_32bit_vcpu(n);
     bool all_segs_okay = true, fs_gs_done = false;
 
@@ -1556,18 +1557,24 @@  static void load_segments(struct vcpu *n)
                    : [ok] "+r" (all_segs_okay)          \
                    : [_val] "rm" (val) )
 
-#ifdef CONFIG_HVM
-    if ( cpu_has_svm && !compat && (uregs->fs | uregs->gs) <= 3 )
+    if ( !compat )
     {
-        unsigned long gsb = n->arch.flags & TF_kernel_mode
-            ? n->arch.pv.gs_base_kernel : n->arch.pv.gs_base_user;
-        unsigned long gss = n->arch.flags & TF_kernel_mode
-            ? n->arch.pv.gs_base_user : n->arch.pv.gs_base_kernel;
+        gsb = n->arch.pv.gs_base_kernel;
+        gss = n->arch.pv.gs_base_user;
+
+        /*
+         * Figure out which way around gsb/gss want to be.  gsb needs to be
+         * the active context, and gss needs to be the inactive context.
+         */
+        if ( !(n->arch.flags & TF_kernel_mode) )
+            SWAP(gsb, gss);
 
-        fs_gs_done = svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
-                                   n->arch.pv.fs_base, gsb, gss);
+        if ( IS_ENABLED(CONFIG_HVM) && cpu_has_svm &&
+             (uregs->fs | uregs->gs) <= 3 )
+            fs_gs_done = svm_load_segs(n->arch.pv.ldt_ents, LDT_VIRT_START(n),
+                                       n->arch.pv.fs_base, gsb, gss);
     }
-#endif
+
     if ( !fs_gs_done )
     {
         load_LDT(n);
@@ -1581,13 +1588,19 @@  static void load_segments(struct vcpu *n)
 
     if ( !fs_gs_done && !compat )
     {
-        write_fs_base(n->arch.pv.fs_base);
-        write_gs_shadow(n->arch.pv.gs_base_kernel);
-        write_gs_base(n->arch.pv.gs_base_user);
-
-        /* If in kernel mode then switch the GS bases around. */
-        if ( (n->arch.flags & TF_kernel_mode) )
+        if ( read_cr4() & X86_CR4_FSGSBASE )
+        {
+            __wrgsbase(gss);
+            __wrfsbase(n->arch.pv.fs_base);
             asm volatile ( "swapgs" );
+            __wrgsbase(gsb);
+        }
+        else
+        {
+            wrmsrl(MSR_FS_BASE, n->arch.pv.fs_base);
+            wrmsrl(MSR_GS_BASE, gsb);
+            wrmsrl(MSR_SHADOW_GS_BASE, gss);
+        }
     }
 
     if ( unlikely(!all_segs_okay) )
@@ -1703,16 +1716,31 @@  static void save_segments(struct vcpu *v)
 {
     struct cpu_user_regs *regs = &v->arch.user_regs;
 
-    regs->ds = read_sreg(ds);
-    regs->es = read_sreg(es);
-    regs->fs = read_sreg(fs);
-    regs->gs = read_sreg(gs);
+    asm volatile ( "mov %%ds, %[ds];\n\t"
+                   "mov %%es, %[es];\n\t"
+                   "mov %%fs, %[fs];\n\t"
+                   "mov %%gs, %[gs];\n\t"
+                   : [ds] "=m" (regs->ds),
+                     [es] "=m" (regs->es),
+                     [fs] "=m" (regs->fs),
+                     [gs] "=m" (regs->gs) );
 
     if ( !is_pv_32bit_vcpu(v) )
     {
-        unsigned long gs_base = read_gs_base();
+        unsigned long fs_base, gs_base;
+
+        if ( read_cr4() & X86_CR4_FSGSBASE )
+        {
+            fs_base = __rdfsbase();
+            gs_base = __rdgsbase();
+        }
+        else
+        {
+            rdmsrl(MSR_FS_BASE, fs_base);
+            rdmsrl(MSR_GS_BASE, gs_base);
+        }
 
-        v->arch.pv.fs_base = read_fs_base();
+        v->arch.pv.fs_base = fs_base;
         if ( v->arch.flags & TF_kernel_mode )
             v->arch.pv.gs_base_kernel = gs_base;
         else