
[3/5] x86/pv: Optimise prefetching in svm_load_segs()

Message ID 20200909095920.25495-4-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series x86/pv: Minor perf improvements in segment handling

Commit Message

Andrew Cooper Sept. 9, 2020, 9:59 a.m. UTC
Split into two functions.  Passing a load of zeros in results in somewhat poor
register scheduling in __context_switch().

Update the prefetching comment to note that the main point is the TLB fill.

Reorder the writes in svm_load_segs() to access the VMCB fields in ascending
order, which gets better next-line prefetch behaviour out of hardware.  Update
the prefetch instruction to match.

The net delta is:

  add/remove: 1/0 grow/shrink: 0/2 up/down: 38/-39 (-1)
  Function                                     old     new   delta
  svm_load_segs_prefetch                         -      38     +38
  __context_switch                             967     951     -16
  svm_load_segs                                291     268     -23

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
---
 xen/arch/x86/domain.c             |  2 +-
 xen/arch/x86/hvm/svm/svm.c        | 43 ++++++++++++++++++++-------------------
 xen/include/asm-x86/hvm/svm/svm.h |  5 +++--
 3 files changed, 26 insertions(+), 24 deletions(-)

Comments

Jan Beulich Sept. 10, 2020, 2:57 p.m. UTC | #1
On 09.09.2020 11:59, Andrew Cooper wrote:
> Split into two functions.  Passing a load of zeros in results in somewhat poor
> register scheduling in __context_switch().

I'm afraid I don't understand why this would be, no matter that
I trust you having observed this being the case: The registers
used for passing parameters are all call-clobbered anyway, so
the compiler can't use them for anything across the call. And
it would look pretty poor code generation wise if the XORs to
clear them (which effectively have no latency at all) would be
scheduled far ahead of the call, especially when there's better
use for the registers. The observation wasn't possibly from
before your recent dropping of two of the parameters, when they
couldn't all be passed in registers (albeit even then it would
be odd, as the change then should merely have lead to a slightly
smaller stack frame of the function)?

> Update the prefetching comment to note that the main point is the TLB fill.
> 
> Reorder the writes in svm_load_segs() to access the VMCB fields in ascending
> order, which gets better next-line prefetch behaviour out of hardware.  Update
> the prefetch instruction to match.
> 
> The net delta is:
> 
>   add/remove: 1/0 grow/shrink: 0/2 up/down: 38/-39 (-1)
>   Function                                     old     new   delta
>   svm_load_segs_prefetch                         -      38     +38
>   __context_switch                             967     951     -16
>   svm_load_segs                                291     268     -23

A net win of 1 byte ;-)

> --- a/xen/arch/x86/hvm/svm/svm.c
> +++ b/xen/arch/x86/hvm/svm/svm.c
> @@ -1520,6 +1520,19 @@ static void svm_init_erratum_383(const struct cpuinfo_x86 *c)
>  }
>  
>  #ifdef CONFIG_PV
> +void svm_load_segs_prefetch(void)
> +{
> +    struct vmcb_struct *vmcb = this_cpu(host_vmcb_va);

const?

> +    if ( vmcb )
> +        /*
> +         * The main reason for this prefetch is for the TLB fill.  Use the
> +         * opporunity to fetch the lowest address used, to get the best
> +         * behaviour out of hardwares next-line prefetcher.

Nit: "opportunity" and "hardware's" ?

I'm not opposed to the change, but before giving my R-b I'd like to
understand the register scheduling background a little better.

Jan
Andrew Cooper Sept. 10, 2020, 8:30 p.m. UTC | #2
On 10/09/2020 15:57, Jan Beulich wrote:
> On 09.09.2020 11:59, Andrew Cooper wrote:
>> Split into two functions.  Passing a load of zeros in results in somewhat poor
>> register scheduling in __context_switch().
> I'm afraid I don't understand why this would be, no matter that
> I trust you having observed this being the case: The registers
> used for passing parameters are all call-clobbered anyway, so
> the compiler can't use them for anything across the call. And
> it would look pretty poor code generation wise if the XORs to
> clear them (which effectively have no latency at all) would be
> scheduled far ahead of the call, especially when there's better
> use for the registers. The observation wasn't possibly from
> before your recent dropping of two of the parameters, when they
> couldn't all be passed in registers (albeit even then it would
> be odd, as the change then should merely have lead to a slightly
> smaller stack frame of the function)?

Hmm yes.  I wrote this patch before I did the assertion fix, and the
comment didn't rebase very well.

Back then, one of the zeros was on the stack, which was definitely an
unwanted property.  Even though the XORs are mostly free, they're not
totally free, as they cost decode bandwidth and instruction cache space
(Trivial amounts, but still...).

In general, LTO's inter-procedural-analysis can figure out that
svm_load_segs_prefetch() doesn't use many registers, and the caller can
be optimised based on the fact that some registers aren't actually
clobbered.  (Then again, in this case with a sole caller, LTO really
ought to be able to inline and delete the function.)

How about "results in unnecessary caller setup code" ?

~Andrew
Jan Beulich Sept. 11, 2020, 6:31 a.m. UTC | #3
On 10.09.2020 22:30, Andrew Cooper wrote:
> On 10/09/2020 15:57, Jan Beulich wrote:
>> On 09.09.2020 11:59, Andrew Cooper wrote:
>>> Split into two functions.  Passing a load of zeros in results in somewhat poor
>>> register scheduling in __context_switch().
>> I'm afraid I don't understand why this would be, no matter that
>> I trust you having observed this being the case: The registers
>> used for passing parameters are all call-clobbered anyway, so
>> the compiler can't use them for anything across the call. And
>> it would look pretty poor code generation wise if the XORs to
>> clear them (which effectively have no latency at all) would be
>> scheduled far ahead of the call, especially when there's better
>> use for the registers. The observation wasn't possibly from
>> before your recent dropping of two of the parameters, when they
>> couldn't all be passed in registers (albeit even then it would
>> be odd, as the change then should merely have lead to a slightly
>> smaller stack frame of the function)?
> 
> Hmm yes.  I wrote this patch before I did the assertion fix, and it the
> comment didn't rebase very well.
> 
> Back then, one of the zeros was on the stack, which was definitely an
> unwanted property.  Even though the XORs are mostly free, they're not
> totally free, as they cost decode bandwidth and instruction cache space
> (Trivial amounts, but still...).
> 
> In general, LTO's inter-procedural-analysis can figure out that
> svm_load_segs_prefetch() doesn't use many registers, and the caller can
> be optimised based on the fact that some registers aren't actually
> clobbered.  (Then again, in this case with a sole caller, LTO really
> ought to be able to inline and delete the function.)
> 
> How about "results in unnecessary caller setup code" ?

Yeah, that's probably better as a description.

Reviewed-by: Jan Beulich <jbeulich@suse.com>

Jan

Patch

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 2271bee36a..0b0e3f8294 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1928,7 +1928,7 @@  static void __context_switch(void)
     /* Prefetch the VMCB if we expect to use it later in the context switch */
     if ( cpu_has_svm && is_pv_domain(nd) && !is_pv_32bit_domain(nd) &&
          !is_idle_domain(nd) )
-        svm_load_segs(0, 0, 0, 0, 0);
+        svm_load_segs_prefetch();
 #endif
 
     if ( need_full_gdt(nd) && !per_cpu(full_gdt_loaded, cpu) )
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 23b2a2aa17..9a2aca7770 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1520,6 +1520,19 @@  static void svm_init_erratum_383(const struct cpuinfo_x86 *c)
 }
 
 #ifdef CONFIG_PV
+void svm_load_segs_prefetch(void)
+{
+    struct vmcb_struct *vmcb = this_cpu(host_vmcb_va);
+
+    if ( vmcb )
+        /*
+         * The main reason for this prefetch is for the TLB fill.  Use the
+         * opporunity to fetch the lowest address used, to get the best
+         * behaviour out of hardwares next-line prefetcher.
+         */
+        prefetchw(&vmcb->fs);
+}
+
 bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
                    unsigned long fs_base, unsigned long gs_base,
                    unsigned long gs_shadow)
@@ -1530,17 +1543,15 @@  bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
     if ( unlikely(!vmcb) )
         return false;
 
-    if ( !ldt_base )
-    {
-        /*
-         * The actual structure field used here was arbitrarily chosen.
-         * Empirically it doesn't seem to matter much which element is used,
-         * and a clear explanation of the otherwise poor performance has not
-         * been found/provided so far.
-         */
-        prefetchw(&vmcb->ldtr);
-        return true;
-    }
+    vmcb->fs.sel = 0;
+    vmcb->fs.attr = 0;
+    vmcb->fs.limit = 0;
+    vmcb->fs.base = fs_base;
+
+    vmcb->gs.sel = 0;
+    vmcb->gs.attr = 0;
+    vmcb->gs.limit = 0;
+    vmcb->gs.base = gs_base;
 
     if ( likely(!ldt_ents) )
         memset(&vmcb->ldtr, 0, sizeof(vmcb->ldtr));
@@ -1558,16 +1569,6 @@  bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
         vmcb->ldtr.base = ldt_base;
     }
 
-    vmcb->fs.sel = 0;
-    vmcb->fs.attr = 0;
-    vmcb->fs.limit = 0;
-    vmcb->fs.base = fs_base;
-
-    vmcb->gs.sel = 0;
-    vmcb->gs.attr = 0;
-    vmcb->gs.limit = 0;
-    vmcb->gs.base = gs_base;
-
     vmcb->kerngsbase = gs_shadow;
 
     svm_vmload_pa(per_cpu(host_vmcb, cpu));
diff --git a/xen/include/asm-x86/hvm/svm/svm.h b/xen/include/asm-x86/hvm/svm/svm.h
index 2310878e41..faeca40174 100644
--- a/xen/include/asm-x86/hvm/svm/svm.h
+++ b/xen/include/asm-x86/hvm/svm/svm.h
@@ -50,12 +50,13 @@  void __update_guest_eip(struct cpu_user_regs *regs, unsigned int inst_len);
 void svm_update_guest_cr(struct vcpu *, unsigned int cr, unsigned int flags);
 
 /*
- * PV context switch helper. Calls with zero ldt_base request a prefetch of
- * the VMCB area to be loaded from, instead of an actual load of state.
+ * PV context switch helpers.  Prefetching the VMCB area itself has been shown
+ * to be useful for performance.
  *
  * Must only be used for NUL FS/GS, as the segment attributes/limits are not
  * read from the GDT/LDT.
  */
+void svm_load_segs_prefetch(void);
 bool svm_load_segs(unsigned int ldt_ents, unsigned long ldt_base,
                    unsigned long fs_base, unsigned long gs_base,
                    unsigned long gs_shadow);
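
For reference, a minimal sketch of how a caller might use the two helpers
after this patch.  This is illustrative only: the example_* functions are
hypothetical stand-ins, not the actual Xen call sites (those live in
xen/arch/x86/domain.c, as shown in the hunk above).

#include <asm/hvm/svm/svm.h>   /* svm_load_segs(), svm_load_segs_prefetch() */

/* Illustrative, hypothetical callers -- not the real Xen call sites. */

static void example_context_switch_path(void)
{
    /*
     * Early in the context switch: only warm up the host VMCB mapping/TLB.
     * Before this patch the same effect required passing five zeros:
     * svm_load_segs(0, 0, 0, 0, 0).
     */
    svm_load_segs_prefetch();
}

static bool example_segment_load_path(unsigned int ldt_ents,
                                      unsigned long ldt_base,
                                      unsigned long fs_base,
                                      unsigned long gs_base,
                                      unsigned long gs_shadow)
{
    /* Later, with the real segment state to hand, do the full VMCB update. */
    return svm_load_segs(ldt_ents, ldt_base, fs_base, gs_base, gs_shadow);
}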