x86: Use LOCK ADD instead of MFENCE for smp_mb()

Message ID 20200921130423.8035-1-andrew.cooper3@citrix.com (mailing list archive)
State New, archived
Series x86: Use LOCK ADD instead of MFENCE for smp_mb()

Commit Message

Andrew Cooper Sept. 21, 2020, 1:04 p.m. UTC
MFENCE is overly heavyweight for SMP semantics on WB memory, because it also
orders weaker cached writes, and flushes the WC buffers.

This technique was used as an optimisation in Java[1], and later adopted by
Linux[2] where it was measured to have a 60% performance improvement in VirtIO
benchmarks.

The stack is used because it is hot in the L1 cache, and a -4 offset is used
to avoid creating a false data dependency on live data.  (For 64bit userspace,
the offset needs to be under the red zone to avoid false dependences).

Fix up the 32 bit definitions in HVMLoader and libxc to avoid a false data
dependency.

[1] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
[2] https://git.kernel.org/torvalds/c/450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
CC: Jan Beulich <JBeulich@suse.com>
CC: Roger Pau Monné <roger.pau@citrix.com>
CC: Wei Liu <wl@xen.org>
CC: Ian Jackson <Ian.Jackson@citrix.com>
---
 tools/firmware/hvmloader/util.h   | 2 +-
 tools/libs/ctrl/include/xenctrl.h | 4 ++--
 xen/include/asm-x86/system.h      | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)
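
For context, a minimal sketch of the pattern this barrier exists for: on WB
memory, the reordering x86 permits is a later load passing an earlier store
(via the store buffer), and the classic store-buffer/Dekker shape is where
that bites.  The cpu0()/cpu1() split below is illustrative only; xen_mb() is
the 64-bit userspace form from this patch.

    #define xen_mb()  asm volatile ( "lock addl $0, -128(%%rsp)" ::: "memory" )

    volatile int flag0, flag1;

    void cpu0(void)               /* runs on one CPU */
    {
        flag0 = 1;
        xen_mb();                 /* order the store above before the load below */
        if ( flag1 == 0 )
        {
            /* ... enter the critical section ... */
        }
    }

    void cpu1(void)               /* runs on another CPU */
    {
        flag1 = 1;
        xen_mb();
        if ( flag0 == 0 )
        {
            /* ... enter the critical section ... */
        }
    }

Without the barrier, both CPUs can read 0 and enter together, because each
store may still be sitting in its CPU's store buffer when the other CPU's
load executes.  A LOCK'd read-modify-write forces prior stores to become
globally visible before later loads, which is why it can stand in for MFENCE
for this purpose.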

Comments

Jan Beulich Sept. 21, 2020, 1:58 p.m. UTC | #1
On 21.09.2020 15:04, Andrew Cooper wrote:
> MFENCE is overly heavyweight for SMP semantics on WB memory, because it also
> orders weaker cached writes, and flushes the WC buffers.
> 
> This technique was used as an optimisation in Java[1], and later adopted by
> Linux[2] where it was measured to have a 60% performance improvement in VirtIO
> benchmarks.
> 
> The stack is used because it is hot in the L1 cache, and a -4 offset is used
> to avoid creating a false data dependency on live data.  (For 64bit userspace,
> the offset needs to be under the red zone to avoid false dependences).
> 
> Fix up the 32 bit definitions in HVMLoader and libxc to avoid a false data
> dependency.
> 
> [1] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> [2] https://git.kernel.org/torvalds/c/450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>

For the hypervisor and hvmloader part:
Reviewed-by: Jan Beulich <jbeulich@suse.com>

> --- a/tools/libs/ctrl/include/xenctrl.h
> +++ b/tools/libs/ctrl/include/xenctrl.h
> @@ -68,11 +68,11 @@
>  #define xen_barrier() asm volatile ( "" : : : "memory")
>  
>  #if defined(__i386__)
> -#define xen_mb()  asm volatile ( "lock; addl $0,0(%%esp)" : : : "memory" )

If this causes a false dependency (which I agree it does), how
come ...

> +#define xen_mb()  asm volatile ( "lock addl $0, -4(%%esp)" ::: "memory" )
>  #define xen_rmb() xen_barrier()
>  #define xen_wmb() xen_barrier()
>  #elif defined(__x86_64__)
> -#define xen_mb()  asm volatile ( "mfence" : : : "memory")
> +#define xen_mb()  asm volatile ( "lock addl $0, -128(%%rsp)" ::: "memory" )

... this doesn't? It accesses the bottom 4 bytes of the redzone,
doesn't it?

As a minor other thought for all of its incarnations: Is a 32-bit
memory access really the best choice? Wouldn't an 8-bit access
further reduce (albeit not eliminate) the risk of unnecessary
dependencies between this memory access and others in functions
called from the users of this barrier?

Or actually, in the hypervisor case, since the used stack slot
would typically hold the return address of the next level's
functions, would a 64-bit access or one further away from the
current stack pointer not help avoid partial dependencies?

And finally, already when Linux used this for just 32-bit I've
always been wondering why they bother preserving the contents of
this piece of memory. How about using NOT (saving the immediate
byte) or XCHG (requiring a dead register instead of the saved
arithmetic, immediate byte, and lock prefix)?

Jan
Andrew Cooper Sept. 25, 2020, 1:22 p.m. UTC | #2
On 21/09/2020 14:58, Jan Beulich wrote:
> On 21.09.2020 15:04, Andrew Cooper wrote:
>> MFENCE is overly heavyweight for SMP semantics on WB memory, because it also
>> orders weaker cached writes, and flushes the WC buffers.
>>
>> This technique was used as an optimisation in Java[1], and later adopted by
>> Linux[2] where it was measured to have a 60% performance improvement in VirtIO
>> benchmarks.
>>
>> The stack is used because it is hot in the L1 cache, and a -4 offset is used
>> to avoid creating a false data dependency on live data.  (For 64bit userspace,
>> the offset needs to be under the red zone to avoid false dependences).
>>
>> Fix up the 32 bit definitions in HVMLoader and libxc to avoid a false data
>> dependency.
>>
>> [1] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
>> [2] https://git.kernel.org/torvalds/c/450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730
>>
>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> For the hypervisor and hvmloader part:
> Reviewed-by: Jan Beulich <jbeulich@suse.com>
>
>> --- a/tools/libs/ctrl/include/xenctrl.h
>> +++ b/tools/libs/ctrl/include/xenctrl.h
>> @@ -68,11 +68,11 @@
>>  #define xen_barrier() asm volatile ( "" : : : "memory")
>>  
>>  #if defined(__i386__)
>> -#define xen_mb()  asm volatile ( "lock; addl $0,0(%%esp)" : : : "memory" )
> If this causes a false dependency (which I agree it does), how
> come ...
>
>> +#define xen_mb()  asm volatile ( "lock addl $0, -4(%%esp)" ::: "memory" )
>>  #define xen_rmb() xen_barrier()
>>  #define xen_wmb() xen_barrier()
>>  #elif defined(__x86_64__)
>> -#define xen_mb()  asm volatile ( "mfence" : : : "memory")
>> +#define xen_mb()  asm volatile ( "lock addl $0, -128(%%rsp)" ::: "memory" )
> ... this doesn't? It accesses the bottom 4 bytes of the redzone,
> doesn't it?

I suppose it does.  However, if anything, I'd be tempted to move it
closer, perhaps even to -32 or so.

The red-zone isn't used a great deal to start with.  It is only useful
for blocks of code with a working set marginally larger than the GPRs,
and no calls/etc.  This is far more commonly seen in number crunching
code, and rare in data-marshalling logic around a shared memory buffer.

However, you also don't want the memory operand to hit a cache line
which isn't in L1, because that will be far slower than any false data
dependency in the memory order buffer.

OTOH, this is userspace, and neither a locked access hitting DRAM, nor a
false data dependency on the bottom of the red zone, is going to be the
dominating factor in overall performance.
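
For concreteness, the "move it closer" variant would only change the
displacement (illustrative only; not what this patch applies):

    #define xen_mb()  asm volatile ( "lock addl $0, -32(%%rsp)" ::: "memory" )

The trade-off being weighed is a better chance that the accessed line is
already hot in L1, against a higher chance of aliasing live red-zone data.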

> As a minor other thought for all of its incarnations: Is a 32-bit
> memory access really the best choice?

Yes.

> Wouldn't an 8-bit access
> further reduce (albeit not eliminate) the risk of unnecessary
> dependencies between this memory access and others in functions
> called from the users of this barrier?

Not really.  The overwhelming majority of stack accesses are full words,
and would still alias even if the memory order buffer tracks aliasing at
byte granularity (which it might not, to save silicon area).

8-bit or 32-bit accesses are probably identical, based on the arch and
uarch details I'm aware of.  However, just in case they are not, I'm
definitely going for the form in common with several other major
software projects.
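
For reference, the byte-sized form being asked about would look like this
(illustrative only; any LOCK'd read-modify-write is a full barrier
regardless of operand width, and the patch keeps the 32-bit form shared
with Linux and others):

    #define xen_mb()  asm volatile ( "lock addb $0, -1(%%rsp)" ::: "memory" )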

> Or actually, in the hypervisor case, since the used stack slot
> would typically hold the return address of the next level's
> functions, would a 64-bit access or one further away from the
> current stack pointer not help avoid partial dependencies?

No.  The following call instruction is a write onto the stack, and will
form a write-after-write condition, rather than a false read-after-write
dependency.
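
To make the overlap explicit (hypervisor case; next_level() is a
hypothetical callee):

    smp_mb();       /* lock addl $0, -4(%rsp): RMW of bytes [rsp-4 .. rsp-1] */
    next_level();   /* the CALL subtracts 8 from %rsp and stores the return
                       address at the old [rsp-8 .. rsp-1], covering the
                       barrier's operand: a write-after-write, not a load
                       that has to wait for the barrier's data */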

> And finally, already when Linux used this for just 32-bit I've
> always been wondering why they bother preserving the contents of
> this piece of memory. How about using NOT (saving the immediate
> byte) or XCHG (requiring a dead register instead of the saved
> arithmetic, immediate byte, and lock prefix)?

Because performance is not dictated by the instruction length.

For equivalent instructions, the shorter encoding provides a marginal
benefit in terms of reduced instruction cache space, compound
improvements such as possibly being able to relax JMP disp32 to JMP
disp8, and increased decode bandwidth to a point.  There are some code
layout patterns where the longer forms are actually faster to decode on
some microarchitectures.

However, most instructions are not equivalent.

The XCHG plan may force a register to be spilled.  Some processors
really do track zeroes all the way out into the cache[1] and make
optimisations based on this knowledge.  ADD $0 is a no-op which can be
spotted at decode time by a sufficiently clever pipeline.
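
For reference, the three shapes under discussion, written out (the macro
names are hypothetical, and the byte counts refer to the 64-bit,
disp8-addressed encodings):

    #define mb_add()  asm volatile ( "lock addl $0, -4(%%rsp)" ::: "memory" )      /* 6 bytes, preserves the slot */
    #define mb_not()  asm volatile ( "lock notl -4(%%rsp)" ::: "memory" )          /* 5 bytes, clobbers the slot */
    #define mb_xchg() asm volatile ( "xchg %%eax, -4(%%rsp)" ::: "eax", "memory" ) /* 4 bytes, implicit LOCK, needs a dead register */

All three are architecturally full barriers; the argument above is that the
shorter encodings buy very little, XCHG costs a register, and ADD $0
uniquely leaves the slot's value (and any zero-tracking optimisation)
undisturbed.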

~Andrew

[1] https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html
Wei Liu Sept. 30, 2020, 3:02 p.m. UTC | #3
On Mon, Sep 21, 2020 at 02:04:23PM +0100, Andrew Cooper wrote:
> MFENCE is overly heavyweight for SMP semantics on WB memory, because it also
> orders weaker cached writes, and flushes the WC buffers.
> 
> This technique was used as an optimisation in Java[1], and later adopted by
> Linux[2] where it was measured to have a 60% performance improvement in VirtIO
> benchmarks.
> 
> The stack is used because it is hot in the L1 cache, and a -4 offset is used
> to avoid creating a false data dependency on live data.  (For 64bit userspace,
> the offset needs to be under the red zone to avoid false dependences).
> 
> Fix up the 32 bit definitions in HVMLoader and libxc to avoid a false data
> dependency.
> 
> [1] https://shipilev.net/blog/2014/on-the-fence-with-dependencies/
> [2] https://git.kernel.org/torvalds/c/450cbdd0125cfa5d7bbf9e2a6b6961cc48d29730
> 
> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
> ---
> CC: Jan Beulich <JBeulich@suse.com>
> CC: Roger Pau Monné <roger.pau@citrix.com>
> CC: Wei Liu <wl@xen.org>
> CC: Ian Jackson <Ian.Jackson@citrix.com>
> ---
>  tools/firmware/hvmloader/util.h   | 2 +-
>  tools/libs/ctrl/include/xenctrl.h | 4 ++--

If this is ever needed:

Acked-by: Wei Liu <wl@xen.org>

I have not followed the discussion in the thread closely, but the change
looks to be following what Linux does, so I'm certainly fine with this.

Patch

diff --git a/tools/firmware/hvmloader/util.h b/tools/firmware/hvmloader/util.h
index 31889de634..4f0baade0e 100644
--- a/tools/firmware/hvmloader/util.h
+++ b/tools/firmware/hvmloader/util.h
@@ -133,7 +133,7 @@  static inline void cpu_relax(void)
 #define barrier() asm volatile ( "" : : : "memory" )
 #define rmb()     barrier()
 #define wmb()     barrier()
-#define mb()      asm volatile ( "lock; addl $0,0(%%esp)" : : : "memory" )
+#define mb()      asm volatile ( "lock addl $0, -4(%%esp)" ::: "memory" )
 
 /*
  * Divide a 64-bit dividend by a 32-bit divisor.
diff --git a/tools/libs/ctrl/include/xenctrl.h b/tools/libs/ctrl/include/xenctrl.h
index 73e9535fc8..1d9f514302 100644
--- a/tools/libs/ctrl/include/xenctrl.h
+++ b/tools/libs/ctrl/include/xenctrl.h
@@ -68,11 +68,11 @@ 
 #define xen_barrier() asm volatile ( "" : : : "memory")
 
 #if defined(__i386__)
-#define xen_mb()  asm volatile ( "lock; addl $0,0(%%esp)" : : : "memory" )
+#define xen_mb()  asm volatile ( "lock addl $0, -4(%%esp)" ::: "memory" )
 #define xen_rmb() xen_barrier()
 #define xen_wmb() xen_barrier()
 #elif defined(__x86_64__)
-#define xen_mb()  asm volatile ( "mfence" : : : "memory")
+#define xen_mb()  asm volatile ( "lock addl $0, -128(%%rsp)" ::: "memory" )
 #define xen_rmb() xen_barrier()
 #define xen_wmb() xen_barrier()
 #elif defined(__arm__)
diff --git a/xen/include/asm-x86/system.h b/xen/include/asm-x86/system.h
index 7e5891f3df..6474dd1243 100644
--- a/xen/include/asm-x86/system.h
+++ b/xen/include/asm-x86/system.h
@@ -226,7 +226,7 @@  static always_inline unsigned long __xadd(
  *
  * Refer to the vendor system programming manuals for further details.
  */
-#define smp_mb()        mb()
+#define smp_mb()        asm volatile ( "lock addl $0, -4(%%rsp)" ::: "memory" )
 #define smp_rmb()       barrier()
 #define smp_wmb()       barrier()