Patchwork [00/12] x86/crypto: Fix RBP usage in several crypto .S files

login
register
mail settings
Submitter Josh Poimboeuf
Date Sept. 13, 2017, 9:24 p.m.
Message ID <20170913212428.kibwbqs2f7dkeslb@treble>
Download mbox | patch
Permalink /patch/9952137/
State Not Applicable
Delegated to: Herbert Xu
Headers show

Comments

Josh Poimboeuf - Sept. 13, 2017, 9:24 p.m.
On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > 
> > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Eric Biggers <ebiggers3@gmail.com> wrote:
> > > > 
> > > > > Thanks for fixing these!  I don't have time to review these in detail, but I ran
> > > > > the crypto self-tests on the affected algorithms, and they all pass.  I also
> > > > > benchmarked them before and after; the only noticable performance difference was
> > > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> > > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > > 
> > > > Which CPU model did you use for the test?
> > > > 
> > > > Thanks,
> > > > 
> > > > 	Ingo
> > > 
> > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > 
> > Any chance to test this with the latest microarchitecture - any Skylake derivative 
> > Intel CPU you have access to?
> > 
> > Thanks,
> > 
> > 	Ingo
> 
> Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The results
> were the following which seemed a bit worse than Haswell:
> 
> 	sha256-avx2 became 3.5% slower
> 	sha512-avx2 became 7.5% slower
> 
> Note: it's tricky to benchmark this, especially with just a few percent
> difference, so don't read too much into the exact numbers.

Here's a v2 for the sha256-avx2 patch, would you mind seeing if this
closes the performance gap?

I'm still looking at the other one (sha512-avx2), but so far I haven't
found a way to speed it back up.

From: Josh Poimboeuf <jpoimboe@redhat.com>
Subject: [PATCH] x86/crypto: Fix RBP usage in sha256-avx2-asm.S

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

There's no need to use RBP as a temporary register for the TBL value,
because it always stores the same value: the address of the K256 table.
Instead just reference the address of K256 directly.

Reported-by: Eric Biggers <ebiggers@google.com>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 arch/x86/crypto/sha256-avx2-asm.S | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)
Ingo Molnar - Sept. 14, 2017, 9:16 a.m.
* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

Here's a couple of very quick observations with possible optimizations:

AFAICS the main effect of the RBP fixes is the introduction of a memory load into 
the critical path, into the body unrolled loop:

+       mov frame_TBL(%rsp), TBL
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Both 'TLB' and 'T1' are mapped to R12, which is why TBL has to be spilled to be 
reloaded from the stack.

1)

Note how R12 is used immediately, right in the next instruction:

        vpaddq  (TBL), Y_0, XFER

I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
constraint and a few extra cycles per loop iteration if the workload is 
address-generator bandwidth limited on that.

A simple way to ease that constraint would be to move the 'TLB' load up into the 
loop, body, to the point where 'T1' is used for the last time - which is:


        mov     a, T1           # T1 = a                                # MAJB
        and     c, T1           # T1 = a&c                              # MAJB

        add     y0, y2          # y2 = S1 + CH                          # --
        or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ

+       mov frame_TBL(%rsp), TBL

        add     y1, h           # h = k + w + h + S0                    # --

        add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --

        add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
        add     y3, h           # h = t1 + S0 + MAJ                     # --

Note how this moves up the 'TLB' reload by 4 instructions.

2)

If this does not get back performance, then maybe another reason is that it's 
cache access latency limited, in which case a more involved optimization would be 
to look at the register patterns and usages:


			first-use	last-use		use-length
	a:		#10		#29			20
	b:		#24		#24			 1
	c:		#14		#30			17
	d:		#23		#34			12
	e:		#11		#20			10
	f:		#15		#15			 1
	g:		#18		#27			10
	h:		#13		#36			24

	y0:		#11		#31			21
	y1:		#12		#33			22
	y2:		#15		#35			21
	y3:		#10		#36			27

	T1:		#16		#32			17

The 'first-use' colums shows the number of the instruction within the loop body 
that the register gets used - with '#1' denoting the first instruction ad #36 the 
last instruction, the 'last-use' column is showing the last instruction, and the 
'use-length' colum shows the 'window' in which a register is used.

What we want are the registers that are used the most tightly, i.e. these two:

	b:		#24		#24			 1
	f:		#15		#15			 1

Of these two 'f' is the best one, because it has an earlier use and longer 
cooldown.

If alias 'TBL' with 'f' then we could reload 'TLB' for the next iteration very 
early on:

        mov     f, y2           # y2 = f                                # CH
+       mov frame_TBL(%rsp), TBL
        rorx    $34, a, T1      # T1 = a >> 34                          # S0B

And there will be 21 instructions that don't depend on TLB after this, plenty of 
time for the load to be generated and propagated.

NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState 
macro name rotation. It's still fundamentally possible I believe, it's just that 
'TBL' has to be rotated too, together with the other varibles.

3)

If even this does not help, because the workload is ucode-cache limited, and the 
extra reloads pushed the critical path just beyond some cache limit, then another 
experiment to try would be to roll _back_ the loop some more: instead of 4x 
FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.

The CPU should still be smart enough with 2x interleaving of the loop body, and 
the extra branches should be relatively small and we could get back some 
performance.

In theory ...

4)

If the workload is fundamentally cache-port bandwidth limited, then the extra 
loads from memory to reload 'TLB' take away valuable bandwidth. There's no easy 
fix for that, but to find an unused register.

Here's the (initial, pre-rotation) integer register mappings:

	a:	RAX
	b:	RBX
	c:	RCX
	d:	R8
	e:	RDX
	f:	R9
	g:	R10
	h:	R11

	y0:	R13
	y1:	R14
	y2:	R15
	y3:	RSI

	T1:	R12

	TLB:	R12 # aliased to T1

Look what's missing: I don't see RDI being used in the loop.

RDI is allocated to 'CTX', but that's only used in higher level glue code, it does 
not appear to be used in the inner loops (explicitly at least).

So if this observation of mine is true we could go back to the old code for the 
hotpath, but use RDI for TBL and not reload it in the hotpath.

Thanks,

	Ingo
Ingo Molnar - Sept. 14, 2017, 9:28 a.m.
* Ingo Molnar <mingo@kernel.org> wrote:

> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
>         vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TLB' load up into the 
> loop, body, to the point where 'T1' is used for the last time - which is:
> 
> 
>         mov     a, T1           # T1 = a                                # MAJB
>         and     c, T1           # T1 = a&c                              # MAJB
> 
>         add     y0, y2          # y2 = S1 + CH                          # --
>         or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
> 
> +       mov frame_TBL(%rsp), TBL
> 
>         add     y1, h           # h = k + w + h + S0                    # --
> 
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --
> 
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
>         add     y3, h           # h = t1 + S0 + MAJ                     # --
> 
> Note how this moves up the 'TLB' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st 
iteration, via something like:

        movq    $4, frame_SRND(%rsp)

+	mov frame_TBL(%rsp), TBL

.align 16
loop1:
        vpaddq  (TBL), Y_0, XFER
        vmovdqa XFER, frame_XFER(%rsp)
        FOUR_ROUNDS_AND_SCHED

Thanks,

	Ingo
Josh Poimboeuf - Sept. 14, 2017, 1:28 p.m.
On Thu, Sep 14, 2017 at 11:16:12AM +0200, Ingo Molnar wrote:
> 
> * Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> 
> > I'm still looking at the other one (sha512-avx2), but so far I haven't
> > found a way to speed it back up.
> 
> Here's a couple of very quick observations with possible optimizations:
> 
> AFAICS the main effect of the RBP fixes is the introduction of a memory load into 
> the critical path, into the body unrolled loop:
> 
> +       mov frame_TBL(%rsp), TBL
>         vpaddq  (TBL), Y_0, XFER
>         vmovdqa XFER, frame_XFER(%rsp)
>         FOUR_ROUNDS_AND_SCHED
> 
> Both 'TLB' and 'T1' are mapped to R12, which is why TBL has to be spilled to be 
> reloaded from the stack.
> 
> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
>         vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TLB' load up into the 
> loop, body, to the point where 'T1' is used for the last time - which is:
> 
> 
>         mov     a, T1           # T1 = a                                # MAJB
>         and     c, T1           # T1 = a&c                              # MAJB
> 
>         add     y0, y2          # y2 = S1 + CH                          # --
>         or      T1, y3          # y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
> 
> +       mov frame_TBL(%rsp), TBL
> 
>         add     y1, h           # h = k + w + h + S0                    # --
> 
>         add     y2, d           # d = k + w + h + d + S1 + CH = d + t1  # --
> 
>         add     y2, h           # h = k + w + h + S0 + S1 + CH = t1 + S0# --
>         add     y3, h           # h = t1 + S0 + MAJ                     # --
> 
> Note how this moves up the 'TLB' reload by 4 instructions.
> 
> 2)
> 
> If this does not get back performance, then maybe another reason is that it's 
> cache access latency limited, in which case a more involved optimization would be 
> to look at the register patterns and usages:
> 
> 
> 			first-use	last-use		use-length
> 	a:		#10		#29			20
> 	b:		#24		#24			 1
> 	c:		#14		#30			17
> 	d:		#23		#34			12
> 	e:		#11		#20			10
> 	f:		#15		#15			 1
> 	g:		#18		#27			10
> 	h:		#13		#36			24
> 
> 	y0:		#11		#31			21
> 	y1:		#12		#33			22
> 	y2:		#15		#35			21
> 	y3:		#10		#36			27
> 
> 	T1:		#16		#32			17
> 
> The 'first-use' colums shows the number of the instruction within the loop body 
> that the register gets used - with '#1' denoting the first instruction ad #36 the 
> last instruction, the 'last-use' column is showing the last instruction, and the 
> 'use-length' colum shows the 'window' in which a register is used.
> 
> What we want are the registers that are used the most tightly, i.e. these two:
> 
> 	b:		#24		#24			 1
> 	f:		#15		#15			 1
> 
> Of these two 'f' is the best one, because it has an earlier use and longer 
> cooldown.
> 
> If alias 'TBL' with 'f' then we could reload 'TLB' for the next iteration very 
> early on:
> 
>         mov     f, y2           # y2 = f                                # CH
> +       mov frame_TBL(%rsp), TBL
>         rorx    $34, a, T1      # T1 = a >> 34                          # S0B
> 
> And there will be 21 instructions that don't depend on TLB after this, plenty of 
> time for the load to be generated and propagated.
> 
> NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState 
> macro name rotation. It's still fundamentally possible I believe, it's just that 
> 'TBL' has to be rotated too, together with the other varibles.
> 
> 3)
> 
> If even this does not help, because the workload is ucode-cache limited, and the 
> extra reloads pushed the critical path just beyond some cache limit, then another 
> experiment to try would be to roll _back_ the loop some more: instead of 4x 
> FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.
> 
> The CPU should still be smart enough with 2x interleaving of the loop body, and 
> the extra branches should be relatively small and we could get back some 
> performance.
> 
> In theory ...
> 
> 4)
> 
> If the workload is fundamentally cache-port bandwidth limited, then the extra 
> loads from memory to reload 'TLB' take away valuable bandwidth. There's no easy 
> fix for that, but to find an unused register.
> 
> Here's the (initial, pre-rotation) integer register mappings:
> 
> 	a:	RAX
> 	b:	RBX
> 	c:	RCX
> 	d:	R8
> 	e:	RDX
> 	f:	R9
> 	g:	R10
> 	h:	R11
> 
> 	y0:	R13
> 	y1:	R14
> 	y2:	R15
> 	y3:	RSI
> 
> 	T1:	R12
> 
> 	TLB:	R12 # aliased to T1
> 
> Look what's missing: I don't see RDI being used in the loop.
> 
> RDI is allocated to 'CTX', but that's only used in higher level glue code, it does 
> not appear to be used in the inner loops (explicitly at least).
> 
> So if this observation of mine is true we could go back to the old code for the 
> hotpath, but use RDI for TBL and not reload it in the hotpath.

Thanks for the excellent breakdown.

When I looked at the patch again, I came to the same conclusion as your
#4, which is that RDI isn't being used in the inner loops.  It *is* used
in the outermost loop, however.

So v2 of my sha512-avx2-asm.S patch spilled CTX onto the stack, instead
of TBL:

  https://lkml.kernel.org/r/20170913223303.pskmy2v7nto6rvtg@treble

That should result in fewer stack accesses, since it's only loaded from
the stack once in the outer loop.

If we still need more performance after that, then we can attempt
something similar to your other suggestions, except with CTX instead of
TBL.

Patch

diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..1420db15dcdd 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -98,8 +98,6 @@  d	= %r8d
 e       = %edx	# clobbers NUM_BLKS
 y3	= %esi	# clobbers INP
 
-
-TBL	= %rbp
 SRND	= CTX	# SRND is same register as CTX
 
 a = %eax
@@ -531,7 +529,6 @@  STACK_SIZE	= _RSP      + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
 	pushq	%rbx
-	pushq	%rbp
 	pushq	%r12
 	pushq	%r13
 	pushq	%r14
@@ -568,8 +565,6 @@  ENTRY(sha256_transform_rorx)
 	mov	CTX, _CTX(%rsp)
 
 loop0:
-	lea     K256(%rip), TBL
-
 	## Load first 16 dwords from two blocks
 	VMOVDQ	0*32(INP),XTMP0
 	VMOVDQ	1*32(INP),XTMP1
@@ -597,19 +592,19 @@  last_block_enter:
 
 .align 16
 loop1:
-	vpaddd	0*32(TBL, SRND), X0, XFER
+	vpaddd	K256+0*32(SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 0*32
 
-	vpaddd	1*32(TBL, SRND), X0, XFER
+	vpaddd	K256+1*32(SRND), X0, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 1*32
 
-	vpaddd	2*32(TBL, SRND), X0, XFER
+	vpaddd	K256+2*32(SRND), X0, XFER
 	vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 2*32
 
-	vpaddd	3*32(TBL, SRND), X0, XFER
+	vpaddd	K256+3*32(SRND), X0, XFER
 	vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 3*32
 
@@ -619,10 +614,11 @@  loop1:
 
 loop2:
 	## Do last 16 rounds with no scheduling
-	vpaddd	0*32(TBL, SRND), X0, XFER
+	vpaddd	K256+0*32(SRND), X0, XFER
 	vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 0*32
-	vpaddd	1*32(TBL, SRND), X1, XFER
+
+	vpaddd	K256+1*32(SRND), X1, XFER
 	vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 1*32
 	add	$2*32, SRND
@@ -676,9 +672,6 @@  loop3:
 	ja	done_hash
 
 do_last_block:
-	#### do last block
-	lea	K256(%rip), TBL
-
 	VMOVDQ	0*16(INP),XWORD0
 	VMOVDQ	1*16(INP),XWORD1
 	VMOVDQ	2*16(INP),XWORD2
@@ -718,7 +711,6 @@  done_hash:
 	popq	%r14
 	popq	%r13
 	popq	%r12
-	popq	%rbp
 	popq	%rbx
 	ret
 ENDPROC(sha256_transform_rorx)