| Message ID | 20170913212428.kibwbqs2f7dkeslb@treble (mailing list archive) |
| --- | --- |
| State | Not Applicable |
| Delegated to: | Herbert Xu |
* Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

Here are a couple of very quick observations with possible optimizations:

AFAICS the main effect of the RBP fixes is the introduction of a memory load into
the critical path, into the unrolled loop body:

+	mov	frame_TBL(%rsp), TBL
	vpaddq	(TBL), Y_0, XFER
	vmovdqa	XFER, frame_XFER(%rsp)
	FOUR_ROUNDS_AND_SCHED

Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled and
reloaded from the stack.

1)

Note how R12 is used immediately, right in the next instruction:

	vpaddq	(TBL), Y_0, XFER

I.e. the RBP fixes lengthen the program order data dependencies - that's a new
constraint and a few extra cycles per loop iteration if the workload is
address-generator bandwidth limited on that.

A simple way to ease that constraint would be to move the 'TBL' load up into the
loop body, to the point where 'T1' is used for the last time - which is:

	mov	a, T1		# T1 = a                                # MAJB
	and	c, T1		# T1 = a&c                              # MAJB

	add	y0, y2		# y2 = S1 + CH                          # --
	or	T1, y3		# y3 = MAJ = (a|c)&b)|(a&c)             # MAJ

+	mov	frame_TBL(%rsp), TBL

	add	y1, h		# h = k + w + h + S0                    # --

	add	y2, d		# d = k + w + h + d + S1 + CH = d + t1  # --

	add	y2, h		# h = k + w + h + S0 + S1 + CH = t1 + S0# --
	add	y3, h		# h = t1 + S0 + MAJ                     # --

Note how this moves up the 'TBL' reload by 4 instructions.

2)

If this does not get back performance, then maybe another reason is that it's
cache access latency limited, in which case a more involved optimization would be
to look at the register patterns and usages:

			first-use	last-use	use-length
	a:		#10		#29		20
	b:		#24		#24		 1
	c:		#14		#30		17
	d:		#23		#34		12
	e:		#11		#20		10
	f:		#15		#15		 1
	g:		#18		#27		10
	h:		#13		#36		24

	y0:		#11		#31		21
	y1:		#12		#33		22
	y2:		#15		#35		21
	y3:		#10		#36		27

	T1:		#16		#32		17

The 'first-use' column shows the number of the instruction within the loop body
where the register is first used - with '#1' denoting the first instruction and
'#36' the last one - the 'last-use' column shows the last instruction that uses
it, and the 'use-length' column shows the 'window' in which a register is used.

What we want are the registers that are used the most tightly, i.e. these two:

	b:		#24		#24		 1
	f:		#15		#15		 1

Of these two 'f' is the better one, because it has an earlier use and a longer
cooldown.

If we alias 'TBL' with 'f' then we could reload 'TBL' for the next iteration very
early on:

	mov	f, y2		# y2 = f                                # CH
+	mov	frame_TBL(%rsp), TBL
	rorx	$34, a, T1	# T1 = a >> 34                          # S0B

And there will be 21 instructions that don't depend on TBL after this, plenty of
time for the load to be generated and propagated.

NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState
macro name rotation. It's still fundamentally possible I believe, it's just that
'TBL' has to be rotated too, together with the other variables.

3)

If even this does not help, because the workload is ucode-cache limited, and the
extra reloads pushed the critical path just beyond some cache limit, then another
experiment to try would be to roll _back_ the loop some more: instead of 4x
FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.

The CPU should still be smart enough with 2x interleaving of the loop body, and
the extra branches should be relatively small, so we could get back some
performance.

In theory ...

4)

If the workload is fundamentally cache-port bandwidth limited, then the extra
loads from memory to reload 'TBL' take away valuable bandwidth. There's no easy
fix for that, other than finding an unused register.
Here's the (initial, pre-rotation) integer register mappings:

	a:	RAX
	b:	RBX
	c:	RCX
	d:	R8
	e:	RDX
	f:	R9
	g:	R10
	h:	R11

	y0:	R13
	y1:	R14
	y2:	R15
	y3:	RSI

	T1:	R12

	TBL:	R12	# aliased to T1

Look at what's missing: I don't see RDI being used in the loop.

RDI is allocated to 'CTX', but that's only used in the higher level glue code; it
does not appear to be used in the inner loops (explicitly at least).

So if this observation of mine is true, we could go back to the old code for the
hotpath, but use RDI for TBL and not reload it in the hotpath.

Thanks,

	Ingo
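As an aside, a usage table like the one in point 2) above can be generated
mechanically rather than by hand. Below is a minimal sketch, assuming the
36-instruction FOUR_ROUNDS_AND_SCHED body has been copied into a plain-text file;
the filename loop_body.S and the script itself are illustrative additions, not
part of the original patches:

    #!/usr/bin/env python3
    # Rough register-usage scan of an unrolled SHA-512 round body.
    # Assumes 'loop_body.S' (hypothetical name) contains one instruction per
    # line, still using the logical names (a..h, y0..y3, T1) rather than the
    # concrete RAX/RBX/... registers.
    import re

    REGS = ["a", "b", "c", "d", "e", "f", "g", "h",
            "y0", "y1", "y2", "y3", "T1"]

    first_use = {}
    last_use = {}

    with open("loop_body.S") as f:
        insn_no = 0
        for line in f:
            code = line.split("#", 1)[0].strip()   # drop trailing comments
            if not code:
                continue                           # skip blank lines
            insn_no += 1
            for reg in REGS:
                # match the register name as a whole operand word
                if re.search(r"\b%s\b" % re.escape(reg), code):
                    first_use.setdefault(reg, insn_no)
                    last_use[reg] = insn_no

    print("%-4s %-10s %-10s %s" % ("", "first-use", "last-use", "use-length"))
    for reg in REGS:
        if reg in first_use:
            f_, l_ = first_use[reg], last_use[reg]
            print("%-4s #%-9d #%-9d %d" % (reg + ":", f_, l_, l_ - f_ + 1))

Run against the FOUR_ROUNDS_AND_SCHED body, this should reproduce the
first-use/last-use/use-length columns quoted above.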
* Ingo Molnar <mingo@kernel.org> wrote:

> 1)
>
> Note how R12 is used immediately, right in the next instruction:
>
> 	vpaddq	(TBL), Y_0, XFER
>
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new
> constraint and a few extra cycles per loop iteration if the workload is
> address-generator bandwidth limited on that.
>
> A simple way to ease that constraint would be to move the 'TBL' load up into the
> loop body, to the point where 'T1' is used for the last time - which is:
>
> 	mov	a, T1		# T1 = a                                # MAJB
> 	and	c, T1		# T1 = a&c                              # MAJB
>
> 	add	y0, y2		# y2 = S1 + CH                          # --
> 	or	T1, y3		# y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
>
> +	mov	frame_TBL(%rsp), TBL
>
> 	add	y1, h		# h = k + w + h + S0                    # --
>
> 	add	y2, d		# d = k + w + h + d + S1 + CH = d + t1  # --
>
> 	add	y2, h		# h = k + w + h + S0 + S1 + CH = t1 + S0# --
> 	add	y3, h		# h = t1 + S0 + MAJ                     # --
>
> Note how this moves up the 'TBL' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st
iteration, via something like:

	movq	$4, frame_SRND(%rsp)
+	mov	frame_TBL(%rsp), TBL

.align 16
loop1:
	vpaddq	(TBL), Y_0, XFER
	vmovdqa	XFER, frame_XFER(%rsp)
	FOUR_ROUNDS_AND_SCHED

Thanks,

	Ingo
On Thu, Sep 14, 2017 at 11:16:12AM +0200, Ingo Molnar wrote:
>
> * Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>
> > I'm still looking at the other one (sha512-avx2), but so far I haven't
> > found a way to speed it back up.
>
> Here are a couple of very quick observations with possible optimizations:
>
> AFAICS the main effect of the RBP fixes is the introduction of a memory load into
> the critical path, into the unrolled loop body:
>
> +	mov	frame_TBL(%rsp), TBL
> 	vpaddq	(TBL), Y_0, XFER
> 	vmovdqa	XFER, frame_XFER(%rsp)
> 	FOUR_ROUNDS_AND_SCHED
>
> Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled and
> reloaded from the stack.
>
> 1)
>
> Note how R12 is used immediately, right in the next instruction:
>
> 	vpaddq	(TBL), Y_0, XFER
>
> I.e. the RBP fixes lengthen the program order data dependencies - that's a new
> constraint and a few extra cycles per loop iteration if the workload is
> address-generator bandwidth limited on that.
>
> A simple way to ease that constraint would be to move the 'TBL' load up into the
> loop body, to the point where 'T1' is used for the last time - which is:
>
> 	mov	a, T1		# T1 = a                                # MAJB
> 	and	c, T1		# T1 = a&c                              # MAJB
>
> 	add	y0, y2		# y2 = S1 + CH                          # --
> 	or	T1, y3		# y3 = MAJ = (a|c)&b)|(a&c)             # MAJ
>
> +	mov	frame_TBL(%rsp), TBL
>
> 	add	y1, h		# h = k + w + h + S0                    # --
>
> 	add	y2, d		# d = k + w + h + d + S1 + CH = d + t1  # --
>
> 	add	y2, h		# h = k + w + h + S0 + S1 + CH = t1 + S0# --
> 	add	y3, h		# h = t1 + S0 + MAJ                     # --
>
> Note how this moves up the 'TBL' reload by 4 instructions.
>
> 2)
>
> If this does not get back performance, then maybe another reason is that it's
> cache access latency limited, in which case a more involved optimization would be
> to look at the register patterns and usages:
>
> 			first-use	last-use	use-length
> 	a:		#10		#29		20
> 	b:		#24		#24		 1
> 	c:		#14		#30		17
> 	d:		#23		#34		12
> 	e:		#11		#20		10
> 	f:		#15		#15		 1
> 	g:		#18		#27		10
> 	h:		#13		#36		24
>
> 	y0:		#11		#31		21
> 	y1:		#12		#33		22
> 	y2:		#15		#35		21
> 	y3:		#10		#36		27
>
> 	T1:		#16		#32		17
>
> The 'first-use' column shows the number of the instruction within the loop body
> where the register is first used - with '#1' denoting the first instruction and
> '#36' the last one - the 'last-use' column shows the last instruction that uses
> it, and the 'use-length' column shows the 'window' in which a register is used.
>
> What we want are the registers that are used the most tightly, i.e. these two:
>
> 	b:		#24		#24		 1
> 	f:		#15		#15		 1
>
> Of these two 'f' is the better one, because it has an earlier use and a longer
> cooldown.
>
> If we alias 'TBL' with 'f' then we could reload 'TBL' for the next iteration very
> early on:
>
> 	mov	f, y2		# y2 = f                                # CH
> +	mov	frame_TBL(%rsp), TBL
> 	rorx	$34, a, T1	# T1 = a >> 34                          # S0B
>
> And there will be 21 instructions that don't depend on TBL after this, plenty of
> time for the load to be generated and propagated.
>
> NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState
> macro name rotation. It's still fundamentally possible I believe, it's just that
> 'TBL' has to be rotated too, together with the other variables.
>
> 3)
>
> If even this does not help, because the workload is ucode-cache limited, and the
> extra reloads pushed the critical path just beyond some cache limit, then another
> experiment to try would be to roll _back_ the loop some more: instead of 4x
> FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.
>
> The CPU should still be smart enough with 2x interleaving of the loop body, and
> the extra branches should be relatively small, so we could get back some
> performance.
>
> In theory ...
>
> 4)
>
> If the workload is fundamentally cache-port bandwidth limited, then the extra
> loads from memory to reload 'TBL' take away valuable bandwidth. There's no easy
> fix for that, other than finding an unused register.
>
> Here's the (initial, pre-rotation) integer register mappings:
>
> 	a:	RAX
> 	b:	RBX
> 	c:	RCX
> 	d:	R8
> 	e:	RDX
> 	f:	R9
> 	g:	R10
> 	h:	R11
>
> 	y0:	R13
> 	y1:	R14
> 	y2:	R15
> 	y3:	RSI
>
> 	T1:	R12
>
> 	TBL:	R12	# aliased to T1
>
> Look at what's missing: I don't see RDI being used in the loop.
>
> RDI is allocated to 'CTX', but that's only used in the higher level glue code; it
> does not appear to be used in the inner loops (explicitly at least).
>
> So if this observation of mine is true, we could go back to the old code for the
> hotpath, but use RDI for TBL and not reload it in the hotpath.

Thanks for the excellent breakdown.  When I looked at the patch again, I
came to the same conclusion as your #4, which is that RDI isn't being
used in the inner loops.  It *is* used in the outermost loop, however.

So v2 of my sha512-avx2-asm.S patch spilled CTX onto the stack, instead
of TBL:

  https://lkml.kernel.org/r/20170913223303.pskmy2v7nto6rvtg@treble

That should result in fewer stack accesses, since it's only loaded from
the stack once in the outer loop.

If we still need more performance after that, then we can attempt
something similar to your other suggestions, except with CTX instead of
TBL.
diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..1420db15dcdd 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -98,8 +98,6 @@ d = %r8d
 e       = %edx	# clobbers NUM_BLKS
 y3	= %esi	# clobbers INP
 
-
-TBL	= %rbp
 SRND	= CTX	# SRND is same register as CTX
 
 a = %eax
@@ -531,7 +529,6 @@ STACK_SIZE = _RSP + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
 	pushq	%rbx
-	pushq	%rbp
 	pushq	%r12
 	pushq	%r13
 	pushq	%r14
@@ -568,8 +565,6 @@ ENTRY(sha256_transform_rorx)
 	mov	CTX, _CTX(%rsp)
 
 loop0:
-	lea	K256(%rip), TBL
-
 	## Load first 16 dwords from two blocks
 	VMOVDQ	0*32(INP),XTMP0
 	VMOVDQ	1*32(INP),XTMP1
@@ -597,19 +592,19 @@ last_block_enter:
 
 .align 16
 loop1:
-	vpaddd	0*32(TBL, SRND), X0, XFER
+	vpaddd	K256+0*32(SRND), X0, XFER
 	vmovdqa	XFER, 0*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 0*32
 
-	vpaddd	1*32(TBL, SRND), X0, XFER
+	vpaddd	K256+1*32(SRND), X0, XFER
 	vmovdqa	XFER, 1*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 1*32
 
-	vpaddd	2*32(TBL, SRND), X0, XFER
+	vpaddd	K256+2*32(SRND), X0, XFER
 	vmovdqa	XFER, 2*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 2*32
 
-	vpaddd	3*32(TBL, SRND), X0, XFER
+	vpaddd	K256+3*32(SRND), X0, XFER
 	vmovdqa	XFER, 3*32+_XFER(%rsp, SRND)
 	FOUR_ROUNDS_AND_SCHED	_XFER + 3*32
 
@@ -619,10 +614,11 @@ loop1:
 
 loop2:
 	## Do last 16 rounds with no scheduling
-	vpaddd	0*32(TBL, SRND), X0, XFER
+	vpaddd	K256+0*32(SRND), X0, XFER
 	vmovdqa	XFER, 0*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 0*32
-	vpaddd	1*32(TBL, SRND), X1, XFER
+
+	vpaddd	K256+1*32(SRND), X1, XFER
 	vmovdqa	XFER, 1*32+_XFER(%rsp, SRND)
 	DO_4ROUNDS	_XFER + 1*32
 	add	$2*32, SRND
@@ -676,9 +672,6 @@ loop3:
 	ja	done_hash
 
 do_last_block:
-	#### do last block
-	lea	K256(%rip), TBL
-
 	VMOVDQ	0*16(INP),XWORD0
 	VMOVDQ	1*16(INP),XWORD1
 	VMOVDQ	2*16(INP),XWORD2
@@ -718,7 +711,6 @@ done_hash:
 	popq	%r14
 	popq	%r13
 	popq	%r12
-	popq	%rbp
 	popq	%rbx
 	ret
 ENDPROC(sha256_transform_rorx)
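The addressing change in the hunks above works because K256 is a link-time
constant whose address fits in the sign-extended 32-bit displacement used by
kernel code, so the table base can be folded into the displacement and the TBL
base register (and with it %rbp) is no longer needed: 0*32(TBL, SRND) with TBL
pointing at K256 computes the same effective address as K256+0*32(SRND). A tiny
sanity-check sketch of that equivalence follows; the K256 value below is an
arbitrary placeholder, not the symbol's real address:

    # x86 effective address: disp + base + index*scale (scale 1 here).
    # Old form:  disp(TBL, SRND)  -> disp + TBL + SRND, with TBL = &K256
    # New form:  K256+disp(SRND)  -> (K256 + disp) + SRND, no TBL register
    K256 = 0x1000   # placeholder address of the round-constant table

    def old_form(disp, tbl, srnd):
        return disp + tbl + srnd        # disp(TBL, SRND)

    def new_form(disp, srnd):
        return (K256 + disp) + srnd     # K256+disp(SRND)

    for disp in (0 * 32, 1 * 32, 2 * 32, 3 * 32):   # displacements in loop1/loop2
        for srnd in (0, 32, 64, 96):                # a few representative SRND values
            assert old_form(disp, K256, srnd) == new_form(disp, srnd)

    print("TBL-relative and K256-relative forms compute the same addresses")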