Message ID | efeeb0b9979b0377cd313311ad29cf0ac060ae4b.1644569106.git.christophe.leroy@csgroup.eu (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | net: Remove branch in csum_shift() |
From: Christophe Leroy
> Sent: 11 February 2022 08:48
>
> Today's implementation of csum_shift() leads to branching based on
> parity of 'offset'
>
> 	000002f8 <csum_block_add>:
> 	 2f8:	70 a5 00 01 	andi.   r5,r5,1
> 	 2fc:	41 a2 00 08 	beq     304 <csum_block_add+0xc>
> 	 300:	54 84 c0 3e 	rotlwi  r4,r4,24
> 	 304:	7c 63 20 14 	addc    r3,r3,r4
> 	 308:	7c 63 01 94 	addze   r3,r3
> 	 30c:	4e 80 00 20 	blr
>
> Use first bit of 'offset' directly as input of the rotation instead of
> branching.
>
> 	000002f8 <csum_block_add>:
> 	 2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
> 	 2fc:	20 a5 00 20 	subfic  r5,r5,32
> 	 300:	5c 84 28 3e 	rotlw   r4,r4,r5
> 	 304:	7c 63 20 14 	addc    r3,r3,r4
> 	 308:	7c 63 01 94 	addze   r3,r3
> 	 30c:	4e 80 00 20 	blr
>
> And change to left shift instead of right shift to skip one more
> instruction. This has no impact on the final sum.
>
> 	000002f8 <csum_block_add>:
> 	 2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
> 	 2fc:	5c 84 28 3e 	rotlw   r4,r4,r5
> 	 300:	7c 63 20 14 	addc    r3,r3,r4
> 	 304:	7c 63 01 94 	addze   r3,r3
> 	 308:	4e 80 00 20 	blr

That is ppc64.
What happens on x86-64?

Trying to do the same in the x86 ipcsum code tended to make the code worse.
(Although that test is for an odd length fragment and can just be removed.)

	David

>
> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> ---
>  include/net/checksum.h | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/include/net/checksum.h b/include/net/checksum.h
> index 5218041e5c8f..9badcd5532ef 100644
> --- a/include/net/checksum.h
> +++ b/include/net/checksum.h
> @@ -83,9 +83,7 @@ static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
>  static inline __wsum csum_shift(__wsum sum, int offset)
>  {
>  	/* rotate sum to align it with a 16b boundary */
> -	if (offset & 1)
> -		return (__force __wsum)ror32((__force u32)sum, 8);
> -	return sum;
> +	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
>  }
>
>  static inline __wsum
> --
> 2.34.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
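For readers following along outside the kernel tree, the "no impact on the final sum" claim can be checked with a small stand-alone C sketch (not part of the patch; ror32()/rol32() and fold() below are local stand-ins for the kernel helpers):

#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel's ror32()/rol32() helpers. */
static inline uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

static inline uint32_t rol32(uint32_t word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Fold a 32-bit partial checksum down to 16 bits, like csum_fold(). */
static uint16_t fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Current, branchy form of csum_shift(). */
static uint32_t csum_shift_old(uint32_t sum, int offset)
{
	if (offset & 1)
		return ror32(sum, 8);
	return sum;
}

/* Proposed branch-free form: rotate count is 0 or 8 depending on parity. */
static uint32_t csum_shift_new(uint32_t sum, int offset)
{
	return rol32(sum, (offset & 1) << 3);
}

int main(void)
{
	/* Even offset: both forms leave the sum untouched. */
	assert(csum_shift_old(0x12345678, 2) == csum_shift_new(0x12345678, 2));

	/* Odd offset: the 32-bit values differ (ror vs rol by 8), but the
	 * folded 16-bit checksum is identical, which is what the commit
	 * message relies on. */
	for (uint32_t s = 0; s < 0x20000; s += 0x1234) {
		assert(csum_shift_old(s, 3) == ror32(s, 8));
		assert(csum_shift_new(s, 3) == rol32(s, 8));
		assert(fold(csum_shift_old(s, 3)) == fold(csum_shift_new(s, 3)));
	}
	return 0;
}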
On Sun, Feb 13, 2022 at 02:39:06AM +0000, David Laight wrote:
> From: Christophe Leroy
> > Sent: 11 February 2022 08:48
> >
> > Today's implementation of csum_shift() leads to branching based on
> > parity of 'offset'
> >
> > 	000002f8 <csum_block_add>:
> > 	 2f8:	70 a5 00 01 	andi.   r5,r5,1
> > 	 2fc:	41 a2 00 08 	beq     304 <csum_block_add+0xc>
> > 	 300:	54 84 c0 3e 	rotlwi  r4,r4,24
> > 	 304:	7c 63 20 14 	addc    r3,r3,r4
> > 	 308:	7c 63 01 94 	addze   r3,r3
> > 	 30c:	4e 80 00 20 	blr
> >
> > Use first bit of 'offset' directly as input of the rotation instead of
> > branching.
> >
> > 	000002f8 <csum_block_add>:
> > 	 2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
> > 	 2fc:	20 a5 00 20 	subfic  r5,r5,32
> > 	 300:	5c 84 28 3e 	rotlw   r4,r4,r5
> > 	 304:	7c 63 20 14 	addc    r3,r3,r4
> > 	 308:	7c 63 01 94 	addze   r3,r3
> > 	 30c:	4e 80 00 20 	blr
> >
> > And change to left shift instead of right shift to skip one more
> > instruction. This has no impact on the final sum.
> >
> > 	000002f8 <csum_block_add>:
> > 	 2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
> > 	 2fc:	5c 84 28 3e 	rotlw   r4,r4,r5
> > 	 300:	7c 63 20 14 	addc    r3,r3,r4
> > 	 304:	7c 63 01 94 	addze   r3,r3
> > 	 308:	4e 80 00 20 	blr
>
> That is ppc64.

That is 32-bit powerpc.

> What happens on x86-64?
>
> Trying to do the same in the x86 ipcsum code tended to make the code worse.
> (Although that test is for an odd length fragment and can just be removed.)

In an ideal world the compiler could choose the optimal code sequences
everywhere.  But that won't ever happen, the search space is way too
big.  So compilers just use heuristics, not exhaustive search like
superopt does.  There is a middle way of course, something with directed
searches, and maybe in a few decades systems will be fast enough.  Until
then we will very often see code that is 10% slower and 30% bigger than
necessary.  A single insn more than needed isn't so bad :-)

Making things branch-free is very much worth it here though!


Segher
From: Segher Boessenkool
> Sent: 13 February 2022 09:16
....
>
> > What happens on x86-64?
> >
> > Trying to do the same in the x86 ipcsum code tended to make the code worse.
> > (Although that test is for an odd length fragment and can just be removed.)
>
> In an ideal world the compiler could choose the optimal code sequences
> everywhere.  But that won't ever happen, the search space is way too
> big.  So compilers just use heuristics, not exhaustive search like
> superopt does.  There is a middle way of course, something with directed
> searches, and maybe in a few decades systems will be fast enough.  Until
> then we will very often see code that is 10% slower and 30% bigger than
> necessary.  A single insn more than needed isn't so bad :-)

But it can be a lot more than that.

> Making things branch-free is very much worth it here though!

I tried to find out where 'here' is.

I can't get godbolt to generate anything like that object code
for a call to csum_shift().

I can't actually get it to issue a rotate (x86 or ppc).

I think it is only a single instruction because the compiler
has saved 'offset & 1' much earlier instead of testing
'offset & 1' just prior to the conditional.
It certainly has a nasty habit of doing that pessimisation.

So while it helps a specific call site it may be much
worse in general.

I also suspect that the addc/addze pair could be removed
by passing the old checksum into csum_partial.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
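The godbolt comparison David describes can be reproduced with a self-contained file along these lines (an approximation of csum_block_add() built from the generic csum_add() carry fold and the two csum_shift() variants, using plain uint32_t instead of the kernel's __wsum types; not the exact code used in the thread):

#include <stdint.h>

uint32_t csum_block_add_old(uint32_t csum, uint32_t csum2, int offset)
{
	if (offset & 1)					/* branchy csum_shift() */
		csum2 = (csum2 >> 8) | (csum2 << 24);	/* ror32(csum2, 8) */
	csum += csum2;
	return csum + (csum < csum2);			/* csum_add() end-around carry */
}

uint32_t csum_block_add_new(uint32_t csum, uint32_t csum2, int offset)
{
	unsigned int sh = (offset & 1) << 3;		/* 0 or 8 */
	csum2 = (csum2 << sh) | (csum2 >> ((32 - sh) & 31));	/* rol32(csum2, sh) */
	csum += csum2;
	return csum + (csum < csum2);
}

Building this with "gcc -O2 -S" for different targets (or pasting it into godbolt) makes it easy to see whether a given compiler/architecture turns the branchy form into a conditional rotate, a conditional move, or a real branch.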
On Sun, Feb 13, 2022 at 05:47:52PM +0000, David Laight wrote:
> From: Segher Boessenkool
> > Sent: 13 February 2022 09:16
> > In an ideal world the compiler could choose the optimal code sequences
> > everywhere.  But that won't ever happen, the search space is way too
> > big.  So compilers just use heuristics, not exhaustive search like
> > superopt does.  There is a middle way of course, something with directed
> > searches, and maybe in a few decades systems will be fast enough.  Until
> > then we will very often see code that is 10% slower and 30% bigger than
> > necessary.  A single insn more than needed isn't so bad :-)
>
> But it can be a lot more than that.

Obviously, but that isn't the case here (for powerpc anyway).  My point
here is that you won't ever get ideal generated code from your high-
level code (which is what C is), certainly not for all architectures.
But it *is* possible to get something reasonably good.

> > Making things branch-free is very much worth it here though!
>
> I tried to find out where 'here' is.

I meant "with this patch".  Unpredictable branches are very expensive.
They already were something to worry about on single-issue in-order
processors, but they are much more expensive now.

> I can't get godbolt to generate anything like that object code
> for a call to csum_shift().
>
> I can't actually get it to issue a rotate (x86 or ppc).

All powerpc rotate insns start with "rl", and no other insns do.  There
also are extended mnemonics to ease programming, like "rotlw", which is
just a form of rlwinm (rotlw d,s,n is rlwnm d,s,n,0,31).  Depending on
what tool you use to display binary code it will show you extended
mnemonics for some insns or just the basic insns.

> I think it is only a single instruction because the compiler
> has saved 'offset & 1' much earlier instead of testing
> 'offset & 1' just prior to the conditional.

rlwinm -- "nm" means "and mask".  rlwnm d,s,n,mb,me rotates register s
left by the contents of register n bits, and logical ands it with the
mask from bit mb until bit me.

> It certainly has a nasty habit of doing that pessimisation.

?  Not sure what you mean here.

> I also suspect that the addc/addze pair could be removed
> by passing the old checksum into csum_partial.

Maybe?  Does it not have to return a reduced result here?


Segher
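As an illustration of the rotate-and-mask semantics Segher describes, a small C model of rlwnm might look like the sketch below (not kernel code; bit numbering follows the IBM convention where bit 0 is the most significant bit):

#include <stdint.h>
#include <stdio.h>

/* Illustrative C model of PowerPC rlwnm d,s,n,mb,me: rotate s left by
 * (n & 31) bits, then AND with the mask that has ones from bit mb
 * through bit me inclusive.  rlwinm is the same with an immediate
 * rotate count instead of a register. */
static uint32_t rlwnm(uint32_t s, uint32_t n, unsigned int mb, unsigned int me)
{
	unsigned int sh = n & 31;
	uint32_t rot = (s << sh) | (s >> ((32 - sh) & 31));
	uint32_t mask = 0;

	/* Build the mask from IBM bit mb to bit me (wrapping masks allowed). */
	for (unsigned int b = mb; ; b = (b + 1) & 31) {
		mask |= 1u << (31 - b);
		if (b == me)
			break;
	}
	return rot & mask;
}

int main(void)
{
	/* rotlw d,s,n is rlwnm d,s,n,0,31: full mask, i.e. a pure rotate. */
	printf("%08x\n", (unsigned)rlwnm(0x12345678, 8, 0, 31));  /* 34567812 */

	/* rlwinm r10,r7,3,28,28 from the generated code: (r7 << 3) & 0x8,
	 * which for r7 = offset equals (offset & 1) << 3. */
	printf("%08x\n", (unsigned)rlwnm(0x00000005, 3, 28, 28)); /* 00000008 */
	return 0;
}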
On 13/02/2022 at 18:47, David Laight wrote:
> From: Segher Boessenkool
>> Sent: 13 February 2022 09:16
> ....
>>
>>> What happens on x86-64?
>>>
>>> Trying to do the same in the x86 ipcsum code tended to make the code worse.
>>> (Although that test is for an odd length fragment and can just be removed.)
>>
>> In an ideal world the compiler could choose the optimal code sequences
>> everywhere.  But that won't ever happen, the search space is way too
>> big.  So compilers just use heuristics, not exhaustive search like
>> superopt does.  There is a middle way of course, something with directed
>> searches, and maybe in a few decades systems will be fast enough.  Until
>> then we will very often see code that is 10% slower and 30% bigger than
>> necessary.  A single insn more than needed isn't so bad :-)
>
> But it can be a lot more than that.
>
>> Making things branch-free is very much worth it here though!
>
> I tried to find out where 'here' is.
>
> I can't get godbolt to generate anything like that object code
> for a call to csum_shift().
>
> I can't actually get it to issue a rotate (x86 or ppc).
>
> I think it is only a single instruction because the compiler
> has saved 'offset & 1' much earlier instead of testing
> 'offset & 1' just prior to the conditional.
> It certainly has a nasty habit of doing that pessimisation.
>
> So while it helps a specific call site it may be much
> worse in general.
>

The main user of csum_shift() is csum_and_copy_to_iter().

You clearly see the difference in one of the instances below extracted
from output of objdump -S lib/iov_iter.o:


Without the patch:

	sum = csum_shift(csstate->csum, csstate->off);
    21a8:	92 e1 00 4c 	stw     r23,76(r1)
    21ac:	7c 77 1b 78 	mr      r23,r3
    21b0:	93 01 00 50 	stw     r24,80(r1)
    21b4:	7c b8 2b 78 	mr      r24,r5
    21b8:	93 61 00 5c 	stw     r27,92(r1)
    21bc:	7c db 33 78 	mr      r27,r6
    21c0:	93 81 00 60 	stw     r28,96(r1)
    21c4:	81 05 00 04 	lwz     r8,4(r5)
    21c8:	83 85 00 00 	lwz     r28,0(r5)
}

static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	if (offset & 1)
    21cc:	71 09 00 01 	andi.   r9,r8,1		<== test oddity
    21d0:	41 a2 00 08 	beq     21d8		<== branch
 * @word: value to rotate
 * @shift: bits to roll
 */
static inline __u32 ror32(__u32 word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
    21d4:	57 9c c0 3e 	rotlwi  r28,r28,24	<== rotate
    21d8:	2b 8a 00 03 	cmplwi  cr7,r10,3
    21dc:	41 9e 01 ec 	beq     cr7,23c8 <csum_and_copy_to_iter+0x234>


With the patch:

	sum = csum_shift(csstate->csum, csstate->off);
    21a8:	92 c1 00 48 	stw     r22,72(r1)
	if (unlikely(iov_iter_is_pipe(i)))
    21ac:	28 08 00 03 	cmplwi  r8,3
    21b0:	92 e1 00 4c 	stw     r23,76(r1)
    21b4:	7c 76 1b 78 	mr      r22,r3
    21b8:	93 41 00 58 	stw     r26,88(r1)
    21bc:	7c b7 2b 78 	mr      r23,r5
    21c0:	93 81 00 60 	stw     r28,96(r1)
    21c4:	7c da 33 78 	mr      r26,r6
	sum = csum_shift(csstate->csum, csstate->off);
    21c8:	80 e5 00 04 	lwz     r7,4(r5)
 * @word: value to rotate
 * @shift: bits to roll
 */
static inline __u32 rol32(__u32 word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
    21cc:	81 25 00 00 	lwz     r9,0(r5)
}

static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
    21d0:	54 ea 1f 38 	rlwinm  r10,r7,3,28,28
    21d4:	5d 3c 50 3e 	rotlw   r28,r9,r10	<== rotate
	if (unlikely(iov_iter_is_pipe(i)))
    21d8:	41 82 01 e0 	beq     23b8 <csum_and_copy_to_iter+0x224>


Christophe
From: Christophe Leroy
> Sent: 01 March 2022 10:20
>
> On 13/02/2022 at 18:47, David Laight wrote:
> > From: Segher Boessenkool
> >> Sent: 13 February 2022 09:16
> > ....
> >>
> >>> What happens on x86-64?
> >>>
> >>> Trying to do the same in the x86 ipcsum code tended to make the code worse.
> >>> (Although that test is for an odd length fragment and can just be removed.)
> >>
> >> In an ideal world the compiler could choose the optimal code sequences
> >> everywhere.  But that won't ever happen, the search space is way too
> >> big.  So compilers just use heuristics, not exhaustive search like
> >> superopt does.  There is a middle way of course, something with directed
> >> searches, and maybe in a few decades systems will be fast enough.  Until
> >> then we will very often see code that is 10% slower and 30% bigger than
> >> necessary.  A single insn more than needed isn't so bad :-)
> >
> > But it can be a lot more than that.
> >
> >> Making things branch-free is very much worth it here though!
> >
> > I tried to find out where 'here' is.
> >
> > I can't get godbolt to generate anything like that object code
> > for a call to csum_shift().
> >
> > I can't actually get it to issue a rotate (x86 or ppc).
> >
> > I think it is only a single instruction because the compiler
> > has saved 'offset & 1' much earlier instead of testing
> > 'offset & 1' just prior to the conditional.
> > It certainly has a nasty habit of doing that pessimisation.
> >
> > So while it helps a specific call site it may be much
> > worse in general.
> >
>
> The main user of csum_shift() is csum_and_copy_to_iter().
>
> You clearly see the difference in one of the instances below extracted
> from output of objdump -S lib/iov_iter.o:
>
>
> Without the patch:
>
> 	sum = csum_shift(csstate->csum, csstate->off);
>     21a8:	92 e1 00 4c 	stw     r23,76(r1)
>     21ac:	7c 77 1b 78 	mr      r23,r3
>     21b0:	93 01 00 50 	stw     r24,80(r1)
>     21b4:	7c b8 2b 78 	mr      r24,r5
>     21b8:	93 61 00 5c 	stw     r27,92(r1)
>     21bc:	7c db 33 78 	mr      r27,r6
>     21c0:	93 81 00 60 	stw     r28,96(r1)
>     21c4:	81 05 00 04 	lwz     r8,4(r5)
>     21c8:	83 85 00 00 	lwz     r28,0(r5)
> }
>
> static __always_inline __wsum csum_shift(__wsum sum, int offset)
> {
> 	/* rotate sum to align it with a 16b boundary */
> 	if (offset & 1)
>     21cc:	71 09 00 01 	andi.   r9,r8,1		<== test oddity
>     21d0:	41 a2 00 08 	beq     21d8		<== branch
>  * @word: value to rotate
>  * @shift: bits to roll
>  */
> static inline __u32 ror32(__u32 word, unsigned int shift)
> {
> 	return (word >> (shift & 31)) | (word << ((-shift) & 31));
>     21d4:	57 9c c0 3e 	rotlwi  r28,r28,24	<== rotate
>     21d8:	2b 8a 00 03 	cmplwi  cr7,r10,3
>     21dc:	41 9e 01 ec 	beq     cr7,23c8 <csum_and_copy_to_iter+0x234>
>
>
> With the patch:
>
> 	sum = csum_shift(csstate->csum, csstate->off);
>     21a8:	92 c1 00 48 	stw     r22,72(r1)
> 	if (unlikely(iov_iter_is_pipe(i)))
>     21ac:	28 08 00 03 	cmplwi  r8,3
>     21b0:	92 e1 00 4c 	stw     r23,76(r1)
>     21b4:	7c 76 1b 78 	mr      r22,r3
>     21b8:	93 41 00 58 	stw     r26,88(r1)
>     21bc:	7c b7 2b 78 	mr      r23,r5
>     21c0:	93 81 00 60 	stw     r28,96(r1)
>     21c4:	7c da 33 78 	mr      r26,r6
> 	sum = csum_shift(csstate->csum, csstate->off);
>     21c8:	80 e5 00 04 	lwz     r7,4(r5)
>  * @word: value to rotate
>  * @shift: bits to roll
>  */
> static inline __u32 rol32(__u32 word, unsigned int shift)
> {
> 	return (word << (shift & 31)) | (word >> ((-shift) & 31));
>     21cc:	81 25 00 00 	lwz     r9,0(r5)
> }
>
> static __always_inline __wsum csum_shift(__wsum sum, int offset)
> {
> 	/* rotate sum to align it with a 16b boundary */
> 	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
>     21d0:	54 ea 1f 38 	rlwinm  r10,r7,3,28,28

Right, this all depends on the rlwinm instruction.
I had to look it up, that one shifts r7 (count) left 3 bits
and then masks it with all the bits from 28 to 28 (in some counting scheme).

Try the same code on x86.
The mask and shift have to be separate instructions and it probably
needs a register move (which might be a rename and free).
Whereas the current code generates a conditional move.
(At least, the only rotates in that function are followed by a cmovne.)

So x86 is almost certainly better with the current code.
No idea about arm or anything else people might care about.

All the world isn't ppc.

	David

>     21d4:	5d 3c 50 3e 	rotlw   r28,r9,r10	<== rotate
> 	if (unlikely(iov_iter_is_pipe(i)))
>     21d8:	41 82 01 e0 	beq     23b8 <csum_and_copy_to_iter+0x224>
>
>
> Christophe

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 01/03/2022 at 11:47, David Laight wrote:
> From: Christophe Leroy
>> Sent: 01 March 2022 10:20
>>
>> On 13/02/2022 at 18:47, David Laight wrote:
>>> From: Segher Boessenkool
>>>> Sent: 13 February 2022 09:16
>>> ....
>>>>
>>>>> What happens on x86-64?
>>>>>
>>>>> Trying to do the same in the x86 ipcsum code tended to make the code worse.
>>>>> (Although that test is for an odd length fragment and can just be removed.)
>>>>
>>>> In an ideal world the compiler could choose the optimal code sequences
>>>> everywhere.  But that won't ever happen, the search space is way too
>>>> big.  So compilers just use heuristics, not exhaustive search like
>>>> superopt does.  There is a middle way of course, something with directed
>>>> searches, and maybe in a few decades systems will be fast enough.  Until
>>>> then we will very often see code that is 10% slower and 30% bigger than
>>>> necessary.  A single insn more than needed isn't so bad :-)
>>>
>>> But it can be a lot more than that.
>>>
>>>> Making things branch-free is very much worth it here though!
>>>
>>> I tried to find out where 'here' is.
>>>
>>> I can't get godbolt to generate anything like that object code
>>> for a call to csum_shift().
>>>
>>> I can't actually get it to issue a rotate (x86 or ppc).
>>>
>>> I think it is only a single instruction because the compiler
>>> has saved 'offset & 1' much earlier instead of testing
>>> 'offset & 1' just prior to the conditional.
>>> It certainly has a nasty habit of doing that pessimisation.
>>>
>>> So while it helps a specific call site it may be much
>>> worse in general.
>>>
>>
>> The main user of csum_shift() is csum_and_copy_to_iter().
>>
>> You clearly see the difference in one of the instances below extracted
>> from output of objdump -S lib/iov_iter.o:
>>
>>
>> Without the patch:
>>
>> 	sum = csum_shift(csstate->csum, csstate->off);
>>     21a8:	92 e1 00 4c 	stw     r23,76(r1)
>>     21ac:	7c 77 1b 78 	mr      r23,r3
>>     21b0:	93 01 00 50 	stw     r24,80(r1)
>>     21b4:	7c b8 2b 78 	mr      r24,r5
>>     21b8:	93 61 00 5c 	stw     r27,92(r1)
>>     21bc:	7c db 33 78 	mr      r27,r6
>>     21c0:	93 81 00 60 	stw     r28,96(r1)
>>     21c4:	81 05 00 04 	lwz     r8,4(r5)
>>     21c8:	83 85 00 00 	lwz     r28,0(r5)
>> }
>>
>> static __always_inline __wsum csum_shift(__wsum sum, int offset)
>> {
>> 	/* rotate sum to align it with a 16b boundary */
>> 	if (offset & 1)
>>     21cc:	71 09 00 01 	andi.   r9,r8,1		<== test oddity
>>     21d0:	41 a2 00 08 	beq     21d8		<== branch
>>  * @word: value to rotate
>>  * @shift: bits to roll
>>  */
>> static inline __u32 ror32(__u32 word, unsigned int shift)
>> {
>> 	return (word >> (shift & 31)) | (word << ((-shift) & 31));
>>     21d4:	57 9c c0 3e 	rotlwi  r28,r28,24	<== rotate
>>     21d8:	2b 8a 00 03 	cmplwi  cr7,r10,3
>>     21dc:	41 9e 01 ec 	beq     cr7,23c8 <csum_and_copy_to_iter+0x234>
>>
>>
>> With the patch:
>>
>> 	sum = csum_shift(csstate->csum, csstate->off);
>>     21a8:	92 c1 00 48 	stw     r22,72(r1)
>> 	if (unlikely(iov_iter_is_pipe(i)))
>>     21ac:	28 08 00 03 	cmplwi  r8,3
>>     21b0:	92 e1 00 4c 	stw     r23,76(r1)
>>     21b4:	7c 76 1b 78 	mr      r22,r3
>>     21b8:	93 41 00 58 	stw     r26,88(r1)
>>     21bc:	7c b7 2b 78 	mr      r23,r5
>>     21c0:	93 81 00 60 	stw     r28,96(r1)
>>     21c4:	7c da 33 78 	mr      r26,r6
>> 	sum = csum_shift(csstate->csum, csstate->off);
>>     21c8:	80 e5 00 04 	lwz     r7,4(r5)
>>  * @word: value to rotate
>>  * @shift: bits to roll
>>  */
>> static inline __u32 rol32(__u32 word, unsigned int shift)
>> {
>> 	return (word << (shift & 31)) | (word >> ((-shift) & 31));
>>     21cc:	81 25 00 00 	lwz     r9,0(r5)
>> }
>>
>> static __always_inline __wsum csum_shift(__wsum sum, int offset)
>> {
>> 	/* rotate sum to align it with a 16b boundary */
>> 	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
>>     21d0:	54 ea 1f 38 	rlwinm  r10,r7,3,28,28
>
> Right, this all depends on the rlwinm instruction.
> I had to look it up, that one shifts r7 (count) left 3 bits
> and then masks it with all the bits from 28 to 28 (in some counting scheme).

Yes it does (offset << 3) & 8, which is exactly the same as (offset & 1) << 3.

>
> Try the same code on x86.
> The mask and shift have to be separate instructions and it probably
> needs a register move (which might be a rename and free).

I see

> Whereas the current code generates a conditional move.
> (At least, the only rotates in that function are followed by a cmovne.)

ppc32 doesn't have that. ppc64 has isel, but gcc doesn't seem to use it
either here.

>
> So x86 is almost certainly better with the current code.
> No idea about arm or anything else people might care about.

Looks like ARM also does better code with the generic implementation as
it seems to have some looking like conditional instructions 'rorne' and
'strne'.

static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	if (offset & 1)
    1d28:	e2102001 	ands	r2, r0, #1
    1d2c:	e58d3004 	str	r3, [sp, #4]
 * @word: value to rotate
 * @shift: bits to roll
 */
static inline __u32 ror32(__u32 word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
    1d30:	11a03463 	rorne	r3, r3, #8
    1d34:	158d3004 	strne	r3, [sp, #4]
	if (unlikely(iov_iter_is_pipe(i)))

Whereas with my patch we get

static inline __u32 rol32(__u32 word, unsigned int shift)
{
	return (word << (shift & 31)) | (word >> ((-shift) & 31));
    1984:	e8930006 	ldm	r3, {r1, r2}
}

static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
    1988:	e1a03182 	lsl	r3, r2, #3
    198c:	e2033008 	and	r3, r3, #8
    1990:	e2633020 	rsb	r3, r3, #32
    1994:	e1a01371 	ror	r1, r1, r3
    1998:	e58d1008 	str	r1, [sp, #8]

We get slightly better with a ror32 instead of rol32:

	return (word >> (shift & 31)) | (word << ((-shift) & 31));
    1984:	e8930006 	ldm	r3, {r1, r2}
}

static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	return (__force __wsum)ror32((__force u32)sum, (offset & 1) << 3);
    1988:	e1a03182 	lsl	r3, r2, #3
    198c:	e2033008 	and	r3, r3, #8
    1990:	e1a01371 	ror	r1, r1, r3
    1994:	e58d1008 	str	r1, [sp, #8]

>
> All the world isn't ppc.

Ok, so the solution would be to have an arch specific version of
csum_shift() in the same principle as csum_add().

Christophe
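A rough sketch of what Christophe's suggestion could look like, following the way csum_add() is already overridable per architecture (the HAVE_ARCH_CSUM_SHIFT guard name and the powerpc placement below are assumptions for illustration, not code from this thread):

/* include/net/checksum.h -- generic fallback, only used if the arch
 * does not provide its own csum_shift() (sketch; mirrors the
 * HAVE_ARCH_CSUM_ADD guard used for csum_add(), guard name assumed) */
#ifndef HAVE_ARCH_CSUM_SHIFT
static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	if (offset & 1)
		return (__force __wsum)ror32((__force u32)sum, 8);
	return sum;
}
#endif

/* arch/powerpc/include/asm/checksum.h -- hypothetical branch-free
 * override, compiled to rlwinm + rotlw as shown earlier in the thread */
#define HAVE_ARCH_CSUM_SHIFT
static __always_inline __wsum csum_shift(__wsum sum, int offset)
{
	/* rotate sum to align it with a 16b boundary */
	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
}

That way each architecture keeps whichever form its compiler handles best, and the generic code is unchanged for everyone else.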
From: Christophe Leroy
> Sent: 01 March 2022 11:15
...
> Looks like ARM also does better code with the generic implementation as
> it seems to have some looking like conditional instructions 'rorne' and
> 'strne'.

In arm32 (and I think arm64) every instruction is conditional.

> static __always_inline __wsum csum_shift(__wsum sum, int offset)
> {
> 	/* rotate sum to align it with a 16b boundary */
> 	if (offset & 1)
>     1d28:	e2102001 	ands	r2, r0, #1
>     1d2c:	e58d3004 	str	r3, [sp, #4]
>  * @word: value to rotate
>  * @shift: bits to roll
>  */
> static inline __u32 ror32(__u32 word, unsigned int shift)
> {
> 	return (word >> (shift & 31)) | (word << ((-shift) & 31));
>     1d30:	11a03463 	rorne	r3, r3, #8
>     1d34:	158d3004 	strne	r3, [sp, #4]
> 	if (unlikely(iov_iter_is_pipe(i)))

There is a spare 'str' that a minor code change would probably remove.
Likely not helped by registers being spilled to stack.

ISTR arm32 having a reasonable number of registers and then a whole
load of them being stolen by the implementation.
(I'm sure I remember stack limit and thread base...)
So the compiler doesn't get that many to play with.

Not quite as bad as nios2 - where r2 and r3 are 'reserved for the
assembler' (as they probably are on MIPS) but the nios2 assembler
doesn't ever need to use them!

> ...
> Ok, so the solution would be to have an arch specific version of
> csum_shift() in the same principle as csum_add().

Probably.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Tue, Mar 01, 2022 at 11:41:06AM +0000, David Laight wrote:
> From: Christophe Leroy
> > Sent: 01 March 2022 11:15
> ...
> > Looks like ARM also does better code with the generic implementation as
> > it seems to have some looking like conditional instructions 'rorne' and
> > 'strne'.
>
> In arm32 (and I think arm64) every instruction is conditional.

Almost every instruction in arm32.  There are a number of unconditional
instructions that were introduced.
diff --git a/include/net/checksum.h b/include/net/checksum.h
index 5218041e5c8f..9badcd5532ef 100644
--- a/include/net/checksum.h
+++ b/include/net/checksum.h
@@ -83,9 +83,7 @@ static inline __sum16 csum16_sub(__sum16 csum, __be16 addend)
 static inline __wsum csum_shift(__wsum sum, int offset)
 {
 	/* rotate sum to align it with a 16b boundary */
-	if (offset & 1)
-		return (__force __wsum)ror32((__force u32)sum, 8);
-	return sum;
+	return (__force __wsum)rol32((__force u32)sum, (offset & 1) << 3);
 }
 
 static inline __wsum
Today's implementation of csum_shift() leads to branching based on
parity of 'offset'

	000002f8 <csum_block_add>:
	 2f8:	70 a5 00 01 	andi.   r5,r5,1
	 2fc:	41 a2 00 08 	beq     304 <csum_block_add+0xc>
	 300:	54 84 c0 3e 	rotlwi  r4,r4,24
	 304:	7c 63 20 14 	addc    r3,r3,r4
	 308:	7c 63 01 94 	addze   r3,r3
	 30c:	4e 80 00 20 	blr

Use first bit of 'offset' directly as input of the rotation instead of
branching.

	000002f8 <csum_block_add>:
	 2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
	 2fc:	20 a5 00 20 	subfic  r5,r5,32
	 300:	5c 84 28 3e 	rotlw   r4,r4,r5
	 304:	7c 63 20 14 	addc    r3,r3,r4
	 308:	7c 63 01 94 	addze   r3,r3
	 30c:	4e 80 00 20 	blr

And change to left shift instead of right shift to skip one more
instruction. This has no impact on the final sum.

	000002f8 <csum_block_add>:
	 2f8:	54 a5 1f 38 	rlwinm  r5,r5,3,28,28
	 2fc:	5c 84 28 3e 	rotlw   r4,r4,r5
	 300:	7c 63 20 14 	addc    r3,r3,r4
	 304:	7c 63 01 94 	addze   r3,r3
	 308:	4e 80 00 20 	blr

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 include/net/checksum.h | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)