Message ID | 20211111181025.2139131-1-eric.dumazet@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | [v1] x86/csum: rewrite csum_partial() |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Thu, Nov 11, 2021 at 10:10 AM Eric Dumazet <eric.dumazet@gmail.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > > With more NIC supporting CHECKSUM_COMPLETE, and IPv6 being widely used. > csum_partial() is heavily used with small amount of bytes, > and is consuming many cycles. > > IPv6 header size for instance is 40 bytes. > > Another thing to consider is that NET_IP_ALIGN is 0 on x86, > meaning that network headers are not word-aligned, unless > the driver forces this. > > This means that csum_partial() fetches one u16 > to 'align the buffer', then perform seven u64 additions > with carry in a loop, then a remaining u32, then a remaining u16. > > With this new version, we perform 10 u32 adds with carry, to > avoid the expensive 64->32 transformation. > > Also note that this avoids loops for less than ~60 bytes. > > Tested on various cpus, all of them show a big reduction in > csum_partial() cost (by 50 to 75 %) > > v2: - removed the hard-coded switch(), as it was not RETPOLINE aware. > - removed the final add32_with_carry() that we were doing > in csum_partial(), we can simply pass @sum to do_csum() > > Signed-off-by: Eric Dumazet <edumazet@google.com> > Cc: Alexander Duyck <alexander.duyck@gmail.com> > Cc: Peter Zijlstra <peterz@infradead.org> > --- > arch/x86/lib/csum-partial_64.c | 151 +++++++++++++++++---------------- > 1 file changed, 76 insertions(+), 75 deletions(-) > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c > index e7925d668b680269fb2442766deaf416dc42f9a1..910806a1b954c5fed90020191143d16aec74bf0a 100644 > --- a/arch/x86/lib/csum-partial_64.c > +++ b/arch/x86/lib/csum-partial_64.c > @@ -21,97 +21,99 @@ static inline unsigned short from32to16(unsigned a) > } > > /* > - * Do a 64-bit checksum on an arbitrary memory area. > + * Do a checksum on an arbitrary memory area. > * Returns a 32bit checksum. > * > * This isn't as time critical as it used to be because many NICs > * do hardware checksumming these days. > - * > - * Things tried and found to not make it faster: > - * Manual Prefetching > - * Unrolling to an 128 bytes inner loop. > - * Using interleaving with more registers to break the carry chains. > + * > + * Still, with CHECKSUM_COMPLETE this is called to compute > + * checksums on IPv6 headers (40 bytes) and other small parts. > */ > -static unsigned do_csum(const unsigned char *buff, unsigned len) > +static unsigned do_csum(const unsigned char *buff, unsigned len, u32 result) > { > - unsigned odd, count; > - unsigned long result = 0; > + unsigned odd; > > - if (unlikely(len == 0)) > - return result; > odd = 1 & (unsigned long) buff; > if (unlikely(odd)) { > + if (unlikely(len == 0)) > + return result; > result = *buff << 8; > len--; > buff++; > } It might be worthwhile to beef up the odd check to account for anything 7 bytes or less. To address it you could do something along the lines of: unaligned = 7 & (unsigned long) buff; if (unaligned) { shift = unaligned * 8; temp64 = (*(unsigned long)buff >> shift) << shift; buff += 8 - unaligned; if (len < 8 - unaligned) { shift = (8 - len - unaligned) * 8; temp64 <<= shift; temp64 >>= shift; len = 0; } else { len -= 8 - unaligned; } result += temp64; result += result < temp64; } Then the check for odd at the end would have to be updated to check for the lsb of the "unaligned" value, or you could just do away with the conditional check and just rotate the final result by "unaligned * 8" before folding the value down to 32b. > - count = len >> 1; /* nr of 16-bit words.. 
*/ > - if (count) { > - if (2 & (unsigned long) buff) { > - result += *(unsigned short *)buff; > - count--; > - len -= 2; > - buff += 2; > - } > - count >>= 1; /* nr of 32-bit words.. */ > - if (count) { > - unsigned long zero; > - unsigned count64; > - if (4 & (unsigned long) buff) { > - result += *(unsigned int *) buff; > - count--; > - len -= 4; > - buff += 4; > - } > - count >>= 1; /* nr of 64-bit words.. */ > - > - /* main loop using 64byte blocks */ > - zero = 0; > - count64 = count >> 3; > - while (count64) { > - asm("addq 0*8(%[src]),%[res]\n\t" > - "adcq 1*8(%[src]),%[res]\n\t" > - "adcq 2*8(%[src]),%[res]\n\t" > - "adcq 3*8(%[src]),%[res]\n\t" > - "adcq 4*8(%[src]),%[res]\n\t" > - "adcq 5*8(%[src]),%[res]\n\t" > - "adcq 6*8(%[src]),%[res]\n\t" > - "adcq 7*8(%[src]),%[res]\n\t" > - "adcq %[zero],%[res]" > - : [res] "=r" (result) > - : [src] "r" (buff), [zero] "r" (zero), > - "[res]" (result)); > - buff += 64; > - count64--; > - } > - > - /* last up to 7 8byte blocks */ > - count %= 8; > - while (count) { > - asm("addq %1,%0\n\t" > - "adcq %2,%0\n" > - : "=r" (result) > - : "m" (*(unsigned long *)buff), > - "r" (zero), "0" (result)); > - --count; > - buff += 8; > - } > - result = add32_with_carry(result>>32, > - result&0xffffffff); > - > - if (len & 4) { > - result += *(unsigned int *) buff; > - buff += 4; > - } > - } > + if (unlikely(len >= 64)) { > + unsigned long temp64 = result; > + do { > + asm(" addq 0*8(%[src]),%[res]\n" > + " adcq 1*8(%[src]),%[res]\n" > + " adcq 2*8(%[src]),%[res]\n" > + " adcq 3*8(%[src]),%[res]\n" > + " adcq 4*8(%[src]),%[res]\n" > + " adcq 5*8(%[src]),%[res]\n" > + " adcq 6*8(%[src]),%[res]\n" > + " adcq 7*8(%[src]),%[res]\n" > + " adcq $0,%[res]" > + : [res] "=r" (temp64) > + : [src] "r" (buff), "[res]" (temp64) > + : "memory"); > + buff += 64; > + len -= 64; > + } while (len >= 64); I wonder if we couldn't optimize this loop by tracking buff versus another pointer instead of updating and checking the length. Basically just update it so that we have something like: trailer = len & 63; end = buf + len - trailer; do { ... buff += 64; } while (buff < end); > + result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff); Squashing this to 32b here seems a bit premature. As I mentioned before an add w/ carry is going to be one per cycle as it serializes on the carry flag, so if you are doing it 4 bytes at a time you can probably get the same or better by using adcq for the 32, 16, and 8B lengths. 
> + } > + if (len & 32) { > + asm(" addl 0*4(%[src]),%[res]\n" > + " adcl 1*4(%[src]),%[res]\n" > + " adcl 2*4(%[src]),%[res]\n" > + " adcl 3*4(%[src]),%[res]\n" > + " adcl 4*4(%[src]),%[res]\n" > + " adcl 5*4(%[src]),%[res]\n" > + " adcl 6*4(%[src]),%[res]\n" > + " adcl 7*4(%[src]),%[res]\n" > + " adcl $0,%[res]" > + : [res] "=r" (result) > + : [src] "r" (buff), "[res]" (result) > + : "memory"); > + buff += 32; > + } > + if (len & 16) { > + asm(" addl 0*4(%[src]),%[res]\n" > + " adcl 1*4(%[src]),%[res]\n" > + " adcl 2*4(%[src]),%[res]\n" > + " adcl 3*4(%[src]),%[res]\n" > + " adcl $0,%[res]" > + : [res] "=r" (result) > + : [src] "r" (buff), "[res]" (result) > + : "memory"); > + buff += 16; > + } > + if (len & 8) { > + asm(" addl 0*4(%[src]),%[res]\n" > + " adcl 1*4(%[src]),%[res]\n" > + " adcl $0,%[res]" > + : [res] "=r" (result) > + : [src] "r" (buff), "[res]" (result) > + : "memory"); > + buff += 8; > + } Alternatively we could look at accumulating the 32b values into a 64b register to get rid of the need to monitor the carry flag and that may improve the speed of things on architectures such as Haswell or newer. > + if (len & 4) { > + asm(" addl 0*4(%[src]),%[res]\n" > + " adcl $0,%[res]\n" > + : [res] "=r" (result) > + : [src] "r" (buff), "[res]" (result) > + : "memory"); > + buff += 4; > + } > + if (len & 3U) { > + result = from32to16(result); > if (len & 2) { > result += *(unsigned short *) buff; > buff += 2; > } > + if (len & 1) > + result += *buff; > } For values 7 through 1 I wonder if you wouldn't be better served by just doing a single QWORD read and a pair of shifts. Something along the lines of: if (len) { shift = (8 - len) * 8; temp64 = (*(unsigned long)buff << shift) >> shift; result += temp64; result += result < temp64; } > - if (len & 1) > - result += *buff; > - result = add32_with_carry(result>>32, result & 0xffffffff); > if (unlikely(odd)) { > result = from32to16(result); > result = ((result >> 8) & 0xff) | ((result & 0xff) << 8); > @@ -133,8 +135,7 @@ static unsigned do_csum(const unsigned char *buff, unsigned len) > */ > __wsum csum_partial(const void *buff, int len, __wsum sum) > { > - return (__force __wsum)add32_with_carry(do_csum(buff, len), > - (__force u32)sum); > + return (__force __wsum)do_csum(buff, len, (__force u32)sum); > } > EXPORT_SYMBOL(csum_partial); > > -- > 2.34.0.rc1.387.gb447b232ab-goog >
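A minimal sketch of the pointer-comparison main loop suggested above, reusing the patch's temp64/result variables (the "end" local and the exact masking are illustrative, not from any posted version): the loop tests the source pointer against a precomputed end instead of decrementing and re-testing len on every iteration.

	if (unlikely(len >= 64)) {
		unsigned long temp64 = result;
		const unsigned char *end = buff + (len & ~63U);

		do {
			asm("	addq 0*8(%[src]),%[res]\n"
			    "	adcq 1*8(%[src]),%[res]\n"
			    "	adcq 2*8(%[src]),%[res]\n"
			    "	adcq 3*8(%[src]),%[res]\n"
			    "	adcq 4*8(%[src]),%[res]\n"
			    "	adcq 5*8(%[src]),%[res]\n"
			    "	adcq 6*8(%[src]),%[res]\n"
			    "	adcq 7*8(%[src]),%[res]\n"
			    "	adcq $0,%[res]"
			    : [res] "+r" (temp64)
			    : [src] "r" (buff)
			    : "memory");
			buff += 64;
		} while (buff < end);
		len &= 63;	/* only the low bits are looked at below */
		result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
	}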
On Thu, Nov 11, 2021 at 1:56 PM Alexander Duyck <alexander.duyck@gmail.com> wrote: > > On Thu, Nov 11, 2021 at 10:10 AM Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > From: Eric Dumazet <edumazet@google.com> > > > > With more NIC supporting CHECKSUM_COMPLETE, and IPv6 being widely used. > > csum_partial() is heavily used with small amount of bytes, > > and is consuming many cycles. > > > > IPv6 header size for instance is 40 bytes. > > > > Another thing to consider is that NET_IP_ALIGN is 0 on x86, > > meaning that network headers are not word-aligned, unless > > the driver forces this. > > > > This means that csum_partial() fetches one u16 > > to 'align the buffer', then perform seven u64 additions > > with carry in a loop, then a remaining u32, then a remaining u16. > > > > With this new version, we perform 10 u32 adds with carry, to > > avoid the expensive 64->32 transformation. > > > > Also note that this avoids loops for less than ~60 bytes. > > > > Tested on various cpus, all of them show a big reduction in > > csum_partial() cost (by 50 to 75 %) > > > > v2: - removed the hard-coded switch(), as it was not RETPOLINE aware. > > - removed the final add32_with_carry() that we were doing > > in csum_partial(), we can simply pass @sum to do_csum() > > > > Signed-off-by: Eric Dumazet <edumazet@google.com> > > Cc: Alexander Duyck <alexander.duyck@gmail.com> > > Cc: Peter Zijlstra <peterz@infradead.org> > > --- > > arch/x86/lib/csum-partial_64.c | 151 +++++++++++++++++---------------- > > 1 file changed, 76 insertions(+), 75 deletions(-) > > > > diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c > > index e7925d668b680269fb2442766deaf416dc42f9a1..910806a1b954c5fed90020191143d16aec74bf0a 100644 > > --- a/arch/x86/lib/csum-partial_64.c > > +++ b/arch/x86/lib/csum-partial_64.c > > @@ -21,97 +21,99 @@ static inline unsigned short from32to16(unsigned a) > > } > > > > /* > > - * Do a 64-bit checksum on an arbitrary memory area. > > + * Do a checksum on an arbitrary memory area. > > * Returns a 32bit checksum. > > * > > * This isn't as time critical as it used to be because many NICs > > * do hardware checksumming these days. > > - * > > - * Things tried and found to not make it faster: > > - * Manual Prefetching > > - * Unrolling to an 128 bytes inner loop. > > - * Using interleaving with more registers to break the carry chains. > > + * > > + * Still, with CHECKSUM_COMPLETE this is called to compute > > + * checksums on IPv6 headers (40 bytes) and other small parts. > > */ > > -static unsigned do_csum(const unsigned char *buff, unsigned len) > > +static unsigned do_csum(const unsigned char *buff, unsigned len, u32 result) > > { > > - unsigned odd, count; > > - unsigned long result = 0; > > + unsigned odd; > > > > - if (unlikely(len == 0)) > > - return result; > > odd = 1 & (unsigned long) buff; > > if (unlikely(odd)) { > > + if (unlikely(len == 0)) > > + return result; > > result = *buff << 8; > > len--; > > buff++; > > } > > It might be worthwhile to beef up the odd check to account for > anything 7 bytes or less. To address it you could do something along > the lines of: > unaligned = 7 & (unsigned long) buff; > if (unaligned) { > shift = unaligned * 8; > temp64 = (*(unsigned long)buff >> shift) << shift; What happens if len=={1..7} ? KASAN will complain. 
> buff += 8 - unaligned; > if (len < 8 - unaligned) { > shift = (8 - len - unaligned) * 8; > temp64 <<= shift; > temp64 >>= shift; > len = 0; > } else { > len -= 8 - unaligned; > } > result += temp64; > result += result < temp64; Also this will trigger all the time because NET_IP_ALIGN == 0 > } > > Then the check for odd at the end would have to be updated to check > for the lsb of the "unaligned" value, or you could just do away with > the conditional check and just rotate the final result by "unaligned * > 8" before folding the value down to 32b. > > > - count = len >> 1; /* nr of 16-bit words.. */ > > - if (count) { > > - if (2 & (unsigned long) buff) { > > - result += *(unsigned short *)buff; > > - count--; > > - len -= 2; > > - buff += 2; > > - } > > - count >>= 1; /* nr of 32-bit words.. */ > > - if (count) { > > - unsigned long zero; > > - unsigned count64; > > - if (4 & (unsigned long) buff) { > > - result += *(unsigned int *) buff; > > - count--; > > - len -= 4; > > - buff += 4; > > - } > > - count >>= 1; /* nr of 64-bit words.. */ > > - > > - /* main loop using 64byte blocks */ > > - zero = 0; > > - count64 = count >> 3; > > - while (count64) { > > - asm("addq 0*8(%[src]),%[res]\n\t" > > - "adcq 1*8(%[src]),%[res]\n\t" > > - "adcq 2*8(%[src]),%[res]\n\t" > > - "adcq 3*8(%[src]),%[res]\n\t" > > - "adcq 4*8(%[src]),%[res]\n\t" > > - "adcq 5*8(%[src]),%[res]\n\t" > > - "adcq 6*8(%[src]),%[res]\n\t" > > - "adcq 7*8(%[src]),%[res]\n\t" > > - "adcq %[zero],%[res]" > > - : [res] "=r" (result) > > - : [src] "r" (buff), [zero] "r" (zero), > > - "[res]" (result)); > > - buff += 64; > > - count64--; > > - } > > - > > - /* last up to 7 8byte blocks */ > > - count %= 8; > > - while (count) { > > - asm("addq %1,%0\n\t" > > - "adcq %2,%0\n" > > - : "=r" (result) > > - : "m" (*(unsigned long *)buff), > > - "r" (zero), "0" (result)); > > - --count; > > - buff += 8; > > - } > > - result = add32_with_carry(result>>32, > > - result&0xffffffff); > > - > > - if (len & 4) { > > - result += *(unsigned int *) buff; > > - buff += 4; > > - } > > - } > > + if (unlikely(len >= 64)) { > > + unsigned long temp64 = result; > > + do { > > + asm(" addq 0*8(%[src]),%[res]\n" > > + " adcq 1*8(%[src]),%[res]\n" > > + " adcq 2*8(%[src]),%[res]\n" > > + " adcq 3*8(%[src]),%[res]\n" > > + " adcq 4*8(%[src]),%[res]\n" > > + " adcq 5*8(%[src]),%[res]\n" > > + " adcq 6*8(%[src]),%[res]\n" > > + " adcq 7*8(%[src]),%[res]\n" > > + " adcq $0,%[res]" > > + : [res] "=r" (temp64) > > + : [src] "r" (buff), "[res]" (temp64) > > + : "memory"); > > + buff += 64; > > + len -= 64; > > + } while (len >= 64); > > I wonder if we couldn't optimize this loop by tracking buff versus > another pointer instead of updating and checking the length. > > Basically just update it so that we have something like: > trailer = len & 63; > end = buf + len - trailer; > do { > ... > buff += 64; > } while (buff < end); Yeah, but I wonder why the compiler could not do that for us ? > > > + result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff); > > Squashing this to 32b here seems a bit premature. As I mentioned > before an add w/ carry is going to be one per cycle as it serializes > on the carry flag, so if you are doing it 4 bytes at a time you can > probably get the same or better by using adcq for the 32, 16, and 8B > lengths. That would work for a specialized fast_csum_40bytes(), but not if we have to add checks about length == 40 ? 
The squash happens once, _if_ we had to sum more than 64 bytes, and needs to be done because of the following operations. > > > + } > > + if (len & 32) { > > + asm(" addl 0*4(%[src]),%[res]\n" > > + " adcl 1*4(%[src]),%[res]\n" > > + " adcl 2*4(%[src]),%[res]\n" > > + " adcl 3*4(%[src]),%[res]\n" > > + " adcl 4*4(%[src]),%[res]\n" > > + " adcl 5*4(%[src]),%[res]\n" > > + " adcl 6*4(%[src]),%[res]\n" > > + " adcl 7*4(%[src]),%[res]\n" > > + " adcl $0,%[res]" > > + : [res] "=r" (result) > > + : [src] "r" (buff), "[res]" (result) > > + : "memory"); > > + buff += 32; > > + } > > + if (len & 16) { > > + asm(" addl 0*4(%[src]),%[res]\n" > > + " adcl 1*4(%[src]),%[res]\n" > > + " adcl 2*4(%[src]),%[res]\n" > > + " adcl 3*4(%[src]),%[res]\n" > > + " adcl $0,%[res]" > > + : [res] "=r" (result) > > + : [src] "r" (buff), "[res]" (result) > > + : "memory"); > > + buff += 16; > > + } > > + if (len & 8) { > > + asm(" addl 0*4(%[src]),%[res]\n" > > + " adcl 1*4(%[src]),%[res]\n" > > + " adcl $0,%[res]" > > + : [res] "=r" (result) > > + : [src] "r" (buff), "[res]" (result) > > + : "memory"); > > + buff += 8; > > + } > > Alternatively we could look at accumulating the 32b values into a 64b > register to get rid of the need to monitor the carry flag and that may > improve the speed of things on architectures such as Haswell or newer. That requires an extra add32_with_carry(), which unfortunately made the thing slower for me. I even hardcoded an inline fast_csum_40bytes() and got best results with the 10+1 addl, instead of (5 + 1) acql + mov (needing one extra register) + shift + addl + adcl So that was not because of the clang bug I mentioned earlier. > > > + if (len & 4) { > > + asm(" addl 0*4(%[src]),%[res]\n" > > + " adcl $0,%[res]\n" > > + : [res] "=r" (result) > > + : [src] "r" (buff), "[res]" (result) > > + : "memory"); > > + buff += 4; > > + } > > + if (len & 3U) { > > + result = from32to16(result); > > if (len & 2) { > > result += *(unsigned short *) buff; > > buff += 2; > > } > > + if (len & 1) > > + result += *buff; > > } > > For values 7 through 1 I wonder if you wouldn't be better served by > just doing a single QWORD read and a pair of shifts. Something along > the lines of: > if (len) { > shift = (8 - len) * 8; > temp64 = (*(unsigned long)buff << shift) >> shift; > result += temp64; > result += result < temp64; > } Again, KASAN will not be happy. > > > - if (len & 1) > > - result += *buff; > > - result = add32_with_carry(result>>32, result & 0xffffffff); > > if (unlikely(odd)) { > > result = from32to16(result); > > result = ((result >> 8) & 0xff) | ((result & 0xff) << 8); > > @@ -133,8 +135,7 @@ static unsigned do_csum(const unsigned char *buff, unsigned len) > > */ > > __wsum csum_partial(const void *buff, int len, __wsum sum) > > { > > - return (__force __wsum)add32_with_carry(do_csum(buff, len), > > - (__force u32)sum); > > + return (__force __wsum)do_csum(buff, len, (__force u32)sum); > > } > > EXPORT_SYMBOL(csum_partial); > > > > -- > > 2.34.0.rc1.387.gb447b232ab-goog > >
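For reference, a sketch of the "10+1 addl" 40-byte helper Eric describes above (the name fast_csum_40bytes and its exact shape are illustrative, not code from the series): ten 32-bit adds with carry over the 40-byte IPv6 header, plus a final adc of zero to fold the last carry back in.

static u32 fast_csum_40bytes(const void *buff, u32 result)
{
	asm("	addl 0*4(%[src]),%[res]\n"
	    "	adcl 1*4(%[src]),%[res]\n"
	    "	adcl 2*4(%[src]),%[res]\n"
	    "	adcl 3*4(%[src]),%[res]\n"
	    "	adcl 4*4(%[src]),%[res]\n"
	    "	adcl 5*4(%[src]),%[res]\n"
	    "	adcl 6*4(%[src]),%[res]\n"
	    "	adcl 7*4(%[src]),%[res]\n"
	    "	adcl 8*4(%[src]),%[res]\n"
	    "	adcl 9*4(%[src]),%[res]\n"
	    "	adcl $0,%[res]"
	    : [res] "+r" (result)
	    : [src] "r" (buff)
	    : "memory");
	return result;
}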
On Thu, Nov 11, 2021 at 02:30:50PM -0800, Eric Dumazet wrote: > > For values 7 through 1 I wonder if you wouldn't be better served by > > just doing a single QWORD read and a pair of shifts. Something along > > the lines of: > > if (len) { > > shift = (8 - len) * 8; > > temp64 = (*(unsigned long)buff << shift) >> shift; > > result += temp64; > > result += result < temp64; > > } > > Again, KASAN will not be happy. If you do it in asm, kasan will not know, so who cares :-) as long as the load is aligned, loading beyond @len shouldn't be a problem, otherwise there's load_unaligned_zeropad().
On Fri, Nov 12, 2021 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Thu, Nov 11, 2021 at 02:30:50PM -0800, Eric Dumazet wrote: > > > For values 7 through 1 I wonder if you wouldn't be better served by > > > just doing a single QWORD read and a pair of shifts. Something along > > > the lines of: > > > if (len) { > > > shift = (8 - len) * 8; > > > temp64 = (*(unsigned long)buff << shift) >> shift; > > > result += temp64; > > > result += result < temp64; > > > } > > > > Again, KASAN will not be happy. > > If you do it in asm, kasan will not know, so who cares :-) as long as > the load is aligned, loading beyond @len shouldn't be a problem, > otherwise there's load_unaligned_zeropad(). OK, but then in this case we have to align buff on a qword boundary, or risk crossing a page boundary. So this stuff has to be done at the beginning, and at the end. And with NET_IP_ALIGN==0, this will unfortunately trigger for the 40-byte IPV6 header. IPv6 header : <2 bytes before qword boundary><4 * 8 bytes> < 6 bytes at trail> I will try, but I have some doubts it can save one or two cycles...
On Fri, Nov 12, 2021 at 06:21:38AM -0800, Eric Dumazet wrote: > On Fri, Nov 12, 2021 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > On Thu, Nov 11, 2021 at 02:30:50PM -0800, Eric Dumazet wrote: > > > > For values 7 through 1 I wonder if you wouldn't be better served by > > > > just doing a single QWORD read and a pair of shifts. Something along > > > > the lines of: > > > > if (len) { > > > > shift = (8 - len) * 8; > > > > temp64 = (*(unsigned long)buff << shift) >> shift; > > > > result += temp64; > > > > result += result < temp64; > > > > } > > > > > > Again, KASAN will not be happy. > > > > If you do it in asm, kasan will not know, so who cares :-) as long as > > the load is aligned, loading beyond @len shouldn't be a problem, > > otherwise there's load_unaligned_zeropad(). > > OK, but then in this case we have to align buff on qword boundary, > or risk crossing page boundary. Read the above, use load_unaligned_zeropad(), it's made for exactly that case. Slightly related, see: https://lkml.kernel.org/r/20211110101326.141775772@infradead.org
On Fri, Nov 12, 2021 at 7:25 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Fri, Nov 12, 2021 at 06:21:38AM -0800, Eric Dumazet wrote: > > On Fri, Nov 12, 2021 at 1:13 AM Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > On Thu, Nov 11, 2021 at 02:30:50PM -0800, Eric Dumazet wrote: > > > > > For values 7 through 1 I wonder if you wouldn't be better served by > > > > > just doing a single QWORD read and a pair of shifts. Something along > > > > > the lines of: > > > > > if (len) { > > > > > shift = (8 - len) * 8; > > > > > temp64 = (*(unsigned long)buff << shift) >> shift; > > > > > result += temp64; > > > > > result += result < temp64; > > > > > } > > > > > > > > Again, KASAN will not be happy. > > > > > > If you do it in asm, kasan will not know, so who cares :-) as long as > > > the load is aligned, loading beyond @len shouldn't be a problem, > > > otherwise there's load_unaligned_zeropad(). > > > > OK, but then in this case we have to align buff on qword boundary, > > or risk crossing page boundary. > > Read the above, use load_unaligned_zeropad(), it's made for exactly that > case. > > Slightly related, see: > > https://lkml.kernel.org/r/20211110101326.141775772@infradead.org Ah right you are, thanks a lot !
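For the curious, a sketch of how the trailing 1..7 bytes could be folded in with load_unaligned_zeropad(), assuming a little-endian build and a 64-bit accumulator (temp64) still live at that point (this is not the code that was posted): the helper's exception fixup zero-pads a read that would spill onto an unmapped page, and the mask discards the bytes beyond len.

	/* needs #include <asm/word-at-a-time.h> */
	if (len) {
		unsigned long trail;

		/* 8-byte read, then keep only the low 'len' bytes */
		trail = load_unaligned_zeropad(buff) &
			((1UL << (8 * len)) - 1);
		temp64 += trail;
		temp64 += (temp64 < trail);	/* end-around carry */
	}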
From: Alexander Duyck > Sent: 11 November 2021 21:56 ... > It might be worthwhile to beef up the odd check to account for > anything 7 bytes or less. To address it you could do something along > the lines of: > unaligned = 7 & (unsigned long) buff; > if (unaligned) { > shift = unaligned * 8; > temp64 = (*(unsigned long)buff >> shift) << shift; > buff += 8 - unaligned; > if (len < 8 - unaligned) { > shift = (8 - len - unaligned) * 8; > temp64 <<= shift; > temp64 >>= shift; > len = 0; > } else { > len -= 8 - unaligned; > } > result += temp64; > result += result < temp64; > } I tried doing that. Basically it is likely to take longer than just doing the memory reads. The register dependency chain is just too long. David
From: Eric Dumazet > Sent: 11 November 2021 22:31 .. > That requires an extra add32_with_carry(), which unfortunately made > the thing slower for me. > > I even hardcoded an inline fast_csum_40bytes() and got best results > with the 10+1 addl, > instead of > (5 + 1) acql + mov (needing one extra register) + shift + addl + adcl Did you try something like: sum = buf[0]; val = buf[1]: asm( add64 sum, val adc64 sum, buf[2] adc64 sum, buf[3] adc64 sum, buf[4] adc64 sum, 0 } sum_hi = sum >> 32; asm( add32 sum, sum_hi adc32 sum, 0 ) Splitting it like that should allow the compiler to insert additional instructions between the two 'adc' blocks making it much more likely that the cpu will schedule them in parallel with other instructions. The extra 5 adc32 have to add 5 clocks (register dependency chain). The 'mov' ought to be free (register rename) and the extra shift and adds one clock each - so 3 (maybe 4) clocks. So the 64bit version really ought to be faster even as a single asm block. Trying to second-guess the x86 cpu is largely impossible :-) Oh, and then try the benchmarks of one of the 64bit Atom cpus used in embedded systems.... We've some 4core+hyperthreading ones that aren't exactly slow. David
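Spelled out as gcc-style inline asm, the "(5 + 1) adcq" variant being compared might look like the sketch below (csum_40bytes is a hypothetical helper, the counterpart of the 10+1 addl sketch earlier; the initial 32-bit sum is folded in for free as the starting accumulator value, and the 40-byte buffer is assumed readable as five u64s):

static u32 csum_40bytes(const void *buff, u32 sum)
{
	unsigned long temp64 = sum;

	asm("	addq 0*8(%[src]),%[res]\n"
	    "	adcq 1*8(%[src]),%[res]\n"
	    "	adcq 2*8(%[src]),%[res]\n"
	    "	adcq 3*8(%[src]),%[res]\n"
	    "	adcq 4*8(%[src]),%[res]\n"
	    "	adcq $0,%[res]"
	    : [res] "+r" (temp64)
	    : [src] "r" (buff)
	    : "memory");

	/* 64 -> 32 fold with end-around carry (David's add32/adc32 pair) */
	return add32_with_carry(temp64 >> 32, temp64 & 0xffffffff);
}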
On Sun, Nov 14, 2021 at 6:44 AM David Laight <David.Laight@aculab.com> wrote: > > From: Eric Dumazet > > Sent: 11 November 2021 22:31 > .. > > That requires an extra add32_with_carry(), which unfortunately made > > the thing slower for me. > > > > I even hardcoded an inline fast_csum_40bytes() and got best results > > with the 10+1 addl, > > instead of > > (5 + 1) acql + mov (needing one extra register) + shift + addl + adcl > > Did you try something like: > sum = buf[0]; > val = buf[1]: > asm( > add64 sum, val > adc64 sum, buf[2] > adc64 sum, buf[3] > adc64 sum, buf[4] > adc64 sum, 0 > } > sum_hi = sum >> 32; > asm( > add32 sum, sum_hi > adc32 sum, 0 > ) This is what I tried. but the last part was using add32_with_carry(), and clang was adding stupid mov to temp variable on the stack, killing the perf. This issue was solved with diff --git a/arch/x86/include/asm/checksum_64.h b/arch/x86/include/asm/checksum_64.h index 9af3aed54c6b945e1216719db6889d38ef47cce7..56981943d49cbaa934f7dbac9afb1f575a2437b9 100644 --- a/arch/x86/include/asm/checksum_64.h +++ b/arch/x86/include/asm/checksum_64.h @@ -174,7 +174,7 @@ static inline unsigned add32_with_carry(unsigned a, unsigned b) asm("addl %2,%0\n\t" "adcl $0,%0" : "=r" (a) - : "0" (a), "rm" (b)); + : "0" (a), "r" (b)); return a; } > Splitting it like that should allow thew compiler to insert > additional instructions between the two 'adc' blocks > making it much more likely that the cpu will schedule them > in parallel with other instructions. > > The extra 5 adc32 have to add 5 clocks (register dependency chain). > The 'mov' ought to be free (register rename) and the extra shift > and adds one clock each - so 3 (maybe 4) clocks. > So the 64bit version really ought to be faster even a s single > asm block. > > Trying to second-guess the x86 cpu is largely impossible :-) > > Oh, and then try the benchmarks of one of the 64bit Atom cpus > used in embedded systems.... > We've some 4core+hyperthreading ones that aren't exactly slow. > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales)
From: Eric Dumazet > Sent: 14 November 2021 15:04 > > On Sun, Nov 14, 2021 at 6:44 AM David Laight <David.Laight@aculab.com> wrote: > > > > From: Eric Dumazet > > > Sent: 11 November 2021 22:31 > > .. > > > That requires an extra add32_with_carry(), which unfortunately made > > > the thing slower for me. > > > > > > I even hardcoded an inline fast_csum_40bytes() and got best results > > > with the 10+1 addl, > > > instead of > > > (5 + 1) acql + mov (needing one extra register) + shift + addl + adcl > > > > Did you try something like: > > sum = buf[0]; > > val = buf[1]: > > asm( > > add64 sum, val > > adc64 sum, buf[2] > > adc64 sum, buf[3] > > adc64 sum, buf[4] > > adc64 sum, 0 > > } > > sum_hi = sum >> 32; > > asm( > > add32 sum, sum_hi > > adc32 sum, 0 > > ) > > This is what I tried. but the last part was using add32_with_carry(), > and clang was adding stupid mov to temp variable on the stack, > killing the perf. Persuading the compiler to generate the required assembler is an art! I also ended up using __builtin_bswap32(sum) when the alignment was 'odd' - the shift expression didn't always get converted to a rotate. Byteswap32 DTRT. I also noticed that any initial checksum was being added in at the end. The 64bit code can almost always handle a 32 bit (or maybe 56bit!) input value and add it in 'for free' into the code that does the initial alignment. I don't remember testing misaligned buffers. But I think it doesn't matter (on any cpu anyone cares about!). Even Sandy bridge can do two memory reads in one clock. So should be able to do a single misaligned read every clock. Which almost certainly means that aligning the addresses is pointless. (Given you're not trying to do the adcx/adox loop.) (Page spanning shouldn't matter.) For buffers that aren't a multiple of 8 bytes it might be best to read the last 8 bytes first and shift left to discard the ones that would get added in twice. This value can be added to the 32bit 'input' checksum. Something like: sum_in += buf[length - 8] << (64 - (length & 7) * 8)); Annoyingly a special case is needed for buffers shorter than 8 bytes to avoid falling off the start of a page. David
On Sun, Nov 14, 2021 at 11:10 AM David Laight <David.Laight@aculab.com> wrote: > > From: Eric Dumazet > > Sent: 14 November 2021 15:04 > > > > On Sun, Nov 14, 2021 at 6:44 AM David Laight <David.Laight@aculab.com> wrote: > > > > > > From: Eric Dumazet > > > > Sent: 11 November 2021 22:31 > > > .. > > > > That requires an extra add32_with_carry(), which unfortunately made > > > > the thing slower for me. > > > > > > > > I even hardcoded an inline fast_csum_40bytes() and got best results > > > > with the 10+1 addl, > > > > instead of > > > > (5 + 1) acql + mov (needing one extra register) + shift + addl + adcl > > > > > > Did you try something like: > > > sum = buf[0]; > > > val = buf[1]: > > > asm( > > > add64 sum, val > > > adc64 sum, buf[2] > > > adc64 sum, buf[3] > > > adc64 sum, buf[4] > > > adc64 sum, 0 > > > } > > > sum_hi = sum >> 32; > > > asm( > > > add32 sum, sum_hi > > > adc32 sum, 0 > > > ) > > > > This is what I tried. but the last part was using add32_with_carry(), > > and clang was adding stupid mov to temp variable on the stack, > > killing the perf. > > Persuading the compile the generate the required assembler is an art! > > I also ended up using __builtin_bswap32(sum) when the alignment > was 'odd' - the shift expression didn't always get converted > to a rotate. Byteswap32 DTRT. > > I also noticed that any initial checksum was being added in at the end. > The 64bit code can almost always handle a 32 bit (or maybe 56bit!) > input value and add it in 'for free' into the code that does the > initial alignment. This has been fixed in V2 : initial csum is used instead of 0 > > I don't remember testing misaligned buffers. > But I think it doesn't matter (on cpu anyone cares about!). > Even Sandy bridge can do two memory reads in one clock. > So should be able to do a single misaligned read every clock. > Which almost certainly means that aligning the addresses is pointless. > (Given you're not trying to do the adcx/adox loop.) > (Page spanning shouldn't matter.) > > For buffers that aren't a multiple of 8 bytes it might be best to > read the last 8 bytes first and shift left to discard the ones that > would get added in twice. > This value can be added to the 32bit 'input' checksum. > Something like: > sum_in += buf[length - 8] << (64 - (length & 7) * 8)); > Annoyingly a special case is needed for buffers shorter than 8 bytes > to avoid falling off the start of a page. Yep, Alexander/Peter proposed this already, and it is implemented in V2 > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales)
diff --git a/arch/x86/lib/csum-partial_64.c b/arch/x86/lib/csum-partial_64.c index e7925d668b680269fb2442766deaf416dc42f9a1..910806a1b954c5fed90020191143d16aec74bf0a 100644 --- a/arch/x86/lib/csum-partial_64.c +++ b/arch/x86/lib/csum-partial_64.c @@ -21,97 +21,99 @@ static inline unsigned short from32to16(unsigned a) } /* - * Do a 64-bit checksum on an arbitrary memory area. + * Do a checksum on an arbitrary memory area. * Returns a 32bit checksum. * * This isn't as time critical as it used to be because many NICs * do hardware checksumming these days. - * - * Things tried and found to not make it faster: - * Manual Prefetching - * Unrolling to an 128 bytes inner loop. - * Using interleaving with more registers to break the carry chains. + * + * Still, with CHECKSUM_COMPLETE this is called to compute + * checksums on IPv6 headers (40 bytes) and other small parts. */ -static unsigned do_csum(const unsigned char *buff, unsigned len) +static unsigned do_csum(const unsigned char *buff, unsigned len, u32 result) { - unsigned odd, count; - unsigned long result = 0; + unsigned odd; - if (unlikely(len == 0)) - return result; odd = 1 & (unsigned long) buff; if (unlikely(odd)) { + if (unlikely(len == 0)) + return result; result = *buff << 8; len--; buff++; } - count = len >> 1; /* nr of 16-bit words.. */ - if (count) { - if (2 & (unsigned long) buff) { - result += *(unsigned short *)buff; - count--; - len -= 2; - buff += 2; - } - count >>= 1; /* nr of 32-bit words.. */ - if (count) { - unsigned long zero; - unsigned count64; - if (4 & (unsigned long) buff) { - result += *(unsigned int *) buff; - count--; - len -= 4; - buff += 4; - } - count >>= 1; /* nr of 64-bit words.. */ - - /* main loop using 64byte blocks */ - zero = 0; - count64 = count >> 3; - while (count64) { - asm("addq 0*8(%[src]),%[res]\n\t" - "adcq 1*8(%[src]),%[res]\n\t" - "adcq 2*8(%[src]),%[res]\n\t" - "adcq 3*8(%[src]),%[res]\n\t" - "adcq 4*8(%[src]),%[res]\n\t" - "adcq 5*8(%[src]),%[res]\n\t" - "adcq 6*8(%[src]),%[res]\n\t" - "adcq 7*8(%[src]),%[res]\n\t" - "adcq %[zero],%[res]" - : [res] "=r" (result) - : [src] "r" (buff), [zero] "r" (zero), - "[res]" (result)); - buff += 64; - count64--; - } - - /* last up to 7 8byte blocks */ - count %= 8; - while (count) { - asm("addq %1,%0\n\t" - "adcq %2,%0\n" - : "=r" (result) - : "m" (*(unsigned long *)buff), - "r" (zero), "0" (result)); - --count; - buff += 8; - } - result = add32_with_carry(result>>32, - result&0xffffffff); - - if (len & 4) { - result += *(unsigned int *) buff; - buff += 4; - } - } + if (unlikely(len >= 64)) { + unsigned long temp64 = result; + do { + asm(" addq 0*8(%[src]),%[res]\n" + " adcq 1*8(%[src]),%[res]\n" + " adcq 2*8(%[src]),%[res]\n" + " adcq 3*8(%[src]),%[res]\n" + " adcq 4*8(%[src]),%[res]\n" + " adcq 5*8(%[src]),%[res]\n" + " adcq 6*8(%[src]),%[res]\n" + " adcq 7*8(%[src]),%[res]\n" + " adcq $0,%[res]" + : [res] "=r" (temp64) + : [src] "r" (buff), "[res]" (temp64) + : "memory"); + buff += 64; + len -= 64; + } while (len >= 64); + result = add32_with_carry(temp64 >> 32, temp64 & 0xffffffff); + } + if (len & 32) { + asm(" addl 0*4(%[src]),%[res]\n" + " adcl 1*4(%[src]),%[res]\n" + " adcl 2*4(%[src]),%[res]\n" + " adcl 3*4(%[src]),%[res]\n" + " adcl 4*4(%[src]),%[res]\n" + " adcl 5*4(%[src]),%[res]\n" + " adcl 6*4(%[src]),%[res]\n" + " adcl 7*4(%[src]),%[res]\n" + " adcl $0,%[res]" + : [res] "=r" (result) + : [src] "r" (buff), "[res]" (result) + : "memory"); + buff += 32; + } + if (len & 16) { + asm(" addl 0*4(%[src]),%[res]\n" + " adcl 1*4(%[src]),%[res]\n" + 
" adcl 2*4(%[src]),%[res]\n" + " adcl 3*4(%[src]),%[res]\n" + " adcl $0,%[res]" + : [res] "=r" (result) + : [src] "r" (buff), "[res]" (result) + : "memory"); + buff += 16; + } + if (len & 8) { + asm(" addl 0*4(%[src]),%[res]\n" + " adcl 1*4(%[src]),%[res]\n" + " adcl $0,%[res]" + : [res] "=r" (result) + : [src] "r" (buff), "[res]" (result) + : "memory"); + buff += 8; + } + if (len & 4) { + asm(" addl 0*4(%[src]),%[res]\n" + " adcl $0,%[res]\n" + : [res] "=r" (result) + : [src] "r" (buff), "[res]" (result) + : "memory"); + buff += 4; + } + if (len & 3U) { + result = from32to16(result); if (len & 2) { result += *(unsigned short *) buff; buff += 2; } + if (len & 1) + result += *buff; } - if (len & 1) - result += *buff; - result = add32_with_carry(result>>32, result & 0xffffffff); if (unlikely(odd)) { result = from32to16(result); result = ((result >> 8) & 0xff) | ((result & 0xff) << 8); @@ -133,8 +135,7 @@ static unsigned do_csum(const unsigned char *buff, unsigned len) */ __wsum csum_partial(const void *buff, int len, __wsum sum) { - return (__force __wsum)add32_with_carry(do_csum(buff, len), - (__force u32)sum); + return (__force __wsum)do_csum(buff, len, (__force u32)sum); } EXPORT_SYMBOL(csum_partial);