diff mbox series

[v3] arm64: lib: accelerate do_csum with NEON instruction

Message ID 1546739729-17234-1-git-send-email-huanglingyan2@huawei.com (mailing list archive)
State New, archived
Headers show
Series [v3] arm64: lib: accelerate do_csum with NEON instruction | expand

Commit Message

huanglingyan (A) Jan. 6, 2019, 1:55 a.m. UTC
Function do_csum() in lib/checksum.c is used to compute the checksum,
which turns out to be slow and to cost a lot of CPU resources.
Let's use NEON instructions to accelerate the checksum computation
for arm64.

------
V2 ==> V3:
    only modify the arm64 codes instead of modifying headers
    under asm-generic and code in lib/checksum.c.
------
------
V1 ==> V2:
    Change NEON assembly code to NEON intrinsic code which is built
    on top of arm_neon.h to avoid dropping into assembly.
------

Here are the comparison results for function ip_compute_csum() with the
generic do_csum() and the NEON do_csum(). The test platform
is a HUAWEI 1620 server with TAISHAN cores.

        len(1000cycle)      general(ns)        do_csum_neon(ns)
          64B:                58060                 59460
         128B:                82930                 83930
         256B:               132480                 73570
         512B:               230100                 86230
        1024B:               426600                 98200


Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
---
 arch/arm64/include/asm/checksum.h |   5 ++
 arch/arm64/lib/Makefile           |   8 +--
 arch/arm64/lib/checksum.c         |  26 ++++++++
 arch/arm64/lib/do_csum.c          | 136 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 171 insertions(+), 4 deletions(-)
 create mode 100644 arch/arm64/lib/checksum.c
 create mode 100644 arch/arm64/lib/do_csum.c

Comments

Ard Biesheuvel Jan. 6, 2019, 8:26 a.m. UTC | #1
On Sun, 6 Jan 2019 at 02:56, Lingyan Huang <huanglingyan2@huawei.com> wrote:
>
> Function do_csum() in lib/checksum.c is used to compute checksum,
> which is turned out to be slowly and costs a lot of resources.
> Let's use neon instructions to accelerate the checksum computation
> for arm64.
>
> ------
> V2 ==> V3:
>     only modify the arm64 codes instead of modifying headers
>     under asm-generic and code in lib/checksum.c.
> ------
> ------
> V1 ==> V2:
>     Change NEON assembly code to NEON intrinsic code which is built
>     on top of arm_neon.h to avoid dropping into assembly.
> ------
>

Please put the changelog between the --- below and the diffstat

> Here is the comparison results of function ip_compute_csum() between
> general do_csum() and neon instruction do_csum(). The test platform
> is HUAWEI 1620 server with TAISHAN cores.
>
>         len(1000cycle)      general(ns)        do_csum_neon(ns)
>           64B:                58060                 59460
>          128B:                82930                 83930
>          256B:               132480                 73570
>          512B:               230100                 86230
>         1024B:               426600                 98200
>

Again, very nice performance. How did you test for correctness?

>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
> ---
>  arch/arm64/include/asm/checksum.h |   5 ++
>  arch/arm64/lib/Makefile           |   8 +--
>  arch/arm64/lib/checksum.c         |  26 ++++++++
>  arch/arm64/lib/do_csum.c          | 136 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 171 insertions(+), 4 deletions(-)
>  create mode 100644 arch/arm64/lib/checksum.c
>  create mode 100644 arch/arm64/lib/do_csum.c
>
> diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
> index 0b6f5a7..7acd713 100644
> --- a/arch/arm64/include/asm/checksum.h
> +++ b/arch/arm64/include/asm/checksum.h
> @@ -26,6 +26,10 @@ static inline __sum16 csum_fold(__wsum csum)
>  }
>  #define csum_fold csum_fold
>
> +#define do_csum do_csum
> +unsigned int do_csum(const unsigned char *buff, int len);
> +extern unsigned int do_csum_arm(const unsigned char *buff, int len);
> +
>  static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  {
>         __uint128_t tmp;
> @@ -46,6 +50,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  }
>  #define ip_fast_csum ip_fast_csum
>
> +

Drop this whitespace-only change

>  #include <asm-generic/checksum.h>
>
>  #endif /* __ASM_CHECKSUM_H */
> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
> index 5540a16..c0b5b8c 100644
> --- a/arch/arm64/lib/Makefile
> +++ b/arch/arm64/lib/Makefile
> @@ -3,12 +3,12 @@ lib-y         := clear_user.o delay.o copy_from_user.o                \
>                    copy_to_user.o copy_in_user.o copy_page.o            \
>                    clear_page.o memchr.o memcpy.o memmove.o memset.o    \
>                    memcmp.o strcmp.o strncmp.o strlen.o strnlen.o       \
> -                  strchr.o strrchr.o tishift.o
> +                  strchr.o strrchr.o tishift.o checksum.o
>
>  ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
> -obj-$(CONFIG_XOR_BLOCKS)       += xor-neon.o
> -CFLAGS_REMOVE_xor-neon.o       += -mgeneral-regs-only
> -CFLAGS_xor-neon.o              += -ffreestanding
> +obj-y    += do_csum.o

Please indent aligned with the others

> +CFLAGS_REMOVE_do_csum.o        += -mgeneral-regs-only
> +CFLAGS_do_csum.o               += -ffreestanding
>  endif
>
>  # Tell the compiler to treat all general purpose registers (with the
> diff --git a/arch/arm64/lib/checksum.c b/arch/arm64/lib/checksum.c
> new file mode 100644
> index 0000000..15a31bb
> --- /dev/null
> +++ b/arch/arm64/lib/checksum.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + *
> + * Authors: Lingyan Huang <huanglingyan2@huawei.com>
> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
> + *
> + * Generic C or neon implementation of do_csum operations.
> + * Choose faster neon instructions when NEON is supported.
> + *
> + */
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <asm/checksum.h>
> +
> +#define CSUM_NEON_THRESHOLD 128
> +
> +unsigned int do_csum(const unsigned char *buff, int len)
> +{
> +#ifdef CONFIG_KERNEL_MODE_NEON
> +       if (len >= CSUM_NEON_THRESHOLD)

Please change this into

if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
    len >= CSUM_NEON_THRESHOLD &&
    may_use_simd()) {
        kernel_neon_begin();
        res = do_csum_neon(buff, len);
        kernel_neon_end();
    }

and drop the intermediate do_csum_arm()
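For illustration, a sketch of what the combined routine could then look like (this
assumes a local scalar fallback, called do_csum_generic() here purely as a placeholder,
is provided for short buffers and for the !may_use_simd() case):

unsigned int do_csum(const unsigned char *buff, int len)
{
	unsigned int res;

	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
	    len >= CSUM_NEON_THRESHOLD && may_use_simd()) {
		kernel_neon_begin();
		res = do_csum_neon(buff, len);
		kernel_neon_end();
		return res;
	}

	/* Placeholder scalar fallback: the generic do_csum() in
	 * lib/checksum.c is static, so the arch code would need to
	 * carry its own copy (or equivalent) for this path. */
	return do_csum_generic(buff, len);
}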

> +               return do_csum_arm(buff, len);
> +#endif  /* CONFIG_KERNEL_MODE_NEON */

No else? What happens if len < CSUM_NEON_THRESHOLD ?

> +#undef do_csum

Can we drop this?

> +       return 0;
> +}
> diff --git a/arch/arm64/lib/do_csum.c b/arch/arm64/lib/do_csum.c
> new file mode 100644
> index 0000000..893583f
> --- /dev/null
> +++ b/arch/arm64/lib/do_csum.c
> @@ -0,0 +1,136 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Authors: Lingyan Huang <huanglingyan2@huawei.com>
> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
> + *
> + * Generic C or neon implementation of do_csum operations.
> + * Choose faster neon instructions when NEON is supported.
> + *
> + */
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <asm/checksum.h>
> +#include <asm/byteorder.h>
> +#include <asm/neon-intrinsics.h>
> +
> +#define CSUM_NEON_THRESHOLD 128

Drop this - it is not used in this file

> +#ifdef CONFIG_KERNEL_MODE_NEON

This file is only built if KERNEL_MODE_NEON=y so the #ifdef can be dropped

> +static inline u32 from64to16(u64 x)
> +{
> +       /* add up 32-bit and 32-bit for 32+c bit */
> +       x = (x & 0xffffffff) + (x >> 32);
> +       /* add up carry.. */
> +       x = (x & 0xffffffff) + (x >> 32);
> +       /* add up 16-bit and 16-bit for 16+c bit */
> +       x = ((u32)x & 0xffff) + ((u32)x >> 16);
> +       /* add up carry.. */
> +       x = ((u32)x & 0xffff) + ((u32)x >> 16);
> +       return x;
> +}
> +
> +unsigned int do_csum_neon(const unsigned char *buff, int len)
> +{
> +       unsigned int odd, count;
> +       uint64_t result = 0;
> +       unsigned int count64;
> +       uint32x4_t vzero = (uint32x4_t){0, 0, 0, 0};
> +
> +       register uint32x4_t v0, v1, v2, v3;
> +
> +       if (unlikely(len <= 0))
> +               return result;
> +
> +       odd = 1 & (unsigned long)buff;
> +       if (unlikely(odd)) {
> +               result = *buff;
> +               len--;
> +               buff++;
> +       }
> +
> +       count = len >> 1;
> +       if (count) {
> +               if (2 & (unsigned long)buff) {
> +                       result += *(unsigned short *)buff;
> +                       count--;
> +                       len -= 2;
> +                       buff += 2;
> +               }
> +               count >>= 1;            /* nr of 32-bit words.. */
> +               if (count) {
> +                       if (4 & (unsigned long)buff) {
> +                               result += *(unsigned int *)buff;
> +                               count--;
> +                               len -= 4;
> +                               buff += 4;
> +                       }
> +                       count >>= 1;    /* nr of 64-bit words.. */
> +
> +                       v0 = vzero;
> +                       v1 = vzero;
> +                       v2 = vzero;
> +                       v3 = vzero;
> +
> +                       count64 = count >> 3;  /* compute 64 Byte circle */
> +                       while (count64) {
> +                               v0 = vpadalq_u16(v0,
> +                                       vld1q_u16((uint16_t *)buff + 0));
> +                               v1 = vpadalq_u16(v1,
> +                                       vld1q_u16((uint16_t *)buff + 8));
> +                               v2 = vpadalq_u16(v2,
> +                                       vld1q_u16((uint16_t *)buff + 16));
> +                               v3 = vpadalq_u16(v3,
> +                                       vld1q_u16((uint16_t *)buff + 24));
> +                               buff += 64;
> +                               count64--;
> +                       }
> +                       v0 = vaddq_u32(v0, v1);
> +                       v2 = vaddq_u32(v2, v3);
> +                       v0 = vaddq_u32(v0, v2);
> +
> +                       count %= 8;
> +                       while (count >= 2) { /* compute 16 byte circle */
> +                               v0 = vpadalq_u16(v0,
> +                                       vld1q_u16((uint16_t *)buff + 0));
> +                               buff += 16;
> +                               count -= 2;
> +                       }
> +
> +                       result += vgetq_lane_u32(v0, 0);
> +                       result += vgetq_lane_u32(v0, 1);
> +                       result += vgetq_lane_u32(v0, 2);
> +                       result += vgetq_lane_u32(v0, 3);
> +                       if (count & 1) {
> +                               result += *(unsigned long long *)buff;
> +                               buff += 8;
> +                       }
> +                       if (len & 4) {
> +                               result += *(unsigned int *)buff;
> +                               buff += 4;
> +                       }
> +               }
> +               if (len & 2) {
> +                       result += *(unsigned short *)buff;
> +                       buff += 2;
> +               }
> +       }
> +       if (len & 1)
> +               result += *buff;
> +       result = from64to16(result);
> +       if (odd)
> +               result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> +       return result;
> +}
> +#endif
> +
> +
> +unsigned int do_csum_arm(const unsigned char *buff, int len)
> +{
> +       unsigned int res;
> +
> +       kernel_neon_begin();
> +       res = do_csum_neon(buff, len);
> +       kernel_neon_end();
> +       return res;
> +}

As I said above, please drop this intermediate function and fold the
logic into do_csum()
Will Deacon Jan. 8, 2019, 1:54 p.m. UTC | #2
[re-adding Ard and LAKML -- not sure why the headers are so munged]

On Mon, Jan 07, 2019 at 10:38:55AM +0800, huanglingyan (A) wrote:
> On 2019/1/6 16:26, Ard Biesheuvel wrote:
>     Please change this into
> 
>     if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
>         len >= CSUM_NEON_THRESHOLD &&
>         may_use_simd()) {
>             kernel_neon_begin();
>             res = do_csum_neon(buff, len);
>             kernel_neon_end();
>         }
> 
>     and drop the intermediate do_csum_arm()
> 
> 
>         +               return do_csum_arm(buff, len);
>         +#endif  /* CONFIG_KERNEL_MODE_NEON */
> 
>     No else? What happens if len < CSUM_NEON_THRESHOLD ?
> 
> 
>         +#undef do_csum
> 
>     Can we drop this?
> 
> Using NEON instructions will bring some costs. The spending maybe introduced
> when reservering/restoring
> neon registers with kernel_neon_begin()/kernel_neon_end(). Therefore NEON code
> is Only used when
> the length exceeds CSUM_NEON_THRESHOLD. General do csum() codes in lib/
> checksum.c will be used in
> shorter length. To achieve this goal, I use the "#undef do_csum" in else clause
> to have the oppotunity to
> utilize the general codes.

I don't think that's how it works :/

Before we get deeper into the implementation, please could you justify the
need for a CPU-optimised checksum implementation at all? I thought this was
usually offloaded to the NIC?

Will
huanglingyan (A) Jan. 9, 2019, 2:03 a.m. UTC | #3
On 2019/1/8 21:54, Will Deacon wrote:
> [re-adding Ard and LAKML -- not sure why the headers are so munged]
>
> On Mon, Jan 07, 2019 at 10:38:55AM +0800, huanglingyan (A) wrote:
>> On 2019/1/6 16:26, Ard Biesheuvel wrote:
>>     Please change this into
>>
>>     if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
>>         len >= CSUM_NEON_THRESHOLD &&
>>         may_use_simd()) {
>>             kernel_neon_begin();
>>             res = do_csum_neon(buff, len);
>>             kernel_neon_end();
>>         }
>>
>>     and drop the intermediate do_csum_arm()
>>
>>
>>         +               return do_csum_arm(buff, len);
>>         +#endif  /* CONFIG_KERNEL_MODE_NEON */
>>
>>     No else? What happens if len < CSUM_NEON_THRESHOLD ?
>>
>>
>>         +#undef do_csum
>>
>>     Can we drop this?
>>
>> Using NEON instructions will bring some costs. The spending maybe introduced
>> when reservering/restoring
>> neon registers with kernel_neon_begin()/kernel_neon_end(). Therefore NEON code
>> is Only used when
>> the length exceeds CSUM_NEON_THRESHOLD. General do csum() codes in lib/
>> checksum.c will be used in
>> shorter length. To achieve this goal, I use the "#undef do_csum" in else clause
>> to have the oppotunity to
>> utilize the general codes.
> I don't think that's how it works :/
>
> Before we get deeper into the implementation, please could you justify the
> need for a CPU-optimised checksum implementation at all? I thought this was
> usually offloaded to the NIC?
>
> Will
>
> .
This problem was found when testing an Intel X710 network card on my ARM server.
IP forwarding is set up for ease of testing. A Tesgine machine then sends lots of
packets to the server, which receives and forwards them.

The bandwidth is 9.5 Gbps on Intel 8180 but only 5.8 Gbps on ARM. perf shows that
do_csum() takes 36% of the CPU time on ARM and only 6% on Intel. That's why I
decided to optimise the do_csum() function for ARM.

As a newbie to the Linux kernel, I have little knowledge of how such situations are
usually handled. I am looking forward to your help in improving this patch.

Lingyan Huang
.

Dave Martin Jan. 9, 2019, 2:58 p.m. UTC | #4
On Sun, Jan 06, 2019 at 09:55:29AM +0800, Lingyan Huang wrote:
> Function do_csum() in lib/checksum.c is used to compute checksum,
> which is turned out to be slowly and costs a lot of resources.
> Let's use neon instructions to accelerate the checksum computation
> for arm64.
> 
> ------
> V2 ==> V3:
>     only modify the arm64 codes instead of modifying headers
>     under asm-generic and code in lib/checksum.c.
> ------
> ------
> V1 ==> V2:
>     Change NEON assembly code to NEON intrinsic code which is built
>     on top of arm_neon.h to avoid dropping into assembly.
> ------
> 
> Here is the comparison results of function ip_compute_csum() between
> general do_csum() and neon instruction do_csum(). The test platform 
> is HUAWEI 1620 server with TAISHAN cores.
> 
>         len(1000cycle)      general(ns)        do_csum_neon(ns)
>           64B:                58060                 59460
>          128B:                82930                 83930
>          256B:               132480                 73570
>          512B:               230100                 86230
>         1024B:               426600                 98200

For testing purposes, you would need to cover all possible alignments of
buff and all values of (len % 16), as well as testing on big- and
little-endian.

Otherwise, a lot of code for handling edge cases won't be tested.


Also, it would be interesting to know which sizes are actually most
common at runtime, to avoid falling into the trap of optimising for
rare cases at the expense of the more common cases.

You could try adding some instrumentation to collect some statistics on
this.
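
As a concrete illustration, a user-space harness along the following lines would cover
those combinations (a sketch only: it assumes the NEON routine can be built standalone
against arm_neon.h, and csum_ref() is a placeholder for a known-good scalar reference
such as a copy of the generic lib/checksum.c code). Big- and little-endian coverage
would still need separate runs on suitable targets.

#include <stdio.h>
#include <stdlib.h>

/* Placeholders: the NEON routine built out of tree, plus a trusted
 * scalar reference implementation to compare against. */
unsigned int do_csum_neon(const unsigned char *buff, int len);
unsigned int csum_ref(const unsigned char *buff, int len);

int main(void)
{
	static unsigned char buf[4096 + 16];
	unsigned int i, align, len;

	srand(1);
	for (i = 0; i < sizeof(buf); i++)
		buf[i] = rand() & 0xff;

	/* Sweep alignments 0..15 and every length up to 4096 so that
	 * the odd/2-byte/4-byte head and tail paths are all exercised. */
	for (align = 0; align < 16; align++) {
		for (len = 1; len <= 4096; len++) {
			unsigned int a = do_csum_neon(buf + align, len);
			unsigned int b = csum_ref(buf + align, len);

			if (a != b) {
				printf("mismatch: align=%u len=%u neon=%#x ref=%#x\n",
				       align, len, a, b);
				return 1;
			}
		}
	}
	printf("all alignment/length combinations match\n");
	return 0;
}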

> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
> ---
>  arch/arm64/include/asm/checksum.h |   5 ++
>  arch/arm64/lib/Makefile           |   8 +--
>  arch/arm64/lib/checksum.c         |  26 ++++++++
>  arch/arm64/lib/do_csum.c          | 136 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 171 insertions(+), 4 deletions(-)
>  create mode 100644 arch/arm64/lib/checksum.c
>  create mode 100644 arch/arm64/lib/do_csum.c
> 
> diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
> index 0b6f5a7..7acd713 100644
> --- a/arch/arm64/include/asm/checksum.h
> +++ b/arch/arm64/include/asm/checksum.h
> @@ -26,6 +26,10 @@ static inline __sum16 csum_fold(__wsum csum)
>  }
>  #define csum_fold csum_fold
>  
> +#define do_csum do_csum
> +unsigned int do_csum(const unsigned char *buff, int len);
> +extern unsigned int do_csum_arm(const unsigned char *buff, int len);
> +
>  static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  {
>  	__uint128_t tmp;
> @@ -46,6 +50,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>  }
>  #define ip_fast_csum ip_fast_csum
>  
> +
>  #include <asm-generic/checksum.h>
>  
>  #endif	/* __ASM_CHECKSUM_H */
> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
> index 5540a16..c0b5b8c 100644
> --- a/arch/arm64/lib/Makefile
> +++ b/arch/arm64/lib/Makefile
> @@ -3,12 +3,12 @@ lib-y		:= clear_user.o delay.o copy_from_user.o		\
>  		   copy_to_user.o copy_in_user.o copy_page.o		\
>  		   clear_page.o memchr.o memcpy.o memmove.o memset.o	\
>  		   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o	\
> -		   strchr.o strrchr.o tishift.o
> +		   strchr.o strrchr.o tishift.o checksum.o
>  
>  ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
> -obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
> -CFLAGS_REMOVE_xor-neon.o	+= -mgeneral-regs-only
> -CFLAGS_xor-neon.o		+= -ffreestanding

Did you mean to delete these lines?

> +obj-y    += do_csum.o
> +CFLAGS_REMOVE_do_csum.o	+= -mgeneral-regs-only
> +CFLAGS_do_csum.o		+= -ffreestanding
>  endif
>  
>  # Tell the compiler to treat all general purpose registers (with the
> diff --git a/arch/arm64/lib/checksum.c b/arch/arm64/lib/checksum.c
> new file mode 100644
> index 0000000..15a31bb
> --- /dev/null
> +++ b/arch/arm64/lib/checksum.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + *
> + * Authors: Lingyan Huang <huanglingyan2@huawei.com>
> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
> + *
> + * Generic C or neon implementation of do_csum operations.
> + * Choose faster neon instructions when NEON is supported.
> + *
> + */
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>
> +#include <asm/checksum.h>
> +
> +#define CSUM_NEON_THRESHOLD 128
> +
> +unsigned int do_csum(const unsigned char *buff, int len)
> +{
> +#ifdef CONFIG_KERNEL_MODE_NEON
> +	if (len >= CSUM_NEON_THRESHOLD)
> +		return do_csum_arm(buff, len);
> +#endif  /* CONFIG_KERNEL_MODE_NEON */
> +#undef do_csum
> +	return 0;
> +}
> diff --git a/arch/arm64/lib/do_csum.c b/arch/arm64/lib/do_csum.c
> new file mode 100644
> index 0000000..893583f
> --- /dev/null
> +++ b/arch/arm64/lib/do_csum.c
> @@ -0,0 +1,136 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Authors: Lingyan Huang <huanglingyan2@huawei.com>
> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
> + *
> + * Generic C or neon implementation of do_csum operations.
> + * Choose faster neon instructions when NEON is supported.
> + *
> + */
> +
> +#include <asm/neon.h>
> +#include <asm/simd.h>

Unless you call may_use_simd() (see below) I can't see why you need to
include <asm/simd.h>.

> +#include <asm/checksum.h>
> +#include <asm/byteorder.h>
> +#include <asm/neon-intrinsics.h>
> +
> +#define CSUM_NEON_THRESHOLD 128
> +#ifdef CONFIG_KERNEL_MODE_NEON
> +static inline u32 from64to16(u64 x)
> +{
> +	/* add up 32-bit and 32-bit for 32+c bit */
> +	x = (x & 0xffffffff) + (x >> 32);
> +	/* add up carry.. */
> +	x = (x & 0xffffffff) + (x >> 32);
> +	/* add up 16-bit and 16-bit for 16+c bit */
> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
> +	/* add up carry.. */
> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
> +	return x;
> +}
> +
> +unsigned int do_csum_neon(const unsigned char *buff, int len)
> +{
> +	unsigned int odd, count;
> +	uint64_t result = 0;
> +	unsigned int count64;
> +	uint32x4_t vzero = (uint32x4_t){0, 0, 0, 0};
> +
> +	register uint32x4_t v0, v1, v2, v3;

Is "register" needed here?  Is there any impact on performance?

Usually it's best to leave register allocation decisions up to the
compiler.

> +
> +	if (unlikely(len <= 0))
> +		return result;
> +
> +	odd = 1 & (unsigned long)buff;
> +	if (unlikely(odd)) {
> +		result = *buff;

The generic code has a shift here for the little-endian case.  Why don't
we need that here?
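
For reference, the generic version in lib/checksum.c handles the stray leading and
trailing bytes differently depending on endianness, roughly as follows (paraphrased
here, not a verbatim quote):

	odd = 1 & (unsigned long)buff;
	if (odd) {
#ifdef __LITTLE_ENDIAN
		result += (*buff << 8);	/* odd-address byte is the high half of its word */
#else
		result = *buff;
#endif
		len--;
		buff++;
	}

	/* ... */

	if (len & 1)
#ifdef __LITTLE_ENDIAN
		result += *buff;	/* trailing byte is the low half of its word */
#else
		result += (*buff << 8);
#endif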

> +		len--;
> +		buff++;
> +	}
> +
> +	count = len >> 1;
> +	if (count) {
> +		if (2 & (unsigned long)buff) {
> +			result += *(unsigned short *)buff;
> +			count--;
> +			len -= 2;
> +			buff += 2;
> +		}
> +		count >>= 1;            /* nr of 32-bit words.. */
> +		if (count) {
> +			if (4 & (unsigned long)buff) {
> +				result += *(unsigned int *)buff;
> +				count--;
> +				len -= 4;
> +				buff += 4;
> +			}
> +			count >>= 1;    /* nr of 64-bit words.. */
> +
> +			v0 = vzero;
> +			v1 = vzero;
> +			v2 = vzero;
> +			v3 = vzero;
> +
> +			count64 = count >> 3;  /* compute 64 Byte circle */
> +			while (count64) {
> +				v0 = vpadalq_u16(v0,
> +					vld1q_u16((uint16_t *)buff + 0));

Can this loop iterate more than 65536 times?  If it can, it looks like
we can overflow.

(I think the initial value of len would have to be > 0x400000 in order
for this to happen.)

> +				v1 = vpadalq_u16(v1,
> +					vld1q_u16((uint16_t *)buff + 8));
> +				v2 = vpadalq_u16(v2,
> +					vld1q_u16((uint16_t *)buff + 16));
> +				v3 = vpadalq_u16(v3,
> +					vld1q_u16((uint16_t *)buff + 24));
> +				buff += 64;
> +				count64--;
> +			}
> +			v0 = vaddq_u32(v0, v1);
> +			v2 = vaddq_u32(v2, v3);
> +			v0 = vaddq_u32(v0, v2);

Can't we defer the folding down until later?  We could just accumulate
the next 16 bytes' result into v0, and do the folding all in one go
later on.
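
In other words, something like the following rearrangement (a sketch of the suggestion:
keep the 16-byte tail loop accumulating into v0 and do all the folding afterwards):

			/* ... 64-byte main loop as before, filling v0..v3 ... */

			count %= 8;
			while (count >= 2) {	/* 16-byte tail loop */
				v0 = vpadalq_u16(v0,
					vld1q_u16((uint16_t *)buff));
				buff += 16;
				count -= 2;
			}

			/* Fold the four accumulators in one go, after both loops */
			v0 = vaddq_u32(v0, v1);
			v2 = vaddq_u32(v2, v3);
			v0 = vaddq_u32(v0, v2);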

> +
> +			count %= 8;
> +			while (count >= 2) { /* compute 16 byte circle */
> +				v0 = vpadalq_u16(v0,
> +					vld1q_u16((uint16_t *)buff + 0));
> +				buff += 16;
> +				count -= 2;
> +			}
> +
> +			result += vgetq_lane_u32(v0, 0);
> +			result += vgetq_lane_u32(v0, 1);
> +			result += vgetq_lane_u32(v0, 2);
> +			result += vgetq_lane_u32(v0, 3);
> +			if (count & 1) {
> +				result += *(unsigned long long *)buff;
> +				buff += 8;
> +			}
> +			if (len & 4) {
> +				result += *(unsigned int *)buff;
> +				buff += 4;
> +			}
> +		}
> +		if (len & 2) {
> +			result += *(unsigned short *)buff;
> +			buff += 2;
> +		}
> +	}
> +	if (len & 1)
> +		result += *buff;

What about the little-endian case?

> +	result = from64to16(result);
> +	if (odd)
> +		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> +	return result;
> +}
> +#endif
> +
> +
> +unsigned int do_csum_arm(const unsigned char *buff, int len)
> +{
> +	unsigned int res;
> +
> +	kernel_neon_begin();
> +	res = do_csum_neon(buff, len);

If len can be large, you should split the work into smaller blocks, with
kernel_neon_end()/_begin() in between to provide a chance to preempt if len
is large.  (Splitting into smaller blocks may also help avoid overflow.)

Ard may be able to suggest how often this should be done.
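
One possible shape for that is sketched below (illustration only: the 4 KiB block size
is an arbitrary choice, and it relies on per-block results combining correctly under
ones'-complement addition, which holds as long as the block size is even):

unsigned int do_csum(const unsigned char *buff, int len)
{
	u64 sum = 0;

	while (len > 0) {
		int chunk = min(len, 4096);

		kernel_neon_begin();
		sum += do_csum_neon(buff, chunk);
		kernel_neon_end();

		/* Dropping out of NEON between blocks gives preemption a
		 * chance and keeps the per-call accumulation bounded. */
		buff += chunk;
		len -= chunk;
	}

	return from64to16(sum);
}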

Also, there is no guarantee that you can use NEON in softirq context,
because the kernel may already have been using NEON when the softirq
fired.  If this code may get called from softirq context, then you would
need something along the lines of

	if (may_use_simd()) {
		kernel_neon_begin();
		/* NEON accelerated code */
		kernel_neon_end();
	} else {
		/* Fallback C code */
	}

[...]

Cheers
---Dave
胡海 Jan. 10, 2019, 4:08 a.m. UTC | #5
hi, huanglingyan

Have you verified the correctness of do_csum?
I validated it on FT2000+ and do_csum got the wrong result.

You can test it by increasing the input length to more than 60000 bytes and
setting every byte of buff to 0xff.
I think it's possible that vpadalq_u16 will overflow, and the code
doesn't take this into account.
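
A sketch of that probe (user space, again assuming the NEON routine is built standalone;
for an all-0xff buffer of even length the correctly folded ones'-complement sum is always
0xffff, so any other value points at an accumulation problem):

#include <stdio.h>
#include <string.h>

unsigned int do_csum_neon(const unsigned char *buff, int len);

int main(void)
{
	/* Large enough to stress the 32-bit lane accumulators as well */
	static unsigned char buf[8 << 20];

	memset(buf, 0xff, sizeof(buf));
	printf("csum = %#x (expected 0xffff)\n",
	       do_csum_neon(buf, (int)sizeof(buf)));
	return 0;
}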
huanglingyan (A) Jan. 10, 2019, 8:03 a.m. UTC | #6
On 2019/1/9 22:58, Dave Martin wrote:
> On Sun, Jan 06, 2019 at 09:55:29AM +0800, Lingyan Huang wrote:
>> Function do_csum() in lib/checksum.c is used to compute checksum,
>> which is turned out to be slowly and costs a lot of resources.
>> Let's use neon instructions to accelerate the checksum computation
>> for arm64.
>>
>> ------
>> V2 ==> V3:
>>     only modify the arm64 codes instead of modifying headers
>>     under asm-generic and code in lib/checksum.c.
>> ------
>> ------
>> V1 ==> V2:
>>     Change NEON assembly code to NEON intrinsic code which is built
>>     on top of arm_neon.h to avoid dropping into assembly.
>> ------
>>
>> Here is the comparison results of function ip_compute_csum() between
>> general do_csum() and neon instruction do_csum(). The test platform 
>> is HUAWEI 1620 server with TAISHAN cores.
>>
>>         len(1000cycle)      general(ns)        do_csum_neon(ns)
>>           64B:                58060                 59460
>>          128B:                82930                 83930
>>          256B:               132480                 73570
>>          512B:               230100                 86230
>>         1024B:               426600                 98200
> For testing purposes, you would need to cover all possible aligments of
> buff and all values of (len % 16), as well as testing on big- and
> little-endian.
>
> Otherwise, a lot of code for handling edge cases won't be tested.
>
>
> Also, it would be interesting to know which sizes are actually most
> common at runtime, to avoid falling into the trap of optimising for
> rare cases at the expense of the more common cases.
>
> You could try adding some instrumentation to collect some statistics on
> this.
Different alignments of buff and different lengths should indeed be
shown here. The test report will be more detailed, as you said.

As far as I know, the main use of do_csum() is checksumming network packets.
The packet length is usually between 64 bytes and the MTU (1500 bytes by default).

My test platform is little-endian. I will try my best to find a big-endian
test platform.
>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will.deacon@arm.com>
>> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> Signed-off-by: Lingyan Huang <huanglingyan2@huawei.com>
>> ---
>>  arch/arm64/include/asm/checksum.h |   5 ++
>>  arch/arm64/lib/Makefile           |   8 +--
>>  arch/arm64/lib/checksum.c         |  26 ++++++++
>>  arch/arm64/lib/do_csum.c          | 136 ++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 171 insertions(+), 4 deletions(-)
>>  create mode 100644 arch/arm64/lib/checksum.c
>>  create mode 100644 arch/arm64/lib/do_csum.c
>>
>> diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
>> index 0b6f5a7..7acd713 100644
>> --- a/arch/arm64/include/asm/checksum.h
>> +++ b/arch/arm64/include/asm/checksum.h
>> @@ -26,6 +26,10 @@ static inline __sum16 csum_fold(__wsum csum)
>>  }
>>  #define csum_fold csum_fold
>>  
>> +#define do_csum do_csum
>> +unsigned int do_csum(const unsigned char *buff, int len);
>> +extern unsigned int do_csum_arm(const unsigned char *buff, int len);
>> +
>>  static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>>  {
>>  	__uint128_t tmp;
>> @@ -46,6 +50,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
>>  }
>>  #define ip_fast_csum ip_fast_csum
>>  
>> +
>>  #include <asm-generic/checksum.h>
>>  
>>  #endif	/* __ASM_CHECKSUM_H */
>> diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
>> index 5540a16..c0b5b8c 100644
>> --- a/arch/arm64/lib/Makefile
>> +++ b/arch/arm64/lib/Makefile
>> @@ -3,12 +3,12 @@ lib-y		:= clear_user.o delay.o copy_from_user.o		\
>>  		   copy_to_user.o copy_in_user.o copy_page.o		\
>>  		   clear_page.o memchr.o memcpy.o memmove.o memset.o	\
>>  		   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o	\
>> -		   strchr.o strrchr.o tishift.o
>> +		   strchr.o strrchr.o tishift.o checksum.o
>>  
>>  ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
>> -obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
>> -CFLAGS_REMOVE_xor-neon.o	+= -mgeneral-regs-only
>> -CFLAGS_xor-neon.o		+= -ffreestanding
> Did you mean to delete these lines? 

It's my mistake.

>> +obj-y    += do_csum.o
>> +CFLAGS_REMOVE_do_csum.o	+= -mgeneral-regs-only
>> +CFLAGS_do_csum.o		+= -ffreestanding
>>  endif
>>  
>>  # Tell the compiler to treat all general purpose registers (with the
>> diff --git a/arch/arm64/lib/checksum.c b/arch/arm64/lib/checksum.c
>> new file mode 100644
>> index 0000000..15a31bb
>> --- /dev/null
>> +++ b/arch/arm64/lib/checksum.c
>> @@ -0,0 +1,26 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + *
>> + * Authors: Lingyan Huang <huanglingyan2@huawei.com>
>> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
>> + *
>> + * Generic C or neon implementation of do_csum operations.
>> + * Choose faster neon instructions when NEON is supported.
>> + *
>> + */
>> +
>> +#include <asm/neon.h>
>> +#include <asm/simd.h>
>> +#include <asm/checksum.h>
>> +
>> +#define CSUM_NEON_THRESHOLD 128
>> +
>> +unsigned int do_csum(const unsigned char *buff, int len)
>> +{
>> +#ifdef CONFIG_KERNEL_MODE_NEON
>> +	if (len >= CSUM_NEON_THRESHOLD)
>> +		return do_csum_arm(buff, len);
>> +#endif  /* CONFIG_KERNEL_MODE_NEON */
>> +#undef do_csum
>> +	return 0;
>> +}
>> diff --git a/arch/arm64/lib/do_csum.c b/arch/arm64/lib/do_csum.c
>> new file mode 100644
>> index 0000000..893583f
>> --- /dev/null
>> +++ b/arch/arm64/lib/do_csum.c
>> @@ -0,0 +1,136 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Authors: Lingyan Huang <huanglingyan2@huawei.com>
>> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
>> + *
>> + * Generic C or neon implementation of do_csum operations.
>> + * Choose faster neon instructions when NEON is supported.
>> + *
>> + */
>> +
>> +#include <asm/neon.h>
>> +#include <asm/simd.h>
> Unless you call may_use_simd() (see below) I can't see why you need to
> include <asm/simd.h>.

Yeah, may_use_simd() should be called.

>> +#include <asm/checksum.h>
>> +#include <asm/byteorder.h>
>> +#include <asm/neon-intrinsics.h>
>> +
>> +#define CSUM_NEON_THRESHOLD 128
>> +#ifdef CONFIG_KERNEL_MODE_NEON
>> +static inline u32 from64to16(u64 x)
>> +{
>> +	/* add up 32-bit and 32-bit for 32+c bit */
>> +	x = (x & 0xffffffff) + (x >> 32);
>> +	/* add up carry.. */
>> +	x = (x & 0xffffffff) + (x >> 32);
>> +	/* add up 16-bit and 16-bit for 16+c bit */
>> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
>> +	/* add up carry.. */
>> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
>> +	return x;
>> +}
>> +
>> +unsigned int do_csum_neon(const unsigned char *buff, int len)
>> +{
>> +	unsigned int odd, count;
>> +	uint64_t result = 0;
>> +	unsigned int count64;
>> +	uint32x4_t vzero = (uint32x4_t){0, 0, 0, 0};
>> +
>> +	register uint32x4_t v0, v1, v2, v3;
> Is "register" needed here?  Is there any impact on performance?
>
> Usually it's best to leave register allocation decisions up to the
> compiler.

OK

>
>> +
>> +	if (unlikely(len <= 0))
>> +		return result;
>> +
>> +	odd = 1 & (unsigned long)buff;
>> +	if (unlikely(odd)) {
>> +		result = *buff;
> The generic code has a shift here for the little-endian case.  Why don't
> we need that here?

I wrote this code with reference to the x86 version, which does not distinguish
endianness. Maybe the little-endian case should be handled differently from big-endian.

>
>> +		len--;
>> +		buff++;
>> +	}
>> +
>> +	count = len >> 1;
>> +	if (count) {
>> +		if (2 & (unsigned long)buff) {
>> +			result += *(unsigned short *)buff;
>> +			count--;
>> +			len -= 2;
>> +			buff += 2;
>> +		}
>> +		count >>= 1;            /* nr of 32-bit words.. */
>> +		if (count) {
>> +			if (4 & (unsigned long)buff) {
>> +				result += *(unsigned int *)buff;
>> +				count--;
>> +				len -= 4;
>> +				buff += 4;
>> +			}
>> +			count >>= 1;    /* nr of 64-bit words.. */
>> +
>> +			v0 = vzero;
>> +			v1 = vzero;
>> +			v2 = vzero;
>> +			v3 = vzero;
>> +
>> +			count64 = count >> 3;  /* compute 64 Byte circle */
>> +			while (count64) {
>> +				v0 = vpadalq_u16(v0,
>> +					vld1q_u16((uint16_t *)buff + 0));
> Can this loop iterate more than 65536 times?  If it can, it looks like
> we can overflow.
>
> (I think the initial value of len would have to be > 0x400000 in order
> for this to happen.)

Yes, you are right. I didn't consider such long lengths. I wonder
whether there are any application scenarios that use such long lengths. Besides,
the x86 do_csum() in csum-partial_64.c does not consider
that situation either.

>> +				v1 = vpadalq_u16(v1,
>> +					vld1q_u16((uint16_t *)buff + 8));
>> +				v2 = vpadalq_u16(v2,
>> +					vld1q_u16((uint16_t *)buff + 16));
>> +				v3 = vpadalq_u16(v3,
>> +					vld1q_u16((uint16_t *)buff + 24));
>> +				buff += 64;
>> +				count64--;
>> +			}
>> +			v0 = vaddq_u32(v0, v1);
>> +			v2 = vaddq_u32(v2, v3);
>> +			v0 = vaddq_u32(v0, v2);
> Can't we defer the folding down until later?  We could just accumulate
> the next 16 bytes' result into v0, and do the folding all in one go
> later on.

Yes, we can. The 16-byte loop doesn't need v1-v3, so folding down now or later
seems to make no difference.

>
>> +
>> +			count %= 8;
>> +			while (count >= 2) { /* compute 16 byte circle */
>> +				v0 = vpadalq_u16(v0,
>> +					vld1q_u16((uint16_t *)buff + 0));
>> +				buff += 16;
>> +				count -= 2;
>> +			}
>> +
>> +			result += vgetq_lane_u32(v0, 0);
>> +			result += vgetq_lane_u32(v0, 1);
>> +			result += vgetq_lane_u32(v0, 2);
>> +			result += vgetq_lane_u32(v0, 3);
>> +			if (count & 1) {
>> +				result += *(unsigned long long *)buff;
>> +				buff += 8;
>> +			}
>> +			if (len & 4) {
>> +				result += *(unsigned int *)buff;
>> +				buff += 4;
>> +			}
>> +		}
>> +		if (len & 2) {
>> +			result += *(unsigned short *)buff;
>> +			buff += 2;
>> +		}
>> +	}
>> +	if (len & 1)
>> +		result += *buff;
> What about the little-endian case?
>
>> +	result = from64to16(result);
>> +	if (odd)
>> +		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
>> +	return result;
>> +}
>> +#endif
>> +
>> +
>> +unsigned int do_csum_arm(const unsigned char *buff, int len)
>> +{
>> +	unsigned int res;
>> +
>> +	kernel_neon_begin();
>> +	res = do_csum_neon(buff, len);
> If len can be large, you should split into smaller blocks, with
> kernel_neon_end()/_begin() between to provide a change to preempt if len
> is large.  (Splitting into smaller blocks may also help avoid overflow.)
>
> Ard may have may able to suggest how often this should be done.

That sounds like a good solution for large lengths.

> Also, there is no guarantee that you can use NEON in softirq context,
> because the kernel may already have been using NEON when the softirq
> fired.  If this code may get called from softirq context, then you would
> need something along the lines of
>
> 	if (may_use_simd()) {
> 		kernel_neon_begin();
> 		/* NEON accelerated code */
> 		kernel_neon_end();
> 	} else {
> 		/* Fallback C code */
> 	}
>
> [...]

Yes, you are right.

>
> Cheers
> ---Dave
>
> .
>
huanglingyan (A) Jan. 10, 2019, 8:14 a.m. UTC | #7
On 2019/1/10 12:08, 胡海 wrote:
> hi, huanglingyan
>
> Have you verified the correctness of do_csum?
> I validated it on FT2000+ and do_csum got the wrong result.
>
> You can test it by increasing the input length to more than 60000 and
> setting all buffs to 0xff.
> I think it's possible that vpadalq_u16 will overflow, and the code
> doesn't take this into account.
>

Yes, I actually didn't consider such large lengths. I wonder whether there are
application scenarios that use such large lengths. In any case,
this should be fixed in the next version.

---Lingyan Huang
.
Dave Martin Jan. 10, 2019, 1:53 p.m. UTC | #8
On Thu, Jan 10, 2019 at 04:03:59PM +0800, huanglingyan (A) wrote:
> 
> On 2019/1/9 22:58, Dave Martin wrote:
> > On Sun, Jan 06, 2019 at 09:55:29AM +0800, Lingyan Huang wrote:
> >> Function do_csum() in lib/checksum.c is used to compute checksum,
> >> which is turned out to be slowly and costs a lot of resources.
> >> Let's use neon instructions to accelerate the checksum computation
> >> for arm64.
> >>
> >> ------
> >> V2 ==> V3:
> >>     only modify the arm64 codes instead of modifying headers
> >>     under asm-generic and code in lib/checksum.c.
> >> ------
> >> ------
> >> V1 ==> V2:
> >>     Change NEON assembly code to NEON intrinsic code which is built
> >>     on top of arm_neon.h to avoid dropping into assembly.
> >> ------
> >>
> >> Here is the comparison results of function ip_compute_csum() between
> >> general do_csum() and neon instruction do_csum(). The test platform 
> >> is HUAWEI 1620 server with TAISHAN cores.
> >>
> >>         len(1000cycle)      general(ns)        do_csum_neon(ns)
> >>           64B:                58060                 59460
> >>          128B:                82930                 83930
> >>          256B:               132480                 73570
> >>          512B:               230100                 86230
> >>         1024B:               426600                 98200
> > For testing purposes, you would need to cover all possible aligments of
> > buff and all values of (len % 16), as well as testing on big- and
> > little-endian.
> >
> > Otherwise, a lot of code for handling edge cases won't be tested.
> >
> >
> > Also, it would be interesting to know which sizes are actually most
> > common at runtime, to avoid falling into the trap of optimising for
> > rare cases at the expense of the more common cases.
> >
> > You could try adding some instrumentation to collect some statistics on
> > this.
> Different alignements of buff and different lengths should be
> showing here. The test-report will be more detailed as you said.
> 
> The main scene of do_csum is the Internet package checksum as I know.
> The package length is usually 64 Byte - MTU(1500 Byte default).

Sure, but it is used for some other things too.  So we should avoid
assuming that it is used for buffers strictly no larger than a network
packet unless there is clear evidence of that.

> My test platform is little-endian based. I will try my best to find a
> big-endian test platform.

That would be good.

You could also get some information on which sizes are commonest
by some trick like

#include <linux/atomic.h>
#include <linux/log2.h>

#define NR_COUNTERS 17

static atomic_t counters[NR_COUNTERS];

do_csum(..., int len)
{
	/* ... */
	atomic_inc(&counters[min(ilog2(len), NR_COUNTERS - 1)]);
	/* ... */
}

And then print out the counts periodically or expose them via debugfs.
(Probably not useful to upstream that code, but it could provide some
interesting statistics.)

[...]

> >> diff --git a/arch/arm64/lib/do_csum.c b/arch/arm64/lib/do_csum.c
> >> new file mode 100644
> >> index 0000000..893583f
> >> --- /dev/null
> >> +++ b/arch/arm64/lib/do_csum.c
> >> @@ -0,0 +1,136 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Authors: Lingyan Huang <huanglingyan2@huawei.com>
> >> + * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
> >> + *
> >> + * Generic C or neon implementation of do_csum operations.
> >> + * Choose faster neon instructions when NEON is supported.
> >> + *
> >> + */
> >> +
> >> +#include <asm/neon.h>
> >> +#include <asm/simd.h>
> > Unless you call may_use_simd() (see below) I can't see why you need to
> > include <asm/simd.h>.
> 
> Yeah, may_use_simd() should be called.
> 
> >> +#include <asm/checksum.h>
> >> +#include <asm/byteorder.h>
> >> +#include <asm/neon-intrinsics.h>
> >> +
> >> +#define CSUM_NEON_THRESHOLD 128
> >> +#ifdef CONFIG_KERNEL_MODE_NEON
> >> +static inline u32 from64to16(u64 x)
> >> +{
> >> +	/* add up 32-bit and 32-bit for 32+c bit */
> >> +	x = (x & 0xffffffff) + (x >> 32);
> >> +	/* add up carry.. */
> >> +	x = (x & 0xffffffff) + (x >> 32);
> >> +	/* add up 16-bit and 16-bit for 16+c bit */
> >> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
> >> +	/* add up carry.. */
> >> +	x = ((u32)x & 0xffff) + ((u32)x >> 16);
> >> +	return x;
> >> +}
> >> +
> >> +unsigned int do_csum_neon(const unsigned char *buff, int len)
> >> +{
> >> +	unsigned int odd, count;
> >> +	uint64_t result = 0;
> >> +	unsigned int count64;
> >> +	uint32x4_t vzero = (uint32x4_t){0, 0, 0, 0};
> >> +
> >> +	register uint32x4_t v0, v1, v2, v3;
> > Is "register" needed here?  Is there any impact on performance?
> >
> > Usually it's best to leave register allocation decisions up to the
> > compiler.
> 
> OK
> 
> >
> >> +
> >> +	if (unlikely(len <= 0))
> >> +		return result;
> >> +
> >> +	odd = 1 & (unsigned long)buff;
> >> +	if (unlikely(odd)) {
> >> +		result = *buff;
> > The generic code has a shift here for the little-endian case.  Why don't
> > we need that here?
> 
> I write this code in reference of arch x86 which not distinguish endianess.
> Maybe little-endian should be different from big-endian.

Ah, I see.  I was assuming you adapted the code from the generic version
in lib/checksum.c (which does handle endianness).  I suggest you take a
look there.

There is no issue on x86, since x86 is little-endian only.

> 
> >
> >> +		len--;
> >> +		buff++;
> >> +	}
> >> +
> >> +	count = len >> 1;
> >> +	if (count) {
> >> +		if (2 & (unsigned long)buff) {
> >> +			result += *(unsigned short *)buff;
> >> +			count--;
> >> +			len -= 2;
> >> +			buff += 2;
> >> +		}
> >> +		count >>= 1;            /* nr of 32-bit words.. */
> >> +		if (count) {
> >> +			if (4 & (unsigned long)buff) {
> >> +				result += *(unsigned int *)buff;
> >> +				count--;
> >> +				len -= 4;
> >> +				buff += 4;
> >> +			}
> >> +			count >>= 1;    /* nr of 64-bit words.. */
> >> +
> >> +			v0 = vzero;
> >> +			v1 = vzero;
> >> +			v2 = vzero;
> >> +			v3 = vzero;
> >> +
> >> +			count64 = count >> 3;  /* compute 64 Byte circle */
> >> +			while (count64) {
> >> +				v0 = vpadalq_u16(v0,
> >> +					vld1q_u16((uint16_t *)buff + 0));
> > Can this loop iterate more than 65536 times?  If it can, it looks like
> > we can overflow.
> >
> > (I think the initial value of len would have to be > 0x400000 in order
> > for this to happen.)
> 
> Yes, you are right. I didn't consider the situation of such long length. I wonder
> if there are any application scenarios that have such long length. Besides,
> arch x86 do_csum() in file csum-partial_64.c is not considered of
> that situation either.

If we can find more than 1 other arch (say) that can only work with
limited buffer sizes, then that gives us confidence that this isn't
simply a bug or design flaw that has been pasted from one place to
another.

Using this checksum algorithm for data much larger than a page or so
would be a bit surprising, but it's difficult to prove this doesn't
happen without auditing all the callers...

One option would be to propose an official limit on len, by adding an
RFC patch to your series that adds a suitable comment to
include/asm-generic/checksum.h to document the limit.  If you do that,
you should CC the relevant maintainers and linux-arch (since most
arches have their own versions).

> 
> >> +				v1 = vpadalq_u16(v1,
> >> +					vld1q_u16((uint16_t *)buff + 8));
> >> +				v2 = vpadalq_u16(v2,
> >> +					vld1q_u16((uint16_t *)buff + 16));
> >> +				v3 = vpadalq_u16(v3,
> >> +					vld1q_u16((uint16_t *)buff + 24));
> >> +				buff += 64;
> >> +				count64--;
> >> +			}
> >> +			v0 = vaddq_u32(v0, v1);
> >> +			v2 = vaddq_u32(v2, v3);
> >> +			v0 = vaddq_u32(v0, v2);
> > Can't we defer the folding down until later?  We could just accumulate
> > the next 16 bytes' result into v0, and do the folding all in one go
> > later on.
> 
> Yes, we can. Since 16 bytes' computation doesn't need v1-v3. Folding
> down now or later seems to have no diffenence.

OK, that should reduce code duplication a little.

> >> +
> >> +			count %= 8;
> >> +			while (count >= 2) { /* compute 16 byte circle */
> >> +				v0 = vpadalq_u16(v0,
> >> +					vld1q_u16((uint16_t *)buff + 0));
> >> +				buff += 16;
> >> +				count -= 2;
> >> +			}
> >> +
> >> +			result += vgetq_lane_u32(v0, 0);
> >> +			result += vgetq_lane_u32(v0, 1);
> >> +			result += vgetq_lane_u32(v0, 2);
> >> +			result += vgetq_lane_u32(v0, 3);
> >> +			if (count & 1) {
> >> +				result += *(unsigned long long *)buff;
> >> +				buff += 8;
> >> +			}
> >> +			if (len & 4) {
> >> +				result += *(unsigned int *)buff;
> >> +				buff += 4;
> >> +			}
> >> +		}
> >> +		if (len & 2) {
> >> +			result += *(unsigned short *)buff;
> >> +			buff += 2;
> >> +		}
> >> +	}
> >> +	if (len & 1)
> >> +		result += *buff;
> > What about the little-endian case?

(See lib/checksum.c for this too.)

> >
> >> +	result = from64to16(result);
> >> +	if (odd)
> >> +		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
> >> +	return result;
> >> +}
> >> +#endif
> >> +
> >> +
> >> +unsigned int do_csum_arm(const unsigned char *buff, int len)
> >> +{
> >> +	unsigned int res;
> >> +
> >> +	kernel_neon_begin();
> >> +	res = do_csum_neon(buff, len);
> > If len can be large, you should split into smaller blocks, with
> > kernel_neon_end()/_begin() between to provide a change to preempt if len
> > is large.  (Splitting into smaller blocks may also help avoid overflow.)
> >
> > Ard may have may able to suggest how often this should be done.
> 
> Good solutions for large length.
> 
> > Also, there is no guarantee that you can use NEON in softirq context,
> > because the kernel may already have been using NEON when the softirq
> > fired.  If this code may get called from softirq context, then you would
> > need something along the lines of
> >
> > 	if (may_use_simd()) {
> > 		kernel_neon_begin();
> > 		/* NEON accelerated code */
> > 		kernel_neon_end();
> > 	} else {
> > 		/* Fallback C code */
> > 	}
> >
> > [...]

[...]

Cheers
---Dave
Will Deacon Jan. 16, 2019, 4:46 p.m. UTC | #9
On Wed, Jan 09, 2019 at 10:03:05AM +0800, huanglingyan (A) wrote:
> 
> On 2019/1/8 21:54, Will Deacon wrote:
> > [re-adding Ard and LAKML -- not sure why the headers are so munged]
> >
> > On Mon, Jan 07, 2019 at 10:38:55AM +0800, huanglingyan (A) wrote:
> >> On 2019/1/6 16:26, Ard Biesheuvel wrote:
> >>     Please change this into
> >>
> >>     if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
> >>         len >= CSUM_NEON_THRESHOLD &&
> >>         may_use_simd()) {
> >>             kernel_neon_begin();
> >>             res = do_csum_neon(buff, len);
> >>             kernel_neon_end();
> >>         }
> >>
> >>     and drop the intermediate do_csum_arm()
> >>
> >>
> >>         +               return do_csum_arm(buff, len);
> >>         +#endif  /* CONFIG_KERNEL_MODE_NEON */
> >>
> >>     No else? What happens if len < CSUM_NEON_THRESHOLD ?
> >>
> >>
> >>         +#undef do_csum
> >>
> >>     Can we drop this?
> >>
> >> Using NEON instructions will bring some costs. The spending maybe introduced
> >> when reservering/restoring
> >> neon registers with kernel_neon_begin()/kernel_neon_end(). Therefore NEON code
> >> is Only used when
> >> the length exceeds CSUM_NEON_THRESHOLD. General do csum() codes in lib/
> >> checksum.c will be used in
> >> shorter length. To achieve this goal, I use the "#undef do_csum" in else clause
> >> to have the oppotunity to
> >> utilize the general codes.
> > I don't think that's how it works :/
> >
> > Before we get deeper into the implementation, please could you justify the
> > need for a CPU-optimised checksum implementation at all? I thought this was
> > usually offloaded to the NIC?
> >
> > Will
> >
> > .
> This problem is introduced when testing Intel x710 network card on my ARM server.
> Ip forward is set for ease of testing. Then send lots of packages to server by Tesgine
> machine and then receive.

In the marketing blurb, that card boasts:

  `Tx/Rx IP, SCTP, TCP, and UDP checksum offloading (IPv4, IPv6) capabilities'

so we shouldn't need to run this on the CPU. Again, I'm not keen to optimise
this given that it /really/ shouldn't be used on arm64 machines that care
about network performance.

Will
huanglingyan (A) Jan. 18, 2019, 1:07 a.m. UTC | #10
On 2019/1/17 0:46, Will Deacon wrote:
> On Wed, Jan 09, 2019 at 10:03:05AM +0800, huanglingyan (A) wrote:
>> On 2019/1/8 21:54, Will Deacon wrote:
>>> [re-adding Ard and LAKML -- not sure why the headers are so munged]
>>>
>>> On Mon, Jan 07, 2019 at 10:38:55AM +0800, huanglingyan (A) wrote:
>>>> On 2019/1/6 16:26, Ard Biesheuvel wrote:
>>>>     Please change this into
>>>>
>>>>     if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
>>>>         len >= CSUM_NEON_THRESHOLD &&
>>>>         may_use_simd()) {
>>>>             kernel_neon_begin();
>>>>             res = do_csum_neon(buff, len);
>>>>             kernel_neon_end();
>>>>         }
>>>>
>>>>     and drop the intermediate do_csum_arm()
>>>>
>>>>
>>>>         +               return do_csum_arm(buff, len);
>>>>         +#endif  /* CONFIG_KERNEL_MODE_NEON */
>>>>
>>>>     No else? What happens if len < CSUM_NEON_THRESHOLD ?
>>>>
>>>>
>>>>         +#undef do_csum
>>>>
>>>>     Can we drop this?
>>>>
>>>> Using NEON instructions will bring some costs. The spending maybe introduced
>>>> when reservering/restoring
>>>> neon registers with kernel_neon_begin()/kernel_neon_end(). Therefore NEON code
>>>> is Only used when
>>>> the length exceeds CSUM_NEON_THRESHOLD. General do csum() codes in lib/
>>>> checksum.c will be used in
>>>> shorter length. To achieve this goal, I use the "#undef do_csum" in else clause
>>>> to have the oppotunity to
>>>> utilize the general codes.
>>> I don't think that's how it works :/
>>>
>>> Before we get deeper into the implementation, please could you justify the
>>> need for a CPU-optimised checksum implementation at all? I thought this was
>>> usually offloaded to the NIC?
>>>
>>> Will
>>>
>>> .
>> This problem is introduced when testing Intel x710 network card on my ARM server.
>> Ip forward is set for ease of testing. Then send lots of packages to server by Tesgine
>> machine and then receive.
> In the marketing blurb, that card boasts:
>
>   `Tx/Rx IP, SCTP, TCP, and UDP checksum offloading (IPv4, IPv6) capabilities'
>
> so we shouldn't need to run this on the CPU. Again, I'm not keen to optimise
> this given that it /really/ shouldn't be used on arm64 machines that care
> about network performance.
>
> Will
>
> .
Yeah, you are right. Checksumming is usually done in the network card, as I have
been told by someone familiar with NICs. However, the software path may still be
used in testing scenarios and with some basic network cards. I think there is no
harm in optimizing this code, given that other architectures have their own
optimized versions.
>
Ard Biesheuvel Jan. 18, 2019, 11:14 a.m. UTC | #11
On Fri, 18 Jan 2019 at 02:07, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
>
>
> On 2019/1/17 0:46, Will Deacon wrote:
> > On Wed, Jan 09, 2019 at 10:03:05AM +0800, huanglingyan (A) wrote:
> >> On 2019/1/8 21:54, Will Deacon wrote:
> >>> [re-adding Ard and LAKML -- not sure why the headers are so munged]
> >>>
> >>> On Mon, Jan 07, 2019 at 10:38:55AM +0800, huanglingyan (A) wrote:
> >>>> On 2019/1/6 16:26, Ard Biesheuvel wrote:
> >>>>     Please change this into
> >>>>
> >>>>     if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
> >>>>         len >= CSUM_NEON_THRESHOLD &&
> >>>>         may_use_simd()) {
> >>>>             kernel_neon_begin();
> >>>>             res = do_csum_neon(buff, len);
> >>>>             kernel_neon_end();
> >>>>         }
> >>>>
> >>>>     and drop the intermediate do_csum_arm()
> >>>>
> >>>>
> >>>>         +               return do_csum_arm(buff, len);
> >>>>         +#endif  /* CONFIG_KERNEL_MODE_NEON */
> >>>>
> >>>>     No else? What happens if len < CSUM_NEON_THRESHOLD ?
> >>>>
> >>>>
> >>>>         +#undef do_csum
> >>>>
> >>>>     Can we drop this?
> >>>>
> >>>> Using NEON instructions will bring some costs. The spending maybe introduced
> >>>> when reservering/restoring
> >>>> neon registers with kernel_neon_begin()/kernel_neon_end(). Therefore NEON code
> >>>> is Only used when
> >>>> the length exceeds CSUM_NEON_THRESHOLD. General do csum() codes in lib/
> >>>> checksum.c will be used in
> >>>> shorter length. To achieve this goal, I use the "#undef do_csum" in else clause
> >>>> to have the oppotunity to
> >>>> utilize the general codes.
> >>> I don't think that's how it works :/
> >>>
> >>> Before we get deeper into the implementation, please could you justify the
> >>> need for a CPU-optimised checksum implementation at all? I thought this was
> >>> usually offloaded to the NIC?
> >>>
> >>> Will
> >>>
> >>> .
> >> This problem is introduced when testing Intel x710 network card on my ARM server.
> >> Ip forward is set for ease of testing. Then send lots of packages to server by Tesgine
> >> machine and then receive.
> > In the marketing blurb, that card boasts:
> >
> >   `Tx/Rx IP, SCTP, TCP, and UDP checksum offloading (IPv4, IPv6) capabilities'
> >
> > so we shouldn't need to run this on the CPU. Again, I'm not keen to optimise
> > this given that it /really/ shouldn't be used on arm64 machines that care
> > about network performance.
> >
> > Will
> >
> > .
> Yeah, you are right. Checksum is usually done in network card which is told by
> someone familiar with NIC. However, it may be used in testing scenaries and
> some primary network cards. I think it's no harm to optimize this code while
> other ARCHs have their own optimized versions.
> >
>

I disagree. If this code path is never exercised, we should not
include it. We can revisit this decision when there is a use case
where the checksumming performance is an actual bottleneck.
huanglingyan (A) Feb. 12, 2019, 2:26 a.m. UTC | #12
On 2019/1/18 19:14, Ard Biesheuvel wrote:
> On Fri, 18 Jan 2019 at 02:07, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
>> [...]
>> Yeah, you are right. Checksumming is usually done in the network card, as I was
>> told by someone familiar with NICs. However, software checksumming may still be
>> needed in testing scenarios and with some basic network cards. I think there is
>> no harm in optimizing this code while other architectures have their own
>> optimized versions.
> I disagree. If this code path is never exercised, we should not
> include it. We can revisit this decision when there is a use case
> where the checksumming performance is an actual bottleneck.
>
> .
Mainstream network cards have an option to switch the checksum mode.
Users can choose whether the checksum is calculated by hardware or by software:

        ethtool -K eth0 rx-checksum off
        ethtool -K eth0 tx-checksum-ip-generic off

What's more, some network features may prevent the hardware checksum from
being used, like GSO (not so sure). This means the software checksum still
has a reason to exist.

.
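
For reference, the offload state of an interface can also be inspected before
toggling anything; a minimal example, assuming an interface named eth0 (shown
as a general illustration, not necessarily the exact setup used in this thread):

        # list the relevant offload features and whether they are active
        ethtool -k eth0 | grep -i checksum
        ethtool -k eth0 | grep -i offload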
Ard Biesheuvel Feb. 12, 2019, 7:07 a.m. UTC | #13
On Tue, 12 Feb 2019 at 03:25, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
> [...]
> Mainstream network cards have an option to switch the checksum mode.
> Users can choose whether the checksum is calculated by hardware or by software:
>
>         ethtool -K eth0 rx-checksum off
>         ethtool -K eth0 tx-checksum-ip-generic off
>
> What's more, some network features may prevent the hardware checksum from
> being used, like GSO (not so sure). This means the software checksum still
> has a reason to exist.
>

This does not make any sense to me. Segmentation offload relies on the
hardware generating the actual packets, and I don't see how it would
be able to do that if it cannot generate the checksum as well.
huanglingyan (A) Feb. 13, 2019, 8:42 a.m. UTC | #14
On 2019/2/12 15:07, Ard Biesheuvel wrote:
> On Tue, 12 Feb 2019 at 03:25, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
>> [...]
>> What's more, some network features may prevent the hardware checksum from
>> being used, like GSO (not so sure). This means the software checksum still
>> has a reason to exist.
>>
> This does not make any sense to me. Segmentation offload relies on the
> hardware generating the actual packets, and I don't see how it would
> be able to do that if it cannot generate the checksum as well.
I tested an IP-forwarding scenario on my platform. The network card has checksum
capability. The hardware does the checksum when the GRO feature is off; however,
the checksum is done by software when GRO is on. In that case, the do_csum()
function accounts for 60% of the CPU load and performance drops by 20% because
of the software checksum.

The command I use is
        ethtool -K eth0 gro off

> .
>
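
A typical way to confirm that do_csum() dominates in such a run (an assumed
workflow, not necessarily the exact commands used above) is to profile the
forwarding test with perf:

        # sample all CPUs with call graphs while the traffic is running
        perf record -a -g -- sleep 10
        # then look for do_csum in the report
        perf report --stdio | grep do_csum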
Ard Biesheuvel Feb. 13, 2019, 9:15 a.m. UTC | #15
On Wed, 13 Feb 2019 at 09:42, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
>
>
> On 2019/2/12 15:07, Ard Biesheuvel wrote:
> > [...]
> > This does not make any sense to me. Segmentation offload relies on the
> > hardware generating the actual packets, and I don't see how it would
> > be able to do that if it cannot generate the checksum as well.
> I tested an IP-forwarding scenario on my platform. The network card has checksum
> capability. The hardware does the checksum when the GRO feature is off; however,
> the checksum is done by software when GRO is on. In that case, the do_csum()
> function accounts for 60% of the CPU load and performance drops by 20% because
> of the software checksum.
>
> The command I use is
>         ethtool -K eth0 gro off
>

But this is about IP forwarding, right? So GRO is enabled, which means
the packets are combined at the rx side. So does this mean the kernel
always recalculates the checksum in software in this case? Or only for
forwarded packets, where I would expect the outgoing interface to
recalculate the checksum if TX checksum offload is enabled.
Ard Biesheuvel Feb. 13, 2019, 5:55 p.m. UTC | #16
(+ Ilias)

On Wed, 13 Feb 2019 at 10:15, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> [...]
> But this is about IP forwarding, right? So GRO is enabled, which means
> the packets are combined at the rx side. So does this mean the kernel
> always recalculates the checksum in software in this case? Or only for
> forwarded packets, where I would expect the outgoing interface to
> recalculate the checksum if TX checksum offload is enabled.

OK, after digging into this a bit more (with the help of Ilias -
thanks!), I agree that there may be cases where we still rely on
software IP checksumming even when using offload capable hardware. So
I also agree that it makes sense to provide an optimized
implementation for arm64.

However, I am not yet convinced that a SIMD implementation is worth
the hassle. I did some background reading [0] and came up with a
scalar arm64 assembler implementation [1] that is almost as fast on
Cortex-A57, and so I would like to get a feeling for how it performs
on other micro-architectures. (Do note that the code has not been
tested on big endian yet.)

Lingyan, could you please compare the scalar performance with the NEON
performance on your CPU? Thanks.
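
For readers following along: the idea behind such a scalar implementation is plain
64-bit ones' complement accumulation with an end-around carry, followed by a fold
down to 16 bits. A minimal C sketch of that idea (an illustration only, not the
assembler referenced in [1]; kernel u64/u32 types are assumed, along with a
little-endian kernel and an 8-byte aligned buffer whose length is a multiple of 8,
so it skips the head/tail handling the real code needs):

	/* accumulate 64-bit words with end-around carry, then fold to 16 bits */
	static u32 do_csum_scalar_sketch(const unsigned char *buff, int len)
	{
		u64 sum = 0, data;

		while (len >= 8) {
			data = *(const u64 *)buff;
			sum += data;
			if (sum < data)		/* carry out of bit 63 ... */
				sum++;		/* ... is added back in (end-around carry) */
			buff += 8;
			len -= 8;
		}

		/* fold 64 -> 32 -> 16 bits, adding the carries back in each step */
		sum = (sum & 0xffffffff) + (sum >> 32);
		sum = (sum & 0xffffffff) + (sum >> 32);
		sum = ((u32)sum & 0xffff) + ((u32)sum >> 16);
		sum = ((u32)sum & 0xffff) + ((u32)sum >> 16);
		return (u32)sum;
	}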
huanglingyan (A) Feb. 14, 2019, 9:57 a.m. UTC | #17
On 2019/2/14 1:55, Ard Biesheuvel wrote:
> [...]
> Lingyan, could you please compare the scalar performance with the NEON
> performance on your CPU? Thanks.
OK, I'll test it on my CPU. The experimental platform needs to be set up again.
I will inform you as soon as I get the results.
huanglingyan (A) Feb. 18, 2019, 8:49 a.m. UTC | #18
On 2019/2/14 17:57, huanglingyan (A) wrote:
> On 2019/2/14 1:55, Ard Biesheuvel wrote:
>> [...]
>> Lingyan, could you please compare the scalar performance with the NEON
>> performance on your CPU? Thanks.
> OK, I'll test it on my CPU. The experimental platform needs to be set up again.
> I will inform you as soon as I get the results.
Below are the results measured on my platform. The performance of your patch is really nice.
The 2nd column is the current generic do_csum() in Linux, the 3rd is your patch, the 4th is
the NEON implementation, and the last is the NEON implementation without
kernel_neon_begin()/kernel_neon_end().

 len(1000cycle)  general(ns)  csum_ard(ns)  csum_neon(ns)  csum_neon_no_kerbegin(ns)
    64B:               75690         40890          76710            57440
   256B:              171740         54050         109640            63730
  1023B:              553220        105930         155630            93520
  1024B:              554680        103500         148610            86890
  1500B:              793810        134540         164510           104590
  2048B:             1070880        167800         178700           119570
  4095B:             2091000        299140         249580           189740
  4096B:             2091610        296760         244310           183130

What needs to be analysed is that the data width of the NEON instructions is twice
that of the general-purpose registers, yet the performance is not. The
kernel_neon_begin()/kernel_neon_end() pair seems to cost a lot. Other reasons may
include an overly complex implementation due to my lack of experience.


>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>
> .
>
Ard Biesheuvel Feb. 18, 2019, 9:03 a.m. UTC | #19
On Mon, 18 Feb 2019 at 09:49, huanglingyan (A) <huanglingyan2@huawei.com> wrote:
> [...]
> Below are the results measured on my platform. The performance of your patch is really nice.
> The 2nd column is the current generic do_csum() in Linux, the 3rd is your patch, the 4th is
> the NEON implementation, and the last is the NEON implementation without
> kernel_neon_begin()/kernel_neon_end().
>
>  len(1000cycle)  general(ns)  csum_ard(ns)  csum_neon(ns)  csum_neon_no_kerbegin(ns)
>     64B:               75690         40890          76710            57440
>    256B:              171740         54050         109640            63730
>   1023B:              553220        105930         155630            93520
>   1024B:              554680        103500         148610            86890
>   1500B:              793810        134540         164510           104590
>   2048B:             1070880        167800         178700           119570
>   4095B:             2091000        299140         249580           189740
>   4096B:             2091610        296760         244310           183130
>
> What needs to be analysed is that the data width of the NEON instructions is twice
> that of the general-purpose registers, yet the performance is not. The
> kernel_neon_begin()/kernel_neon_end() pair seems to cost a lot. Other reasons may
> include an overly complex implementation due to my lack of experience.
>

Thank you Lingyan, that is really helpful.

It is clear from these numbers that the overhead of using the SIMD
unit is not worth it for typical network packet sizes, so we should go
with a scalar implementation instead.

My implementation was transliterated from x86 assembly, so I am pretty
sure it is correct for little endian, but I haven't tested big endian
at all. I will try to find some time this week to test it properly,
and send it out as a patch.

Patch

diff --git a/arch/arm64/include/asm/checksum.h b/arch/arm64/include/asm/checksum.h
index 0b6f5a7..7acd713 100644
--- a/arch/arm64/include/asm/checksum.h
+++ b/arch/arm64/include/asm/checksum.h
@@ -26,6 +26,10 @@  static inline __sum16 csum_fold(__wsum csum)
 }
 #define csum_fold csum_fold
 
+#define do_csum do_csum
+unsigned int do_csum(const unsigned char *buff, int len);
+extern unsigned int do_csum_arm(const unsigned char *buff, int len);
+
 static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 {
 	__uint128_t tmp;
@@ -46,6 +50,7 @@  static inline __sum16 ip_fast_csum(const void *iph, unsigned int ihl)
 }
 #define ip_fast_csum ip_fast_csum
 
+
 #include <asm-generic/checksum.h>
 
 #endif	/* __ASM_CHECKSUM_H */
diff --git a/arch/arm64/lib/Makefile b/arch/arm64/lib/Makefile
index 5540a16..c0b5b8c 100644
--- a/arch/arm64/lib/Makefile
+++ b/arch/arm64/lib/Makefile
@@ -3,12 +3,12 @@  lib-y		:= clear_user.o delay.o copy_from_user.o		\
 		   copy_to_user.o copy_in_user.o copy_page.o		\
 		   clear_page.o memchr.o memcpy.o memmove.o memset.o	\
 		   memcmp.o strcmp.o strncmp.o strlen.o strnlen.o	\
-		   strchr.o strrchr.o tishift.o
+		   strchr.o strrchr.o tishift.o checksum.o
 
 ifeq ($(CONFIG_KERNEL_MODE_NEON), y)
-obj-$(CONFIG_XOR_BLOCKS)	+= xor-neon.o
-CFLAGS_REMOVE_xor-neon.o	+= -mgeneral-regs-only
-CFLAGS_xor-neon.o		+= -ffreestanding
+obj-y    += do_csum.o
+CFLAGS_REMOVE_do_csum.o	+= -mgeneral-regs-only
+CFLAGS_do_csum.o		+= -ffreestanding
 endif
 
 # Tell the compiler to treat all general purpose registers (with the
diff --git a/arch/arm64/lib/checksum.c b/arch/arm64/lib/checksum.c
new file mode 100644
index 0000000..15a31bb
--- /dev/null
+++ b/arch/arm64/lib/checksum.c
@@ -0,0 +1,26 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *
+ * Authors: Lingyan Huang <huanglingyan2@huawei.com>
+ * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
+ *
+ * Generic C or neon implementation of do_csum operations.
+ * Choose faster neon instructions when NEON is supported.
+ *
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/checksum.h>
+
+#define CSUM_NEON_THRESHOLD 128
+
+unsigned int do_csum(const unsigned char *buff, int len)
+{
+#ifdef CONFIG_KERNEL_MODE_NEON
+	if (len >= CSUM_NEON_THRESHOLD)
+		return do_csum_arm(buff, len);
+#endif  /* CONFIG_KERNEL_MODE_NEON */
+#undef do_csum
+	return 0;
+}
diff --git a/arch/arm64/lib/do_csum.c b/arch/arm64/lib/do_csum.c
new file mode 100644
index 0000000..893583f
--- /dev/null
+++ b/arch/arm64/lib/do_csum.c
@@ -0,0 +1,136 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Authors: Lingyan Huang <huanglingyan2@huawei.com>
+ * Copyright (C) 2018 Hisilicon, Inc. All Rights Reserved.
+ *
+ * Generic C or neon implementation of do_csum operations.
+ * Choose faster neon instructions when NEON is supported.
+ *
+ */
+
+#include <asm/neon.h>
+#include <asm/simd.h>
+#include <asm/checksum.h>
+#include <asm/byteorder.h>
+#include <asm/neon-intrinsics.h>
+
+#define CSUM_NEON_THRESHOLD 128
+#ifdef CONFIG_KERNEL_MODE_NEON
+static inline u32 from64to16(u64 x)
+{
+	/* add up 32-bit and 32-bit for 32+c bit */
+	x = (x & 0xffffffff) + (x >> 32);
+	/* add up carry.. */
+	x = (x & 0xffffffff) + (x >> 32);
+	/* add up 16-bit and 16-bit for 16+c bit */
+	x = ((u32)x & 0xffff) + ((u32)x >> 16);
+	/* add up carry.. */
+	x = ((u32)x & 0xffff) + ((u32)x >> 16);
+	return x;
+}
+
+unsigned int do_csum_neon(const unsigned char *buff, int len)
+{
+	unsigned int odd, count;
+	uint64_t result = 0;
+	unsigned int count64;
+	uint32x4_t vzero = (uint32x4_t){0, 0, 0, 0};
+
+	register uint32x4_t v0, v1, v2, v3;
+
+	if (unlikely(len <= 0))
+		return result;
+
+	odd = 1 & (unsigned long)buff;
+	if (unlikely(odd)) {
+		result = *buff;
+		len--;
+		buff++;
+	}
+
+	count = len >> 1;
+	if (count) {
+		if (2 & (unsigned long)buff) {
+			result += *(unsigned short *)buff;
+			count--;
+			len -= 2;
+			buff += 2;
+		}
+		count >>= 1;            /* nr of 32-bit words.. */
+		if (count) {
+			if (4 & (unsigned long)buff) {
+				result += *(unsigned int *)buff;
+				count--;
+				len -= 4;
+				buff += 4;
+			}
+			count >>= 1;    /* nr of 64-bit words.. */
+
+			v0 = vzero;
+			v1 = vzero;
+			v2 = vzero;
+			v3 = vzero;
+
+			count64 = count >> 3;  /* compute 64 Byte circle */
+			while (count64) {
+				v0 = vpadalq_u16(v0,
+					vld1q_u16((uint16_t *)buff + 0));
+				v1 = vpadalq_u16(v1,
+					vld1q_u16((uint16_t *)buff + 8));
+				v2 = vpadalq_u16(v2,
+					vld1q_u16((uint16_t *)buff + 16));
+				v3 = vpadalq_u16(v3,
+					vld1q_u16((uint16_t *)buff + 24));
+				buff += 64;
+				count64--;
+			}
+			v0 = vaddq_u32(v0, v1);
+			v2 = vaddq_u32(v2, v3);
+			v0 = vaddq_u32(v0, v2);
+
+			count %= 8;
+			while (count >= 2) { /* compute 16 byte circle */
+				v0 = vpadalq_u16(v0,
+					vld1q_u16((uint16_t *)buff + 0));
+				buff += 16;
+				count -= 2;
+			}
+
+			result += vgetq_lane_u32(v0, 0);
+			result += vgetq_lane_u32(v0, 1);
+			result += vgetq_lane_u32(v0, 2);
+			result += vgetq_lane_u32(v0, 3);
+			if (count & 1) {
+				result += *(unsigned long long *)buff;
+				buff += 8;
+			}
+			if (len & 4) {
+				result += *(unsigned int *)buff;
+				buff += 4;
+			}
+		}
+		if (len & 2) {
+			result += *(unsigned short *)buff;
+			buff += 2;
+		}
+	}
+	if (len & 1)
+		result += *buff;
+	result = from64to16(result);
+	if (odd)
+		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);
+	return result;
+}
+#endif
+
+
+unsigned int do_csum_arm(const unsigned char *buff, int len)
+{
+	unsigned int res;
+
+	kernel_neon_begin();
+	res = do_csum_neon(buff, len);
+	kernel_neon_end();
+	return res;
+}
+
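
For comparison with the checksum.c hunk above, here is a minimal sketch of the
dispatch shape suggested in the review thread (an illustration, not the posted
patch: do_csum_generic() is a hypothetical name standing in for a portable
fallback such as the code in lib/checksum.c, and the usual <asm/neon.h> and
<asm/simd.h> includes are assumed):

	unsigned int do_csum(const unsigned char *buff, int len)
	{
		unsigned int res;

		if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
		    len >= CSUM_NEON_THRESHOLD &&
		    may_use_simd()) {
			kernel_neon_begin();
			res = do_csum_neon(buff, len);
			kernel_neon_end();
			return res;
		}

		/* short or NEON-unsafe case: fall back to the generic code */
		return do_csum_generic(buff, len);	/* hypothetical fallback */
	}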