[v5] arm64: kernel: implement fast refcount checking

Message ID	20190619105431.2630-1-ard.biesheuvel@linaro.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Ard Biesheuvel <ard.biesheuvel@linaro.org> To: linux-arm-kernel@lists.infradead.org Subject: [PATCH v5] arm64: kernel: implement fast refcount checking Date: Wed, 19 Jun 2019 12:54:31 +0200 Message-Id: <20190619105431.2630-1-ard.biesheuvel@linaro.org> Precedence: list Cc: Kees Cook <keescook@chromium.org>, Ard Biesheuvel <ard.biesheuvel@linaro.org>, Catalin Marinas <catalin.marinas@arm.com>, Jan Glauber <jglauber@marvell.com>, Will Deacon <will.deacon@arm.com>, Jayachandran Chandrasekharan Nair <jnair@marvell.com>, Hanjun Guo <guohanjun@huawei.com>, Linus Torvalds <torvalds@linux-foundation.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	[v5] arm64: kernel: implement fast refcount checking \| expand [v5] arm64: kernel: implement fast refcount checking

Ard Biesheuvel June 19, 2019, 10:54 a.m. UTC

This adds support to arm64 for fast refcount checking, as contributed
by Kees for x86 based on the implementation by grsecurity/PaX.

The general approach is identical: the existing atomic_t helpers are
cloned for refcount_t, with the arithmetic instruction modified to set
the PSTATE flags, and one or two branch instructions added that jump to
an out of line handler if overflow, decrement to zero or increment from
zero are detected.

One complication that we have to deal with on arm64 is the fact that
it has two atomics implementations: the original LL/SC implementation
using load/store exclusive loops, and the newer LSE one that does mostly
the same in a single instruction. So we need to clone some parts of
both for the refcount handlers, but we also need to deal with the way
LSE builds fall back to LL/SC at runtime if the hardware does not
support it.

As is the case with the x86 version, the performance gain is substantial
(ThunderX2 @ 2.2 GHz, using LSE), even though the arm64 implementation
incorporates an add-from-zero check as well:

perf stat -B -- echo ATOMIC_TIMING >/sys/kernel/debug/provoke-crash/DIRECT

      116252672661      cycles                    #    2.207 GHz

      52.689793525 seconds time elapsed

perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT

      127060259162      cycles                    #    2.207 GHz

      57.243690077 seconds time elapsed

For comparison, the numbers below were captured using CONFIG_REFCOUNT_FULL,
which uses the validation routines implemented in C using cmpxchg():

perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT

 Performance counter stats for 'cat /dev/fd/63':

      191057942484      cycles                    #    2.207 GHz

      86.568269904 seconds time elapsed

As a bonus, this code has been found to perform significantly better on
systems with many CPUs, due to the fact that it no longer relies on the
load/compare-and-swap combo performed in a tight loop, which is what we
emit for cmpxchg() on arm64.

Cc: Will Deacon <will.deacon@arm.com>
Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
Cc: Kees Cook <keescook@chromium.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Cc: Jan Glauber <jglauber@marvell.com>,
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Cc: Hanjun Guo <guohanjun@huawei.com>

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
v5: - rebase onto mainline
    - add ACQUIRE ordering semantics to [dec|sub]_and_test() on success
      (as per 47b8f3ab9c49)
    - introduce ARCH_HAS_REFCOUNT_FULL implying ARCH_HAS_REFCOUNT, but also
      hiding the REFCOUNT_FULL option from Kconfig entirely
    - update commit log with TX2 results

 arch/Kconfig                          | 12 ++-
 arch/arm64/Kconfig                    |  2 +-
 arch/arm64/include/asm/atomic.h       | 24 ++++++
 arch/arm64/include/asm/atomic_ll_sc.h | 50 ++++++++++++
 arch/arm64/include/asm/atomic_lse.h   | 81 ++++++++++++++++++++
 arch/arm64/include/asm/brk-imm.h      |  1 +
 arch/arm64/include/asm/refcount.h     | 60 +++++++++++++++
 arch/arm64/kernel/traps.c             | 37 +++++++++
 arch/arm64/lib/atomic_ll_sc.c         | 12 +++
 9 files changed, 277 insertions(+), 2 deletions(-)

Ard Biesheuvel June 19, 2019, 10:56 a.m. UTC | #1

(add Andrew, who has been looking at the atomics code as well)

On Wed, 19 Jun 2019 at 12:54, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> This adds support to arm64 for fast refcount checking, as contributed
> by Kees for x86 based on the implementation by grsecurity/PaX.
>
> The general approach is identical: the existing atomic_t helpers are
> cloned for refcount_t, with the arithmetic instruction modified to set
> the PSTATE flags, and one or two branch instructions added that jump to
> an out of line handler if overflow, decrement to zero or increment from
> zero are detected.
>
> One complication that we have to deal with on arm64 is the fact that
> it has two atomics implementations: the original LL/SC implementation
> using load/store exclusive loops, and the newer LSE one that does mostly
> the same in a single instruction. So we need to clone some parts of
> both for the refcount handlers, but we also need to deal with the way
> LSE builds fall back to LL/SC at runtime if the hardware does not
> support it.
>
> As is the case with the x86 version, the performance gain is substantial
> (ThunderX2 @ 2.2 GHz, using LSE), even though the arm64 implementation
> incorporates an add-from-zero check as well:
>
> perf stat -B -- echo ATOMIC_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
>
>       116252672661      cycles                    #    2.207 GHz
>
>       52.689793525 seconds time elapsed
>
> perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
>
>       127060259162      cycles                    #    2.207 GHz
>
>       57.243690077 seconds time elapsed
>
> For comparison, the numbers below were captured using CONFIG_REFCOUNT_FULL,
> which uses the validation routines implemented in C using cmpxchg():
>
> perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
>
>  Performance counter stats for 'cat /dev/fd/63':
>
>       191057942484      cycles                    #    2.207 GHz
>
>       86.568269904 seconds time elapsed
>
> As a bonus, this code has been found to perform significantly better on
> systems with many CPUs, due to the fact that it no longer relies on the
> load/compare-and-swap combo performed in a tight loop, which is what we
> emit for cmpxchg() on arm64.
>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>,
> Cc: Jan Glauber <jglauber@marvell.com>,
> Cc: Linus Torvalds <torvalds@linux-foundation.org>,
> Cc: Hanjun Guo <guohanjun@huawei.com>
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
> v5: - rebase onto mainline
>     - add ACQUIRE ordering semantics to [dec|sub]_and_test() on success
>       (as per 47b8f3ab9c49)
>     - introduce ARCH_HAS_REFCOUNT_FULL implying ARCH_HAS_REFCOUNT, but also
>       hiding the REFCOUNT_FULL option from Kconfig entirely
>     - update commit log with TX2 results
>
>  arch/Kconfig                          | 12 ++-
>  arch/arm64/Kconfig                    |  2 +-
>  arch/arm64/include/asm/atomic.h       | 24 ++++++
>  arch/arm64/include/asm/atomic_ll_sc.h | 50 ++++++++++++
>  arch/arm64/include/asm/atomic_lse.h   | 81 ++++++++++++++++++++
>  arch/arm64/include/asm/brk-imm.h      |  1 +
>  arch/arm64/include/asm/refcount.h     | 60 +++++++++++++++
>  arch/arm64/kernel/traps.c             | 37 +++++++++
>  arch/arm64/lib/atomic_ll_sc.c         | 12 +++
>  9 files changed, 277 insertions(+), 2 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index c47b328eada0..a19abb00fc4b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -876,6 +876,16 @@ config STRICT_MODULE_RWX
>  config ARCH_HAS_PHYS_TO_DMA
>         bool
>
> +config ARCH_HAS_REFCOUNT_FULL
> +       bool
> +       select ARCH_HAS_REFCOUNT
> +       help
> +         An architecture selects this when the optimized refcount_t
> +         implementation it provides covers all the cases that
> +         CONFIG_REFCOUNT_FULL covers as well, in which case it makes no
> +         sense to even offer CONFIG_REFCOUNT_FULL as a user selectable
> +         option.
> +
>  config ARCH_HAS_REFCOUNT
>         bool
>         help
> @@ -889,7 +899,7 @@ config ARCH_HAS_REFCOUNT
>           against bugs in reference counts.
>
>  config REFCOUNT_FULL
> -       bool "Perform full reference count validation at the expense of speed"
> +       bool "Perform full reference count validation at the expense of speed" if !ARCH_HAS_REFCOUNT_FULL
>         help
>           Enabling this switches the refcounting infrastructure from a fast
>           unchecked atomic_t implementation to a fully state checked
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 697ea0510729..fa0d02e111e2 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -26,6 +26,7 @@ config ARM64
>         select ARCH_HAS_MEMBARRIER_SYNC_CORE
>         select ARCH_HAS_PTE_SPECIAL
>         select ARCH_HAS_SETUP_DMA_OPS
> +       select ARCH_HAS_REFCOUNT_FULL
>         select ARCH_HAS_SET_MEMORY
>         select ARCH_HAS_STRICT_KERNEL_RWX
>         select ARCH_HAS_STRICT_MODULE_RWX
> @@ -173,7 +174,6 @@ config ARM64
>         select PCI_SYSCALL if PCI
>         select POWER_RESET
>         select POWER_SUPPLY
> -       select REFCOUNT_FULL
>         select SPARSE_IRQ
>         select SWIOTLB
>         select SYSCTL_EXCEPTION_TRACE
> diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
> index 1f4e9ee641c9..c2f7058395a2 100644
> --- a/arch/arm64/include/asm/atomic.h
> +++ b/arch/arm64/include/asm/atomic.h
> @@ -21,13 +21,37 @@
>  #define __ASM_ATOMIC_H
>
>  #include <linux/compiler.h>
> +#include <linux/stringify.h>
>  #include <linux/types.h>
>
>  #include <asm/barrier.h>
> +#include <asm/brk-imm.h>
>  #include <asm/lse.h>
>
>  #ifdef __KERNEL__
>
> +/*
> + * To avoid having to allocate registers that pass the counter address and
> + * address of the call site to the overflow handler, encode the register and
> + * call site offset in a dummy cbz instruction that we can decode later.
> + */
> +#define REFCOUNT_CHECK_TAIL                                            \
> +"      .subsection     1\n"                                            \
> +"33:   brk "           __stringify(REFCOUNT_BRK_IMM) "\n"              \
> +"      cbz             %[counter], 22b\n"      /* never reached */     \
> +"      .previous\n"
> +
> +#define REFCOUNT_POST_CHECK_NEG                                                \
> +"22:   b.mi            33f\n"                                          \
> +       REFCOUNT_CHECK_TAIL
> +
> +#define REFCOUNT_POST_CHECK_NEG_OR_ZERO                                        \
> +"      b.eq            33f\n"                                          \
> +       REFCOUNT_POST_CHECK_NEG
> +
> +#define REFCOUNT_PRE_CHECK_ZERO(reg)   "ccmp " #reg ", wzr, #8, pl\n"
> +#define REFCOUNT_PRE_CHECK_NONE(reg)
> +
>  #define __ARM64_IN_ATOMIC_IMPL
>
>  #if defined(CONFIG_ARM64_LSE_ATOMICS) && defined(CONFIG_AS_LSE)
> diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
> index e321293e0c89..4598575a73a2 100644
> --- a/arch/arm64/include/asm/atomic_ll_sc.h
> +++ b/arch/arm64/include/asm/atomic_ll_sc.h
> @@ -336,4 +336,54 @@ __CMPXCHG_DBL(_mb, dmb ish, l, "memory")
>
>  #undef __CMPXCHG_DBL
>
> +#define REFCOUNT_OP(op, asm_op, pre, post, l)                          \
> +__LL_SC_INLINE int                                                     \
> +__LL_SC_PREFIX(__refcount_##op(int i, atomic_t *r))                    \
> +{                                                                      \
> +       unsigned int tmp;                                               \
> +       int result;                                                     \
> +                                                                       \
> +       asm volatile("// refcount_" #op "\n"                            \
> +"      prfm            pstl1strm, %[cval]\n"                           \
> +"1:    ldxr            %w1, %[cval]\n"                                 \
> +"      " #asm_op "     %w[val], %w1, %w[i]\n"                          \
> +       REFCOUNT_PRE_CHECK_ ## pre (%w1)                                \
> +"      st" #l "xr      %w1, %w[val], %[cval]\n"                        \
> +"      cbnz            %w1, 1b\n"                                      \
> +       REFCOUNT_POST_CHECK_ ## post                                    \
> +       : [val] "=&r"(result), "=&r"(tmp), [cval] "+Q"(r->counter)      \
> +       : [counter] "r"(&r->counter), [i] "Ir" (i)                      \
> +       : "cc");                                                        \
> +                                                                       \
> +       return result;                                                  \
> +}                                                                      \
> +__LL_SC_EXPORT(__refcount_##op);
> +
> +REFCOUNT_OP(add_lt, adds, ZERO, NEG_OR_ZERO,  );
> +REFCOUNT_OP(sub_lt, subs, NONE, NEG,         l);
> +REFCOUNT_OP(sub_le, subs, NONE, NEG_OR_ZERO, l);
> +
> +__LL_SC_INLINE int
> +__LL_SC_PREFIX(__refcount_add_not_zero(int i, atomic_t *r))
> +{
> +       unsigned int tmp;
> +       int result;
> +
> +       asm volatile("// refcount_add_not_zero\n"
> +"      prfm    pstl1strm, %[cval]\n"
> +"1:    ldxr    %w[val], %[cval]\n"
> +"      cbz     %w[val], 2f\n"
> +"      adds    %w[val], %w[val], %w[i]\n"
> +"      stxr    %w1, %w[val], %[cval]\n"
> +"      cbnz    %w1, 1b\n"
> +       REFCOUNT_POST_CHECK_NEG
> +"2:"
> +       : [val] "=&r" (result), "=&r" (tmp), [cval] "+Q" (r->counter)
> +       : [counter] "r"(&r->counter), [i] "Ir" (i)
> +       : "cc");
> +
> +       return result;
> +}
> +__LL_SC_EXPORT(__refcount_add_not_zero);
> +
>  #endif /* __ASM_ATOMIC_LL_SC_H */
> diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
> index 9256a3921e4b..99637c0b0a92 100644
> --- a/arch/arm64/include/asm/atomic_lse.h
> +++ b/arch/arm64/include/asm/atomic_lse.h
> @@ -531,4 +531,85 @@ __CMPXCHG_DBL(_mb, al, "memory")
>  #undef __LL_SC_CMPXCHG_DBL
>  #undef __CMPXCHG_DBL
>
> +#define REFCOUNT_ADD_OP(op, pre, post)                                 \
> +static inline int __refcount_##op(int i, atomic_t *r)                  \
> +{                                                                      \
> +       register int w0 asm ("w0") = i;                                 \
> +       register atomic_t *x1 asm ("x1") = r;                           \
> +                                                                       \
> +       asm volatile(ARM64_LSE_ATOMIC_INSN(                             \
> +       /* LL/SC */                                                     \
> +       __LL_SC_CALL(__refcount_##op)                                   \
> +       "       cmp     %w0, wzr\n"                                     \
> +       __nops(1),                                                      \
> +       /* LSE atomics */                                               \
> +       "       ldadd           %w[i], w30, %[cval]\n"                  \
> +       "       adds            %w[i], %w[i], w30\n"                    \
> +       REFCOUNT_PRE_CHECK_ ## pre (w30))                               \
> +       REFCOUNT_POST_CHECK_ ## post                                    \
> +       : [i] "+r" (w0), [cval] "+Q" (r->counter)                       \
> +       : [counter] "r"(&r->counter), "r" (x1)                          \
> +       : __LL_SC_CLOBBERS, "cc");                                      \
> +                                                                       \
> +       return w0;                                                      \
> +}
> +
> +REFCOUNT_ADD_OP(add_lt, ZERO, NEG_OR_ZERO);
> +
> +#define REFCOUNT_SUB_OP(op, post)                                      \
> +static inline int __refcount_##op(int i, atomic_t *r)                  \
> +{                                                                      \
> +       register int w0 asm ("w0") = i;                                 \
> +       register atomic_t *x1 asm ("x1") = r;                           \
> +                                                                       \
> +       asm volatile(ARM64_LSE_ATOMIC_INSN(                             \
> +       /* LL/SC */                                                     \
> +       __LL_SC_CALL(__refcount_##op)                                   \
> +       "       cmp     %w0, wzr\n"                                     \
> +       __nops(1),                                                      \
> +       /* LSE atomics */                                               \
> +       "       neg     %w[i], %w[i]\n"                                 \
> +       "       ldaddl  %w[i], w30, %[cval]\n"                          \
> +       "       adds    %w[i], %w[i], w30")                             \
> +       REFCOUNT_POST_CHECK_ ## post                                    \
> +       : [i] "+r" (w0), [cval] "+Q" (r->counter)                       \
> +       : [counter] "r" (&r->counter), "r" (x1)                         \
> +       : __LL_SC_CLOBBERS, "cc");                                      \
> +                                                                       \
> +       return w0;                                                      \
> +}
> +
> +REFCOUNT_SUB_OP(sub_lt, NEG);
> +REFCOUNT_SUB_OP(sub_le, NEG_OR_ZERO);
> +
> +static inline int __refcount_add_not_zero(int i, atomic_t *r)
> +{
> +       register int result asm ("w0");
> +       register atomic_t *x1 asm ("x1") = r;
> +
> +       asm volatile(ARM64_LSE_ATOMIC_INSN(
> +       /* LL/SC */
> +       "       mov     %w0, %w[i]\n"
> +       __LL_SC_CALL(__refcount_add_not_zero)
> +       "       cmp     %w0, wzr\n"
> +       __nops(6),
> +       /* LSE atomics */
> +       "       ldr     %w0, %[cval]\n"
> +       "1:     cmp     %w0, wzr\n"
> +       "       b.eq    2f\n"
> +       "       add     w30, %w0, %w[i]\n"
> +       "       cas     %w0, w30, %[cval]\n"
> +       "       sub     w30, w30, %w[i]\n"
> +       "       cmp     %w0, w30\n"
> +       "       b.ne    1b\n"
> +       "       adds    %w0, w30, %w[i]\n"
> +       "2:\n")
> +       REFCOUNT_POST_CHECK_NEG
> +       : "=&r" (result), [cval] "+Q" (r->counter)
> +       : [counter] "r" (&r->counter), [i] "Ir" (i), "r" (x1)
> +       : __LL_SC_CLOBBERS, "cc");
> +
> +       return result;
> +}
> +
>  #endif /* __ASM_ATOMIC_LSE_H */
> diff --git a/arch/arm64/include/asm/brk-imm.h b/arch/arm64/include/asm/brk-imm.h
> index d84294064e6a..943f11ebe9af 100644
> --- a/arch/arm64/include/asm/brk-imm.h
> +++ b/arch/arm64/include/asm/brk-imm.h
> @@ -23,6 +23,7 @@
>  #define KPROBES_BRK_IMM                        0x004
>  #define UPROBES_BRK_IMM                        0x005
>  #define FAULT_BRK_IMM                  0x100
> +#define REFCOUNT_BRK_IMM               0x101
>  #define KGDB_DYN_DBG_BRK_IMM           0x400
>  #define KGDB_COMPILED_DBG_BRK_IMM      0x401
>  #define BUG_BRK_IMM                    0x800
> diff --git a/arch/arm64/include/asm/refcount.h b/arch/arm64/include/asm/refcount.h
> new file mode 100644
> index 000000000000..3c99b29f4549
> --- /dev/null
> +++ b/arch/arm64/include/asm/refcount.h
> @@ -0,0 +1,60 @@
> +/*
> + * arm64-specific implementation of refcount_t. Based on x86 version and
> + * PAX_REFCOUNT from PaX/grsecurity.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef __ASM_REFCOUNT_H
> +#define __ASM_REFCOUNT_H
> +
> +#include <linux/refcount.h>
> +
> +#include <asm/atomic.h>
> +
> +static __always_inline void refcount_add(int i, refcount_t *r)
> +{
> +       __refcount_add_lt(i, &r->refs);
> +}
> +
> +static __always_inline void refcount_inc(refcount_t *r)
> +{
> +       __refcount_add_lt(1, &r->refs);
> +}
> +
> +static __always_inline void refcount_dec(refcount_t *r)
> +{
> +       __refcount_sub_le(1, &r->refs);
> +}
> +
> +static __always_inline __must_check bool refcount_sub_and_test(unsigned int i,
> +                                                              refcount_t *r)
> +{
> +       bool ret = __refcount_sub_lt(i, &r->refs) == 0;
> +
> +       if (ret) {
> +               smp_acquire__after_ctrl_dep();
> +               return true;
> +       }
> +       return false;
> +}
> +
> +static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
> +{
> +       return refcount_sub_and_test(1, r);
> +}
> +
> +static __always_inline __must_check bool refcount_add_not_zero(unsigned int i,
> +                                                              refcount_t *r)
> +{
> +       return __refcount_add_not_zero(i, &r->refs) != 0;
> +}
> +
> +static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
> +{
> +       return __refcount_add_not_zero(1, &r->refs) != 0;
> +}
> +
> +#endif
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 177c0f6ebabf..5003cfb48d9b 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -1037,6 +1037,42 @@ int __init early_brk64(unsigned long addr, unsigned int esr,
>         return bug_handler(regs, esr) != DBG_HOOK_HANDLED;
>  }
>
> +static int refcount_overflow_handler(struct pt_regs *regs, unsigned int esr)
> +{
> +       u32 dummy_cbz = le32_to_cpup((__le32 *)(regs->pc + 4));
> +       bool zero = regs->pstate & PSR_Z_BIT;
> +       u32 rt;
> +
> +       /*
> +        * Find the register that holds the counter address from the
> +        * dummy 'cbz' instruction that follows the 'brk' instruction
> +        * that sent us here.
> +        */
> +       rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, dummy_cbz);
> +
> +       /* First unconditionally saturate the refcount. */
> +       *(int *)regs->regs[rt] = INT_MIN / 2;
> +
> +       /*
> +        * This function has been called because either a negative refcount
> +        * value was seen by any of the refcount functions, or a zero
> +        * refcount value was seen by refcount_{add,dec}().
> +        */
> +
> +       /* point pc to the branch instruction that detected the overflow */
> +       regs->pc += 4 + aarch64_get_branch_offset(dummy_cbz);
> +       refcount_error_report(regs, zero ? "hit zero" : "overflow");
> +
> +       /* advance pc and proceed */
> +       regs->pc += 4;
> +       return DBG_HOOK_HANDLED;
> +}
> +
> +static struct break_hook refcount_break_hook = {
> +       .fn     = refcount_overflow_handler,
> +       .imm    = REFCOUNT_BRK_IMM,
> +};
> +
>  /* This registration must happen early, before debug_traps_init(). */
>  void __init trap_init(void)
>  {
> @@ -1044,4 +1080,5 @@ void __init trap_init(void)
>  #ifdef CONFIG_KASAN_SW_TAGS
>         register_kernel_break_hook(&kasan_break_hook);
>  #endif
> +       register_kernel_break_hook(&refcount_break_hook);
>  }
> diff --git a/arch/arm64/lib/atomic_ll_sc.c b/arch/arm64/lib/atomic_ll_sc.c
> index b0c538b0da28..8a335cd9f0e2 100644
> --- a/arch/arm64/lib/atomic_ll_sc.c
> +++ b/arch/arm64/lib/atomic_ll_sc.c
> @@ -1,3 +1,15 @@
>  #include <asm/atomic.h>
>  #define __ARM64_IN_ATOMIC_IMPL
> +
> +/*
> + * Disarm the refcount checks in the out-of-line LL/SC routines. These are
> + * redundant, given that the LSE callers already perform the same checks.
> + * We do have to make sure that we exit with a zero value if the pre-check
> + * detected a zero value.
> + */
> +#undef REFCOUNT_POST_CHECK_NEG
> +#undef REFCOUNT_POST_CHECK_NEG_OR_ZERO
> +#define REFCOUNT_POST_CHECK_NEG
> +#define REFCOUNT_POST_CHECK_NEG_OR_ZERO   "csel %w[val], wzr, %w[val], eq\n"
> +
>  #include <asm/atomic_ll_sc.h>
> --
> 2.17.1
>

Jan Glauber June 20, 2019, 11:03 a.m. UTC | #2

On Wed, Jun 19, 2019 at 12:56:41PM +0200, Ard Biesheuvel wrote:
> (add Andrew, who has been looking at the atomics code as well)
> 
> On Wed, 19 Jun 2019 at 12:54, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >
> > This adds support to arm64 for fast refcount checking, as contributed
> > by Kees for x86 based on the implementation by grsecurity/PaX.
> >
> > The general approach is identical: the existing atomic_t helpers are
> > cloned for refcount_t, with the arithmetic instruction modified to set
> > the PSTATE flags, and one or two branch instructions added that jump to
> > an out of line handler if overflow, decrement to zero or increment from
> > zero are detected.
> >
> > One complication that we have to deal with on arm64 is the fact that
> > it has two atomics implementations: the original LL/SC implementation
> > using load/store exclusive loops, and the newer LSE one that does mostly
> > the same in a single instruction. So we need to clone some parts of
> > both for the refcount handlers, but we also need to deal with the way
> > LSE builds fall back to LL/SC at runtime if the hardware does not
> > support it.
> >
> > As is the case with the x86 version, the performance gain is substantial
> > (ThunderX2 @ 2.2 GHz, using LSE), even though the arm64 implementation
> > incorporates an add-from-zero check as well:
> >
> > perf stat -B -- echo ATOMIC_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> >
> >       116252672661      cycles                    #    2.207 GHz
> >
> >       52.689793525 seconds time elapsed
> >
> > perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> >
> >       127060259162      cycles                    #    2.207 GHz
> >
> >       57.243690077 seconds time elapsed
> >
> > For comparison, the numbers below were captured using CONFIG_REFCOUNT_FULL,
> > which uses the validation routines implemented in C using cmpxchg():
> >
> > perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> >
> >  Performance counter stats for 'cat /dev/fd/63':
> >
> >       191057942484      cycles                    #    2.207 GHz
> >
> >       86.568269904 seconds time elapsed
> >
> > As a bonus, this code has been found to perform significantly better on
> > systems with many CPUs, due to the fact that it no longer relies on the
> > load/compare-and-swap combo performed in a tight loop, which is what we
> > emit for cmpxchg() on arm64.

Great work! With that series refcount is no longer the top spot for the
open-close testcase on TX2 (with a distro-like config).

Minor unrelated nit - shouldn't the new refcount.h use the modern
SPDX-License-Identifier thing?

--Jan

Kees Cook June 20, 2019, 6:10 p.m. UTC | #3

On Wed, Jun 19, 2019 at 12:54:31PM +0200, Ard Biesheuvel wrote:
> This adds support to arm64 for fast refcount checking, as contributed
> by Kees for x86 based on the implementation by grsecurity/PaX.
> 
> The general approach is identical: the existing atomic_t helpers are
> cloned for refcount_t, with the arithmetic instruction modified to set
> the PSTATE flags, and one or two branch instructions added that jump to
> an out of line handler if overflow, decrement to zero or increment from
> zero are detected.
> 
> One complication that we have to deal with on arm64 is the fact that
> it has two atomics implementations: the original LL/SC implementation
> using load/store exclusive loops, and the newer LSE one that does mostly
> the same in a single instruction. So we need to clone some parts of
> both for the refcount handlers, but we also need to deal with the way
> LSE builds fall back to LL/SC at runtime if the hardware does not
> support it.
> 
> As is the case with the x86 version, the performance gain is substantial
> (ThunderX2 @ 2.2 GHz, using LSE), even though the arm64 implementation
> incorporates an add-from-zero check as well:
> 
> perf stat -B -- echo ATOMIC_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> 
>       116252672661      cycles                    #    2.207 GHz
> 
>       52.689793525 seconds time elapsed
> 
> perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> 
>       127060259162      cycles                    #    2.207 GHz
> 
>       57.243690077 seconds time elapsed
> 
> For comparison, the numbers below were captured using CONFIG_REFCOUNT_FULL,
> which uses the validation routines implemented in C using cmpxchg():
> 
> perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> 
>  Performance counter stats for 'cat /dev/fd/63':
> 
>       191057942484      cycles                    #    2.207 GHz
> 
>       86.568269904 seconds time elapsed
> 
> As a bonus, this code has been found to perform significantly better on
> systems with many CPUs, due to the fact that it no longer relies on the
> load/compare-and-swap combo performed in a tight loop, which is what we
> emit for cmpxchg() on arm64.
> 
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>,
> Cc: Jan Glauber <jglauber@marvell.com>,
> Cc: Linus Torvalds <torvalds@linux-foundation.org>,
> Cc: Hanjun Guo <guohanjun@huawei.com>
> 
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

Reviewed-by: Kees Cook <keescook@chromium.org>

Yay! (And I like the Kconfig solution too.)

-Kees

> ---
> v5: - rebase onto mainline
>     - add ACQUIRE ordering semantics to [dec|sub]_and_test() on success
>       (as per 47b8f3ab9c49)
>     - introduce ARCH_HAS_REFCOUNT_FULL implying ARCH_HAS_REFCOUNT, but also
>       hiding the REFCOUNT_FULL option from Kconfig entirely
>     - update commit log with TX2 results
> 
>  arch/Kconfig                          | 12 ++-
>  arch/arm64/Kconfig                    |  2 +-
>  arch/arm64/include/asm/atomic.h       | 24 ++++++
>  arch/arm64/include/asm/atomic_ll_sc.h | 50 ++++++++++++
>  arch/arm64/include/asm/atomic_lse.h   | 81 ++++++++++++++++++++
>  arch/arm64/include/asm/brk-imm.h      |  1 +
>  arch/arm64/include/asm/refcount.h     | 60 +++++++++++++++
>  arch/arm64/kernel/traps.c             | 37 +++++++++
>  arch/arm64/lib/atomic_ll_sc.c         | 12 +++
>  9 files changed, 277 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/Kconfig b/arch/Kconfig
> index c47b328eada0..a19abb00fc4b 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -876,6 +876,16 @@ config STRICT_MODULE_RWX
>  config ARCH_HAS_PHYS_TO_DMA
>  	bool
>  
> +config ARCH_HAS_REFCOUNT_FULL
> +	bool
> +	select ARCH_HAS_REFCOUNT
> +	help
> +	  An architecture selects this when the optimized refcount_t
> +	  implementation it provides covers all the cases that
> +	  CONFIG_REFCOUNT_FULL covers as well, in which case it makes no
> +	  sense to even offer CONFIG_REFCOUNT_FULL as a user selectable
> +	  option.
> +
>  config ARCH_HAS_REFCOUNT
>  	bool
>  	help
> @@ -889,7 +899,7 @@ config ARCH_HAS_REFCOUNT
>  	  against bugs in reference counts.
>  
>  config REFCOUNT_FULL
> -	bool "Perform full reference count validation at the expense of speed"
> +	bool "Perform full reference count validation at the expense of speed" if !ARCH_HAS_REFCOUNT_FULL
>  	help
>  	  Enabling this switches the refcounting infrastructure from a fast
>  	  unchecked atomic_t implementation to a fully state checked
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 697ea0510729..fa0d02e111e2 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -26,6 +26,7 @@ config ARM64
>  	select ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	select ARCH_HAS_PTE_SPECIAL
>  	select ARCH_HAS_SETUP_DMA_OPS
> +	select ARCH_HAS_REFCOUNT_FULL
>  	select ARCH_HAS_SET_MEMORY
>  	select ARCH_HAS_STRICT_KERNEL_RWX
>  	select ARCH_HAS_STRICT_MODULE_RWX
> @@ -173,7 +174,6 @@ config ARM64
>  	select PCI_SYSCALL if PCI
>  	select POWER_RESET
>  	select POWER_SUPPLY
> -	select REFCOUNT_FULL
>  	select SPARSE_IRQ
>  	select SWIOTLB
>  	select SYSCTL_EXCEPTION_TRACE
> diff --git a/arch/arm64/include/asm/atomic.h b/arch/arm64/include/asm/atomic.h
> index 1f4e9ee641c9..c2f7058395a2 100644
> --- a/arch/arm64/include/asm/atomic.h
> +++ b/arch/arm64/include/asm/atomic.h
> @@ -21,13 +21,37 @@
>  #define __ASM_ATOMIC_H
>  
>  #include <linux/compiler.h>
> +#include <linux/stringify.h>
>  #include <linux/types.h>
>  
>  #include <asm/barrier.h>
> +#include <asm/brk-imm.h>
>  #include <asm/lse.h>
>  
>  #ifdef __KERNEL__
>  
> +/*
> + * To avoid having to allocate registers that pass the counter address and
> + * address of the call site to the overflow handler, encode the register and
> + * call site offset in a dummy cbz instruction that we can decode later.
> + */
> +#define REFCOUNT_CHECK_TAIL						\
> +"	.subsection	1\n"						\
> +"33:	brk "		__stringify(REFCOUNT_BRK_IMM) "\n"		\
> +"	cbz		%[counter], 22b\n"	/* never reached */	\
> +"	.previous\n"
> +
> +#define REFCOUNT_POST_CHECK_NEG						\
> +"22:	b.mi		33f\n"						\
> +	REFCOUNT_CHECK_TAIL
> +
> +#define REFCOUNT_POST_CHECK_NEG_OR_ZERO					\
> +"	b.eq		33f\n"						\
> +	REFCOUNT_POST_CHECK_NEG
> +
> +#define REFCOUNT_PRE_CHECK_ZERO(reg)	"ccmp " #reg ", wzr, #8, pl\n"
> +#define REFCOUNT_PRE_CHECK_NONE(reg)
> +
>  #define __ARM64_IN_ATOMIC_IMPL
>  
>  #if defined(CONFIG_ARM64_LSE_ATOMICS) && defined(CONFIG_AS_LSE)
> diff --git a/arch/arm64/include/asm/atomic_ll_sc.h b/arch/arm64/include/asm/atomic_ll_sc.h
> index e321293e0c89..4598575a73a2 100644
> --- a/arch/arm64/include/asm/atomic_ll_sc.h
> +++ b/arch/arm64/include/asm/atomic_ll_sc.h
> @@ -336,4 +336,54 @@ __CMPXCHG_DBL(_mb, dmb ish, l, "memory")
>  
>  #undef __CMPXCHG_DBL
>  
> +#define REFCOUNT_OP(op, asm_op, pre, post, l)				\
> +__LL_SC_INLINE int							\
> +__LL_SC_PREFIX(__refcount_##op(int i, atomic_t *r))			\
> +{									\
> +	unsigned int tmp;						\
> +	int result;							\
> +									\
> +	asm volatile("// refcount_" #op "\n"				\
> +"	prfm		pstl1strm, %[cval]\n"				\
> +"1:	ldxr		%w1, %[cval]\n"					\
> +"	" #asm_op "	%w[val], %w1, %w[i]\n"				\
> +	REFCOUNT_PRE_CHECK_ ## pre (%w1)				\
> +"	st" #l "xr	%w1, %w[val], %[cval]\n"			\
> +"	cbnz		%w1, 1b\n"					\
> +	REFCOUNT_POST_CHECK_ ## post					\
> +	: [val] "=&r"(result), "=&r"(tmp), [cval] "+Q"(r->counter)	\
> +	: [counter] "r"(&r->counter), [i] "Ir" (i)			\
> +	: "cc");							\
> +									\
> +	return result;							\
> +}									\
> +__LL_SC_EXPORT(__refcount_##op);
> +
> +REFCOUNT_OP(add_lt, adds, ZERO, NEG_OR_ZERO,  );
> +REFCOUNT_OP(sub_lt, subs, NONE, NEG,         l);
> +REFCOUNT_OP(sub_le, subs, NONE, NEG_OR_ZERO, l);
> +
> +__LL_SC_INLINE int
> +__LL_SC_PREFIX(__refcount_add_not_zero(int i, atomic_t *r))
> +{
> +	unsigned int tmp;
> +	int result;
> +
> +	asm volatile("// refcount_add_not_zero\n"
> +"	prfm	pstl1strm, %[cval]\n"
> +"1:	ldxr	%w[val], %[cval]\n"
> +"	cbz	%w[val], 2f\n"
> +"	adds	%w[val], %w[val], %w[i]\n"
> +"	stxr	%w1, %w[val], %[cval]\n"
> +"	cbnz	%w1, 1b\n"
> +	REFCOUNT_POST_CHECK_NEG
> +"2:"
> +	: [val] "=&r" (result), "=&r" (tmp), [cval] "+Q" (r->counter)
> +	: [counter] "r"(&r->counter), [i] "Ir" (i)
> +	: "cc");
> +
> +	return result;
> +}
> +__LL_SC_EXPORT(__refcount_add_not_zero);
> +
>  #endif	/* __ASM_ATOMIC_LL_SC_H */
> diff --git a/arch/arm64/include/asm/atomic_lse.h b/arch/arm64/include/asm/atomic_lse.h
> index 9256a3921e4b..99637c0b0a92 100644
> --- a/arch/arm64/include/asm/atomic_lse.h
> +++ b/arch/arm64/include/asm/atomic_lse.h
> @@ -531,4 +531,85 @@ __CMPXCHG_DBL(_mb, al, "memory")
>  #undef __LL_SC_CMPXCHG_DBL
>  #undef __CMPXCHG_DBL
>  
> +#define REFCOUNT_ADD_OP(op, pre, post)					\
> +static inline int __refcount_##op(int i, atomic_t *r)			\
> +{									\
> +	register int w0 asm ("w0") = i;					\
> +	register atomic_t *x1 asm ("x1") = r;				\
> +									\
> +	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
> +	/* LL/SC */							\
> +	__LL_SC_CALL(__refcount_##op)					\
> +	"	cmp	%w0, wzr\n"					\
> +	__nops(1),							\
> +	/* LSE atomics */						\
> +	"	ldadd		%w[i], w30, %[cval]\n"			\
> +	"	adds		%w[i], %w[i], w30\n"			\
> +	REFCOUNT_PRE_CHECK_ ## pre (w30))				\
> +	REFCOUNT_POST_CHECK_ ## post					\
> +	: [i] "+r" (w0), [cval] "+Q" (r->counter)			\
> +	: [counter] "r"(&r->counter), "r" (x1)				\
> +	: __LL_SC_CLOBBERS, "cc");					\
> +									\
> +	return w0;							\
> +}
> +
> +REFCOUNT_ADD_OP(add_lt, ZERO, NEG_OR_ZERO);
> +
> +#define REFCOUNT_SUB_OP(op, post)					\
> +static inline int __refcount_##op(int i, atomic_t *r)			\
> +{									\
> +	register int w0 asm ("w0") = i;					\
> +	register atomic_t *x1 asm ("x1") = r;				\
> +									\
> +	asm volatile(ARM64_LSE_ATOMIC_INSN(				\
> +	/* LL/SC */							\
> +	__LL_SC_CALL(__refcount_##op)					\
> +	"	cmp	%w0, wzr\n"					\
> +	__nops(1),							\
> +	/* LSE atomics */						\
> +	"	neg	%w[i], %w[i]\n"					\
> +	"	ldaddl	%w[i], w30, %[cval]\n"				\
> +	"	adds	%w[i], %w[i], w30")				\
> +	REFCOUNT_POST_CHECK_ ## post					\
> +	: [i] "+r" (w0), [cval] "+Q" (r->counter)			\
> +	: [counter] "r" (&r->counter), "r" (x1)				\
> +	: __LL_SC_CLOBBERS, "cc");					\
> +									\
> +	return w0;							\
> +}
> +
> +REFCOUNT_SUB_OP(sub_lt, NEG);
> +REFCOUNT_SUB_OP(sub_le, NEG_OR_ZERO);
> +
> +static inline int __refcount_add_not_zero(int i, atomic_t *r)
> +{
> +	register int result asm ("w0");
> +	register atomic_t *x1 asm ("x1") = r;
> +
> +	asm volatile(ARM64_LSE_ATOMIC_INSN(
> +	/* LL/SC */
> +	"	mov	%w0, %w[i]\n"
> +	__LL_SC_CALL(__refcount_add_not_zero)
> +	"	cmp	%w0, wzr\n"
> +	__nops(6),
> +	/* LSE atomics */
> +	"	ldr	%w0, %[cval]\n"
> +	"1:	cmp	%w0, wzr\n"
> +	"	b.eq	2f\n"
> +	"	add	w30, %w0, %w[i]\n"
> +	"	cas	%w0, w30, %[cval]\n"
> +	"	sub	w30, w30, %w[i]\n"
> +	"	cmp	%w0, w30\n"
> +	"	b.ne	1b\n"
> +	"	adds	%w0, w30, %w[i]\n"
> +	"2:\n")
> +	REFCOUNT_POST_CHECK_NEG
> +	: "=&r" (result), [cval] "+Q" (r->counter)
> +	: [counter] "r" (&r->counter), [i] "Ir" (i), "r" (x1)
> +	: __LL_SC_CLOBBERS, "cc");
> +
> +	return result;
> +}
> +
>  #endif	/* __ASM_ATOMIC_LSE_H */
> diff --git a/arch/arm64/include/asm/brk-imm.h b/arch/arm64/include/asm/brk-imm.h
> index d84294064e6a..943f11ebe9af 100644
> --- a/arch/arm64/include/asm/brk-imm.h
> +++ b/arch/arm64/include/asm/brk-imm.h
> @@ -23,6 +23,7 @@
>  #define KPROBES_BRK_IMM			0x004
>  #define UPROBES_BRK_IMM			0x005
>  #define FAULT_BRK_IMM			0x100
> +#define REFCOUNT_BRK_IMM		0x101
>  #define KGDB_DYN_DBG_BRK_IMM		0x400
>  #define KGDB_COMPILED_DBG_BRK_IMM	0x401
>  #define BUG_BRK_IMM			0x800
> diff --git a/arch/arm64/include/asm/refcount.h b/arch/arm64/include/asm/refcount.h
> new file mode 100644
> index 000000000000..3c99b29f4549
> --- /dev/null
> +++ b/arch/arm64/include/asm/refcount.h
> @@ -0,0 +1,60 @@
> +/*
> + * arm64-specific implementation of refcount_t. Based on x86 version and
> + * PAX_REFCOUNT from PaX/grsecurity.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef __ASM_REFCOUNT_H
> +#define __ASM_REFCOUNT_H
> +
> +#include <linux/refcount.h>
> +
> +#include <asm/atomic.h>
> +
> +static __always_inline void refcount_add(int i, refcount_t *r)
> +{
> +	__refcount_add_lt(i, &r->refs);
> +}
> +
> +static __always_inline void refcount_inc(refcount_t *r)
> +{
> +	__refcount_add_lt(1, &r->refs);
> +}
> +
> +static __always_inline void refcount_dec(refcount_t *r)
> +{
> +	__refcount_sub_le(1, &r->refs);
> +}
> +
> +static __always_inline __must_check bool refcount_sub_and_test(unsigned int i,
> +							       refcount_t *r)
> +{
> +	bool ret = __refcount_sub_lt(i, &r->refs) == 0;
> +
> +	if (ret) {
> +		smp_acquire__after_ctrl_dep();
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
> +{
> +	return refcount_sub_and_test(1, r);
> +}
> +
> +static __always_inline __must_check bool refcount_add_not_zero(unsigned int i,
> +							       refcount_t *r)
> +{
> +	return __refcount_add_not_zero(i, &r->refs) != 0;
> +}
> +
> +static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
> +{
> +	return __refcount_add_not_zero(1, &r->refs) != 0;
> +}
> +
> +#endif
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 177c0f6ebabf..5003cfb48d9b 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -1037,6 +1037,42 @@ int __init early_brk64(unsigned long addr, unsigned int esr,
>  	return bug_handler(regs, esr) != DBG_HOOK_HANDLED;
>  }
>  
> +static int refcount_overflow_handler(struct pt_regs *regs, unsigned int esr)
> +{
> +	u32 dummy_cbz = le32_to_cpup((__le32 *)(regs->pc + 4));
> +	bool zero = regs->pstate & PSR_Z_BIT;
> +	u32 rt;
> +
> +	/*
> +	 * Find the register that holds the counter address from the
> +	 * dummy 'cbz' instruction that follows the 'brk' instruction
> +	 * that sent us here.
> +	 */
> +	rt = aarch64_insn_decode_register(AARCH64_INSN_REGTYPE_RT, dummy_cbz);
> +
> +	/* First unconditionally saturate the refcount. */
> +	*(int *)regs->regs[rt] = INT_MIN / 2;
> +
> +	/*
> +	 * This function has been called because either a negative refcount
> +	 * value was seen by any of the refcount functions, or a zero
> +	 * refcount value was seen by refcount_{add,dec}().
> +	 */
> +
> +	/* point pc to the branch instruction that detected the overflow */
> +	regs->pc += 4 + aarch64_get_branch_offset(dummy_cbz);
> +	refcount_error_report(regs, zero ? "hit zero" : "overflow");
> +
> +	/* advance pc and proceed */
> +	regs->pc += 4;
> +	return DBG_HOOK_HANDLED;
> +}
> +
> +static struct break_hook refcount_break_hook = {
> +	.fn	= refcount_overflow_handler,
> +	.imm	= REFCOUNT_BRK_IMM,
> +};
> +
>  /* This registration must happen early, before debug_traps_init(). */
>  void __init trap_init(void)
>  {
> @@ -1044,4 +1080,5 @@ void __init trap_init(void)
>  #ifdef CONFIG_KASAN_SW_TAGS
>  	register_kernel_break_hook(&kasan_break_hook);
>  #endif
> +	register_kernel_break_hook(&refcount_break_hook);
>  }
> diff --git a/arch/arm64/lib/atomic_ll_sc.c b/arch/arm64/lib/atomic_ll_sc.c
> index b0c538b0da28..8a335cd9f0e2 100644
> --- a/arch/arm64/lib/atomic_ll_sc.c
> +++ b/arch/arm64/lib/atomic_ll_sc.c
> @@ -1,3 +1,15 @@
>  #include <asm/atomic.h>
>  #define __ARM64_IN_ATOMIC_IMPL
> +
> +/*
> + * Disarm the refcount checks in the out-of-line LL/SC routines. These are
> + * redundant, given that the LSE callers already perform the same checks.
> + * We do have to make sure that we exit with a zero value if the pre-check
> + * detected a zero value.
> + */
> +#undef REFCOUNT_POST_CHECK_NEG
> +#undef REFCOUNT_POST_CHECK_NEG_OR_ZERO
> +#define REFCOUNT_POST_CHECK_NEG
> +#define REFCOUNT_POST_CHECK_NEG_OR_ZERO   "csel %w[val], wzr, %w[val], eq\n"
> +
>  #include <asm/atomic_ll_sc.h>
> -- 
> 2.17.1
>

Hanjun Guo June 24, 2019, 6:37 a.m. UTC | #4

On 2019/6/19 18:54, Ard Biesheuvel wrote:
> This adds support to arm64 for fast refcount checking, as contributed
> by Kees for x86 based on the implementation by grsecurity/PaX.
> 
> The general approach is identical: the existing atomic_t helpers are
> cloned for refcount_t, with the arithmetic instruction modified to set
> the PSTATE flags, and one or two branch instructions added that jump to
> an out of line handler if overflow, decrement to zero or increment from
> zero are detected.
> 
> One complication that we have to deal with on arm64 is the fact that
> it has two atomics implementations: the original LL/SC implementation
> using load/store exclusive loops, and the newer LSE one that does mostly
> the same in a single instruction. So we need to clone some parts of
> both for the refcount handlers, but we also need to deal with the way
> LSE builds fall back to LL/SC at runtime if the hardware does not
> support it.
> 
> As is the case with the x86 version, the performance gain is substantial
> (ThunderX2 @ 2.2 GHz, using LSE), even though the arm64 implementation
> incorporates an add-from-zero check as well:
> 
> perf stat -B -- echo ATOMIC_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> 
>       116252672661      cycles                    #    2.207 GHz
> 
>       52.689793525 seconds time elapsed
> 
> perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> 
>       127060259162      cycles                    #    2.207 GHz
> 
>       57.243690077 seconds time elapsed
> 
> For comparison, the numbers below were captured using CONFIG_REFCOUNT_FULL,
> which uses the validation routines implemented in C using cmpxchg():
> 
> perf stat -B -- echo REFCOUNT_TIMING >/sys/kernel/debug/provoke-crash/DIRECT
> 
>  Performance counter stats for 'cat /dev/fd/63':
> 
>       191057942484      cycles                    #    2.207 GHz
> 
>       86.568269904 seconds time elapsed
> 
> As a bonus, this code has been found to perform significantly better on
> systems with many CPUs, due to the fact that it no longer relies on the
> load/compare-and-swap combo performed in a tight loop, which is what we
> emit for cmpxchg() on arm64.

It helps my 96 cores ARM64 server as well,

Tested-by: Hanjun Guo <guohanjun@huawei.com>

Thanks
Hanjun

Will Deacon July 3, 2019, 1:40 p.m. UTC | #5

Hi Ard,

On Wed, Jun 19, 2019 at 12:54:31PM +0200, Ard Biesheuvel wrote:
> This adds support to arm64 for fast refcount checking, as contributed
> by Kees for x86 based on the implementation by grsecurity/PaX.
> 
> The general approach is identical: the existing atomic_t helpers are
> cloned for refcount_t, with the arithmetic instruction modified to set
> the PSTATE flags, and one or two branch instructions added that jump to
> an out of line handler if overflow, decrement to zero or increment from
> zero are detected.
> 
> One complication that we have to deal with on arm64 is the fact that
> it has two atomics implementations: the original LL/SC implementation
> using load/store exclusive loops, and the newer LSE one that does mostly
> the same in a single instruction. So we need to clone some parts of
> both for the refcount handlers, but we also need to deal with the way
> LSE builds fall back to LL/SC at runtime if the hardware does not
> support it.

I've been looking at this today and I still don't understand why this can't
be written as simple wrappers around atomic_t. What am I missing? To be more
concrete, why can't we implement the refcount functions along the lines of
the code below?

We can't call refcount_error_report() like this, but perhaps a
WARN_ON_RATELIMIT would be enough. However, I'm sure there's a reason
for the complexity in your proposal -- I just can't spot it.

Will

--->8

static __always_inline void refcount_add(int i, refcount_t *r)
{
	int old = atomic_fetch_add_relaxed(i, &r->refs);

	if (unlikely(old <= 0 || old + i <= 0))
		refcount_set(r, INT_MIN / 2);
}

static __always_inline void refcount_inc(refcount_t *r)
{
	refcount_add(1, r);
}

static __always_inline void refcount_dec(refcount_t *r)
{
	int old = atomic_fetch_sub_release(1, &r->refs);

	if (unlikely(old <= 1))
		refcount_set(r, INT_MIN / 2);
}

static __always_inline __must_check bool refcount_sub_and_test(int i,
							       refcount_t *r)
{
	int old = atomic_fetch_sub(i, &r->refs);

	if (old - i < 0)
		refcount_set(r, INT_MIN / 2);

	return old == i;
}

static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
{
	return refcount_sub_and_test(1, r);
}

static __always_inline __must_check bool refcount_add_not_zero(int i,
							       refcount_t *r)
{
	int old = refcount_read(r);

	do {
		if (!old)
			break;
	} while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));

	if (old < 0 || old + i < 0)
		refcount_set(r, INT_MIN / 2);

	return old;
}

static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
{
	return refcount_add_not_zero(1, r);
}

Ard Biesheuvel July 3, 2019, 6:12 p.m. UTC | #6

On Wed, 3 Jul 2019 at 15:40, Will Deacon <will@kernel.org> wrote:
>
> Hi Ard,
>
> On Wed, Jun 19, 2019 at 12:54:31PM +0200, Ard Biesheuvel wrote:
> > This adds support to arm64 for fast refcount checking, as contributed
> > by Kees for x86 based on the implementation by grsecurity/PaX.
> >
> > The general approach is identical: the existing atomic_t helpers are
> > cloned for refcount_t, with the arithmetic instruction modified to set
> > the PSTATE flags, and one or two branch instructions added that jump to
> > an out of line handler if overflow, decrement to zero or increment from
> > zero are detected.
> >
> > One complication that we have to deal with on arm64 is the fact that
> > it has two atomics implementations: the original LL/SC implementation
> > using load/store exclusive loops, and the newer LSE one that does mostly
> > the same in a single instruction. So we need to clone some parts of
> > both for the refcount handlers, but we also need to deal with the way
> > LSE builds fall back to LL/SC at runtime if the hardware does not
> > support it.
>
> I've been looking at this today and I still don't understand why this can't
> be written as simple wrappers around atomic_t. What am I missing? To be more
> concrete, why can't we implement the refcount functions along the lines of
> the code below?
>

There was a lot of pushback against the use of refcount_t in the
beginning, given that the checked flavor was slower than unchecked
atomic_t, and IIRC, it was mainly the networking folks that opposed
it. So the whole idea is that the code performs as closely to atomic_t
as possible, which is why the code is simply the atomic_t asm
implementations, but with a -s suffix added to the arithmetic
instructions so they set PSTATE, and one or two conditional branch
instructions added.

Your approach is going to require one or two additional compare
instructions, increasing the instruction count. This may not matter on
fast OoO cores, but it probably will affect somebody's benchmark
somewhere.

However, I'd be in favour of switching to your code, since it is much
simpler and more maintainable, so if you spin it as a proper patch, we
can do some comparative analysis of the performance.

> We can't call refcount_error_report() like this, but perhaps a
> WARN_ON_RATELIMIT would be enough. However, I'm sure there's a reason
> for the complexity in your proposal -- I just can't spot it.
>


> --->8
>
> static __always_inline void refcount_add(int i, refcount_t *r)
> {
>         int old = atomic_fetch_add_relaxed(i, &r->refs);
>
>         if (unlikely(old <= 0 || old + i <= 0))
>                 refcount_set(r, INT_MIN / 2);
> }
>
> static __always_inline void refcount_inc(refcount_t *r)
> {
>         refcount_add(1, r);
> }
>
> static __always_inline void refcount_dec(refcount_t *r)
> {
>         int old = atomic_fetch_sub_release(1, &r->refs);
>
>         if (unlikely(old <= 1))
>                 refcount_set(r, INT_MIN / 2);
> }
>
> static __always_inline __must_check bool refcount_sub_and_test(int i,
>                                                                refcount_t *r)
> {
>         int old = atomic_fetch_sub(i, &r->refs);
>
>         if (old - i < 0)
>                 refcount_set(r, INT_MIN / 2);
>
>         return old == i;
> }
>
> static __always_inline __must_check bool refcount_dec_and_test(refcount_t *r)
> {
>         return refcount_sub_and_test(1, r);
> }
>
> static __always_inline __must_check bool refcount_add_not_zero(int i,
>                                                                refcount_t *r)
> {
>         int old = refcount_read(r);
>
>         do {
>                 if (!old)
>                         break;
>         } while (!atomic_try_cmpxchg_relaxed(&r->refs, &old, old + i));
>
>         if (old < 0 || old + i < 0)
>                 refcount_set(r, INT_MIN / 2);
>
>         return old;
> }
>
> static __always_inline __must_check bool refcount_inc_not_zero(refcount_t *r)
> {
>         return refcount_add_not_zero(1, r);
> }

Will Deacon July 10, 2019, 12:21 p.m. UTC | #7

On Wed, Jul 03, 2019 at 08:12:12PM +0200, Ard Biesheuvel wrote:
> There was a lot of pushback against the use of refcount_t in the
> beginning, given that the checked flavor was slower than unchecked
> atomic_t, and IIRC, it was mainly the networking folks that opposed
> it. So the whole idea is that the code performs as closely to atomic_t
> as possible, which is why the code is simply the atomic_t asm
> implementations, but with a -s suffix added to the arithmetic
> instructions so they set PSTATE, and one or two conditional branch
> instructions added.
> 
> Your approach is going to require one or two additional compare
> instructions, increasing the instruction count. This may not matter on
> fast OoO cores, but it probably will affect somebody's benchmark
> somewhere.
> 
> However, I'd be in favour of switching to your code, since it is much
> simpler and more maintainable, so if you spin it as a proper patch, we
> can do some comparative analysis of the performance.

I'll post the patches after the merge window, but I've pushed them here in
the meantime in case anybody gets a chance to take them for a spin:

https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=refcount/full

Will

Jan Glauber July 15, 2019, 12:44 p.m. UTC | #8

On Wed, Jul 10, 2019 at 01:21:18PM +0100, Will Deacon wrote:
> I'll post the patches after the merge window, but I've pushed them here in
> the meantime in case anybody gets a chance to take them for a spin:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=refcount/full

Performance of this series is on par to Ard's v5 on TX2.

--Jan

Hanjun Guo July 17, 2019, 12:53 p.m. UTC | #9

On 2019/7/10 20:21, Will Deacon wrote:
> On Wed, Jul 03, 2019 at 08:12:12PM +0200, Ard Biesheuvel wrote:
>> There was a lot of pushback against the use of refcount_t in the
>> beginning, given that the checked flavor was slower than unchecked
>> atomic_t, and IIRC, it was mainly the networking folks that opposed
>> it. So the whole idea is that the code performs as closely to atomic_t
>> as possible, which is why the code is simply the atomic_t asm
>> implementations, but with a -s suffix added to the arithmetic
>> instructions so they set PSTATE, and one or two conditional branch
>> instructions added.
>>
>> Your approach is going to require one or two additional compare
>> instructions, increasing the instruction count. This may not matter on
>> fast OoO cores, but it probably will affect somebody's benchmark
>> somewhere.
>>
>> However, I'd be in favour of switching to your code, since it is much
>> simpler and more maintainable, so if you spin it as a proper patch, we
>> can do some comparative analysis of the performance.
> 
> I'll post the patches after the merge window, but I've pushed them here in
> the meantime in case anybody gets a chance to take them for a spin:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=refcount/full

Hmm, tested with Jan's test case on a 96 cores ARM server, I can
see 5% better performance than Ard's patch for 16, 32, 48 cores,
and even 9% better for 72 cores and 96 cores (48 cores per socket,
24 cores per CPU die, two CPU dies in the same socket, and the latency
from socket to socket is higher than the latency between CPU dies in
the same socket). I tested this for two rounds and the test result is
stable. Not sure why it's better than Ard's patch, maybe it's related
our fast OoO cores :)

Anyway,

Tested-by: Hanjun Guo <guohanjun@huawei.com>

Thanks
Hanjun

Hanjun Guo July 17, 2019, 1:23 p.m. UTC | #10

[resend for that the previous email seems didn't hit linux-arm-kernel mailing list]

On 2019/7/10 20:21, Will Deacon wrote:
> On Wed, Jul 03, 2019 at 08:12:12PM +0200, Ard Biesheuvel wrote:
>> There was a lot of pushback against the use of refcount_t in the
>> beginning, given that the checked flavor was slower than unchecked
>> atomic_t, and IIRC, it was mainly the networking folks that opposed
>> it. So the whole idea is that the code performs as closely to atomic_t
>> as possible, which is why the code is simply the atomic_t asm
>> implementations, but with a -s suffix added to the arithmetic
>> instructions so they set PSTATE, and one or two conditional branch
>> instructions added.
>>
>> Your approach is going to require one or two additional compare
>> instructions, increasing the instruction count. This may not matter on
>> fast OoO cores, but it probably will affect somebody's benchmark
>> somewhere.
>>
>> However, I'd be in favour of switching to your code, since it is much
>> simpler and more maintainable, so if you spin it as a proper patch, we
>> can do some comparative analysis of the performance.
> 
> I'll post the patches after the merge window, but I've pushed them here in
> the meantime in case anybody gets a chance to take them for a spin:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=refcount/full

Hmm, tested with Jan's test case on a 96 cores ARM server, I can
see 5% better performance than Ard's patch for 16, 32, 48 cores,
and even 9% better for 72 cores and 96 cores (48 cores per socket,
24 cores per CPU die, two CPU dies in the same socket, and the latency
from socket to socket is higher than the latency between CPU dies in
the same socket). I tested this for two rounds and the test result is
stable. Not sure why it's better than Ard's patch, maybe it's related
our fast OoO cores :)

Anyway,

Tested-by: Hanjun Guo <guohanjun@huawei.com>

Thanks
Hanjun

Kees Cook July 22, 2019, 4:43 p.m. UTC | #11

On Wed, Jul 10, 2019 at 01:21:18PM +0100, Will Deacon wrote:
> On Wed, Jul 03, 2019 at 08:12:12PM +0200, Ard Biesheuvel wrote:
> > There was a lot of pushback against the use of refcount_t in the
> > beginning, given that the checked flavor was slower than unchecked
> > atomic_t, and IIRC, it was mainly the networking folks that opposed
> > it. So the whole idea is that the code performs as closely to atomic_t
> > as possible, which is why the code is simply the atomic_t asm
> > implementations, but with a -s suffix added to the arithmetic
> > instructions so they set PSTATE, and one or two conditional branch
> > instructions added.
> > 
> > Your approach is going to require one or two additional compare
> > instructions, increasing the instruction count. This may not matter on
> > fast OoO cores, but it probably will affect somebody's benchmark
> > somewhere.
> > 
> > However, I'd be in favour of switching to your code, since it is much
> > simpler and more maintainable, so if you spin it as a proper patch, we
> > can do some comparative analysis of the performance.
> 
> I'll post the patches after the merge window, but I've pushed them here in
> the meantime in case anybody gets a chance to take them for a spin:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=refcount/full

Can the last patch get split up a bit? There's the inlining move, then
there is the atomic_fetch*() changes. Putting that all in one patch
makes the series a bit harder to review.

(Also, what happened to the *_checked() variations?)

Will Deacon July 22, 2019, 5:11 p.m. UTC | #12

On Mon, Jul 22, 2019 at 09:43:54AM -0700, Kees Cook wrote:
> On Wed, Jul 10, 2019 at 01:21:18PM +0100, Will Deacon wrote:
> > On Wed, Jul 03, 2019 at 08:12:12PM +0200, Ard Biesheuvel wrote:
> > > There was a lot of pushback against the use of refcount_t in the
> > > beginning, given that the checked flavor was slower than unchecked
> > > atomic_t, and IIRC, it was mainly the networking folks that opposed
> > > it. So the whole idea is that the code performs as closely to atomic_t
> > > as possible, which is why the code is simply the atomic_t asm
> > > implementations, but with a -s suffix added to the arithmetic
> > > instructions so they set PSTATE, and one or two conditional branch
> > > instructions added.
> > > 
> > > Your approach is going to require one or two additional compare
> > > instructions, increasing the instruction count. This may not matter on
> > > fast OoO cores, but it probably will affect somebody's benchmark
> > > somewhere.
> > > 
> > > However, I'd be in favour of switching to your code, since it is much
> > > simpler and more maintainable, so if you spin it as a proper patch, we
> > > can do some comparative analysis of the performance.
> > 
> > I'll post the patches after the merge window, but I've pushed them here in
> > the meantime in case anybody gets a chance to take them for a spin:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=refcount/full
> 
> Can the last patch get split up a bit? There's the inlining move, then
> there is the atomic_fetch*() changes. Putting that all in one patch
> makes the series a bit harder to review.

Sure thing, I'll do that when I get round to posting them properly, since
the feedback so far on the performance front looks positive.

> (Also, what happened to the *_checked() variations?)

The new implementation is intended to replace the *_checked() variants,
and the discrepancy in naming doesn't make any sense to me once everything
is inline in the header file. Am I missing something?

Will

Kees Cook July 22, 2019, 5:27 p.m. UTC | #13

On Mon, Jul 22, 2019 at 06:11:42PM +0100, Will Deacon wrote:
> On Mon, Jul 22, 2019 at 09:43:54AM -0700, Kees Cook wrote:
> > (Also, what happened to the *_checked() variations?)
> 
> The new implementation is intended to replace the *_checked() variants,
> and the discrepancy in naming doesn't make any sense to me once everything
> is inline in the header file. Am I missing something?

I haven't looked at the resulting builds, but the reason for the
_checked() macro stuff was to provide a way for callers to opt into a
checked refcount_t regardless of the state of REFCOUNT_FULL (especially
for architectures without special refcount handling). If that is
retained, then all is well. It just looked odd to me in the patch.

Will Deacon July 29, 2019, 5:24 p.m. UTC | #14

On Mon, Jul 22, 2019 at 10:27:07AM -0700, Kees Cook wrote:
> On Mon, Jul 22, 2019 at 06:11:42PM +0100, Will Deacon wrote:
> > On Mon, Jul 22, 2019 at 09:43:54AM -0700, Kees Cook wrote:
> > > (Also, what happened to the *_checked() variations?)
> > 
> > The new implementation is intended to replace the *_checked() variants,
> > and the discrepancy in naming doesn't make any sense to me once everything
> > is inline in the header file. Am I missing something?
> 
> I haven't looked at the resulting builds, but the reason for the
> _checked() macro stuff was to provide a way for callers to opt into a
> checked refcount_t regardless of the state of REFCOUNT_FULL (especially
> for architectures without special refcount handling). If that is
> retained, then all is well. It just looked odd to me in the patch.

Hmm, so that has a grand total of zero users in mainline afaict. Do you
expect that to change?

Will

Kees Cook July 29, 2019, 9:38 p.m. UTC | #15

On Mon, Jul 29, 2019 at 06:24:15PM +0100, Will Deacon wrote:
> On Mon, Jul 22, 2019 at 10:27:07AM -0700, Kees Cook wrote:
> > On Mon, Jul 22, 2019 at 06:11:42PM +0100, Will Deacon wrote:
> > > On Mon, Jul 22, 2019 at 09:43:54AM -0700, Kees Cook wrote:
> > > > (Also, what happened to the *_checked() variations?)
> > > 
> > > The new implementation is intended to replace the *_checked() variants,
> > > and the discrepancy in naming doesn't make any sense to me once everything
> > > is inline in the header file. Am I missing something?
> > 
> > I haven't looked at the resulting builds, but the reason for the
> > _checked() macro stuff was to provide a way for callers to opt into a
> > checked refcount_t regardless of the state of REFCOUNT_FULL (especially
> > for architectures without special refcount handling). If that is
> > retained, then all is well. It just looked odd to me in the patch.
> 
> Hmm, so that has a grand total of zero users in mainline afaict. Do you
> expect that to change?

Hm, I thought Mark Rutland had one (or plans for now)... adding to Cc.

But yeah, if nothing is using it, away it goes! ;)

[v5] arm64: kernel: implement fast refcount checking

Commit Message

Comments

Patch