
arm64: implement support for static call trampolines

Message ID 20201019141247.25122-1-ardb@kernel.org (mailing list archive)
State New, archived
Series arm64: implement support for static call trampolines

Commit Message

Ard Biesheuvel Oct. 19, 2020, 2:12 p.m. UTC
Implement arm64 support for the 'unoptimized' static call variety,
which routes all calls through a single trampoline that is patched
to perform a tail call to the selected function.

Since static call targets may be located in modules loaded out of
direct branching range, we need to be able to fall back to issuing an
ADRP/ADD pair to load the branch target into X16 and using a BR
instruction. As this involves patching more than a single B or NOP
instruction (for which the architecture makes special provisions in
terms of the required synchronization), we take care to only use
aarch64_insn_patch_text_nosync() if the subsequent instruction is still
a 'RET', which guarantees that the instruction being patched is either
a B or a NOP.

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
Note that I needed the changes after the --8<-- below to enable the
selftest, since it currently assumes HAVE_STATIC_CALL_INLINE.
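
For reference, a minimal sketch of how the generic API exercises this
trampoline (the function names below are made up; the macros are the
generic ones from <linux/static_call.h>):

  #include <linux/static_call.h>

  static int my_func(int arg)
  {
          return arg + 1;
  }

  /* emits the trampoline via ARCH_DEFINE_STATIC_CALL_TRAMP() */
  DEFINE_STATIC_CALL(my_call, my_func);

  int caller(int arg)
  {
          /* ends up as a direct BL to the trampoline */
          return static_call(my_call)(arg);
  }

  /*
   * e.g. from an initcall, retargets the trampoline via
   * arch_static_call_transform():
   *
   *      static_call_update(my_call, my_other_func);
   */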

 arch/arm64/Kconfig                   |  1 +
 arch/arm64/include/asm/static_call.h | 26 ++++++++++
 arch/arm64/kernel/Makefile           |  2 +-
 arch/arm64/kernel/static_call.c      | 52 ++++++++++++++++++++
 arch/arm64/kernel/vmlinux.lds.S      |  1 +
 5 files changed, 81 insertions(+), 1 deletion(-)

Comments

Peter Zijlstra Oct. 19, 2020, 5:05 p.m. UTC | #1
On Mon, Oct 19, 2020 at 04:12:47PM +0200, Ard Biesheuvel wrote:
> Implement arm64 support for the 'unoptimized' static call variety,
> which routes all calls through a single trampoline that is patched
> to perform a tail call to the selected function.
> 
> Since static call targets may be located in modules loaded out of
> direct branching range, we need to be able to fall back to issuing an
> ADRP/ADD pair to load the branch target into X16 and using a BR
> instruction. As this involves patching more than a single B or NOP
> instruction (for which the architecture makes special provisions in
> terms of the required synchronization), we take care to only use
> aarch64_insn_patch_text_nosync() if the subsequent instruction is still
> a 'RET', which guarantees that the instruction being patched is either
> a B or a NOP.

Aside from the lack of objtool support (which is being worked on), is
there anything else in the way of also doing inline patching for ARM64?

That is, if the function is not reachable by the branch immediate, you
can always leave (or re-instate) the call to the trampoline after
patching that.

Anyway, nice to see ARM64 support, thanks!
Ard Biesheuvel Oct. 19, 2020, 5:23 p.m. UTC | #2
On Mon, 19 Oct 2020 at 19:05, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Oct 19, 2020 at 04:12:47PM +0200, Ard Biesheuvel wrote:
> > Implement arm64 support for the 'unoptimized' static call variety,
> > which routes all calls through a single trampoline that is patched
> > to perform a tail call to the selected function.
> >
> > Since static call targets may be located in modules loaded out of
> > direct branching range, we need to be able to fall back to issuing an
> > ADRP/ADD pair to load the branch target into X16 and using a BR
> > instruction. As this involves patching more than a single B or NOP
> > instruction (for which the architecture makes special provisions in
> > terms of the required synchronization), we take care to only use
> > aarch64_insn_patch_text_nosync() if the subsequent instruction is still
> > a 'RET', which guarantees that the instruction being patched is either
> > a B or a NOP.
>
> Aside from the lack of objtool support (which is being worked on), is
> there anything else in the way of also doing inline patching for ARM64?
>

I implemented a GCC plugin for this a while ago [0], but I guess we'd
need to see some evidence first that it will actually make a
difference.

> That is, if the function is not reachable by the branch immediate, you
> can always leave (or re-instate) the call to the trampoline after
> patching that.
>

I don't think there is anything fundamentally preventing us from doing
that: the trampolines are guaranteed to be in range for all of their
callers, since modules loaded far away get their own PLT trampoline
inside the module itself, so we can always fall back to the trampoline
if we cannot patch the branch at the call site. However, we might be
able to do better here: at patch time, we could detect whether the call
site points into a module's PLT trampoline, and update that one instead
of bouncing from one trampoline to the next.
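
Roughly, I'd expect such a call site update to boil down to something
like the sketch below (untested; it assumes a BL at the call site and
just reuses the existing asm/insn.h helpers):

  #include <asm/debug-monitors.h>
  #include <asm/insn.h>

  static void patch_call_site(void *site, void *tramp, void *func)
  {
          u32 insn;

          /* try a direct BL to the new target ... */
          insn = aarch64_insn_gen_branch_imm((unsigned long)site,
                                             (unsigned long)func,
                                             AARCH64_INSN_BRANCH_LINK);

          /* ... or fall back to a BL to the trampoline if out of range */
          if (insn == AARCH64_BREAK_FAULT)
                  insn = aarch64_insn_gen_branch_imm((unsigned long)site,
                                                     (unsigned long)tramp,
                                                     AARCH64_INSN_BRANCH_LINK);

          aarch64_insn_patch_text_nosync(site, insn);
  }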

But again, we'd need to see some use cases first where it makes a
difference. And actually, the same reasoning applies to this patch: I
haven't done any performance testing, nor do I have any use cases in
mind where this sits on a hot path. The reason I have been looking at
static calls is to clean up kludges such as the one we have in
lib/crc-t10dif.c for switching to an arch-optimized implementation at
runtime.
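
E.g., a rough sketch of what that could look like with a static call
(names are illustrative only, not the actual lib/crc-t10dif.c
interface):

  #include <linux/static_call.h>
  #include <linux/types.h>

  /* the generic C implementation from lib/crc-t10dif.c */
  extern __u16 crc_t10dif_generic(__u16 crc, const unsigned char *buf,
                                  size_t len);

  DEFINE_STATIC_CALL(crc_t10dif_impl, crc_t10dif_generic);

  __u16 crc_t10dif_update(__u16 crc, const unsigned char *buf, size_t len)
  {
          return static_call(crc_t10dif_impl)(crc, buf, len);
  }

  /*
   * Arch init code would switch to an accelerated routine once it is
   * known to be usable, e.g.:
   *
   *      static_call_update(crc_t10dif_impl, crc_t10dif_pmull);
   */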

> Anyway, nice to see ARM64 support, thanks!

[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls
Peter Zijlstra Oct. 20, 2020, 7:51 a.m. UTC | #3
On Mon, Oct 19, 2020 at 07:23:20PM +0200, Ard Biesheuvel wrote:
> I implemented a GCC plugin for this a while ago [0], but I guess we'd
> need to see some evidence first that it will actually make a
> difference.

  https://lkml.kernel.org/r/20201009174554.GS2611@hirez.programming.kicks-ass.net

might provide fertile benchmarking ground. Of course, you'd need to port
it to arm64 first :-)

Anyway, Frederic will reportedly go work on getting that in shape and
merged.

There's also this:

  https://lkml.kernel.org/r/20200820164753.3256899-1-jackmanb@chromium.org

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 96633b174e39..7841026694e8 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -172,6 +172,7 @@  config ARM64
 	select HAVE_PERF_REGS
 	select HAVE_PERF_USER_STACK_DUMP
 	select HAVE_REGS_AND_STACK_ACCESS_API
+	select HAVE_STATIC_CALL
 	select HAVE_FUNCTION_ARG_ACCESS_API
 	select HAVE_FUTEX_CMPXCHG if FUTEX
 	select MMU_GATHER_RCU_TABLE_FREE
diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
new file mode 100644
index 000000000000..d70e6495768a
--- /dev/null
+++ b/arch/arm64/include/asm/static_call.h
@@ -0,0 +1,26 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_STATIC_CALL_H
+#define _ASM_STATIC_CALL_H
+
+#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insns)			\
+	asm(".pushsection	.static_call.text, \"ax\"	\n"	\
+	    ".align		2				\n"	\
+	    ".globl		" STATIC_CALL_TRAMP_STR(name) "	\n"	\
+	    STATIC_CALL_TRAMP_STR(name) ":			\n"	\
+	    "hint 34;		/* BTI C */			\n"	\
+	    insns "						\n"	\
+	    ".popsection					\n")
+
+/*
+ * We have to account for the possibility that the static call site may
+ * be updated to refer to a target that is out of range for an ordinary
+ * 'B' branch instruction, and so we need to pre-allocate some space for
+ * an ADRP/ADD/BR sequence.
+ */
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func)			\
+	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "b " #func "; ret; br x16")
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name)			\
+	__ARCH_DEFINE_STATIC_CALL_TRAMP(name, "nop; ret; br x16")
+
+#endif /* _ASM_STATIC_CALL_H */
diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index bbaf0bc4ad60..f579800eb860 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -17,7 +17,7 @@  obj-y			:= debug-monitors.o entry.o irq.o fpsimd.o		\
 			   return_address.o cpuinfo.o cpu_errata.o		\
 			   cpufeature.o alternative.o cacheinfo.o		\
 			   smp.o smp_spin_table.o topology.o smccc-call.o	\
-			   syscall.o proton-pack.o
+			   syscall.o proton-pack.o static_call.o
 
 targets			+= efi-entry.o
 
diff --git a/arch/arm64/kernel/static_call.c b/arch/arm64/kernel/static_call.c
new file mode 100644
index 000000000000..e6e42b52aea4
--- /dev/null
+++ b/arch/arm64/kernel/static_call.c
@@ -0,0 +1,52 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/static_call.h>
+#include <linux/memory.h>
+#include <asm/debug-monitors.h>
+#include <asm/insn.h>
+
+/*
+ * The static call trampoline consists of one of the following sequences:
+ *
+ *        (A)                  (B)                  (C)
+ * 0x0:  BTI   C              BTI   C              BTI   C
+ * 0x4:  B     <func>         NOP                  ADRP  X16, <func>
+ * 0x8:  RET                  RET                  ADD   X16, X16, :lo12:<func>
+ * 0xc:  BR    X16            BR    X16            BR    X16
+ *
+ * The architecture permits us to patch B instructions into NOPs or vice versa
+ * directly, but patching a multi-instruction sequence requires careful
+ * synchronization.
+ */
+void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
+{
+	unsigned long pc = (unsigned long)tramp + 4;
+	unsigned long dst = (unsigned long)func;
+	void *addrs[] = { (void *)pc, (void *)(pc + 4) };
+	u32 ret = le32_to_cpup(addrs[1]);
+	u32 insn[2];
+
+	insn[0] = func ? aarch64_insn_gen_branch_imm(pc, dst,
+						     AARCH64_INSN_BRANCH_NOLINK)
+		       : AARCH64_INSN_HINT_NOP;
+	insn[1] = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_30,
+					      AARCH64_INSN_BRANCH_RETURN);
+
+	if (ret == insn[1] && insn[0] != AARCH64_BREAK_FAULT) {
+		aarch64_insn_patch_text_nosync(addrs[0], insn[0]);
+		return;
+	}
+
+	if (insn[0] == AARCH64_BREAK_FAULT) {
+		insn[0] = aarch64_insn_gen_adr(pc, dst, AARCH64_INSN_REG_16,
+					       AARCH64_INSN_ADR_TYPE_ADRP);
+		insn[1] = aarch64_insn_gen_add_sub_imm(AARCH64_INSN_REG_16,
+						       AARCH64_INSN_REG_16,
+						       dst % SZ_4K,
+						       AARCH64_INSN_VARIANT_64BIT,
+						       AARCH64_INSN_ADSB_ADD);
+
+		BUG_ON(insn[0] == AARCH64_BREAK_FAULT);
+	}
+	aarch64_insn_patch_text(addrs, insn, ARRAY_SIZE(insn));
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);
diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S
index 5ca957e656ab..77f9f23934db 100644
--- a/arch/arm64/kernel/vmlinux.lds.S
+++ b/arch/arm64/kernel/vmlinux.lds.S
@@ -123,6 +123,7 @@  SECTIONS
 			IDMAP_TEXT
 			HIBERNATE_TEXT
 			TRAMP_TEXT
+			STATIC_CALL_TEXT
 			*(.fixup)
 			*(.gnu.warning)
 		. = ALIGN(16);
-- 
2.17.1


---------------8<---------------------
diff --git a/kernel/Makefile b/kernel/Makefile
index b74820d8b264..3d7fec961b40 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -111,7 +111,7 @@  obj-$(CONFIG_CPU_PM) += cpu_pm.o
 obj-$(CONFIG_BPF) += bpf/
 obj-$(CONFIG_KCSAN) += kcsan/
 obj-$(CONFIG_SHADOW_CALL_STACK) += scs.o
-obj-$(CONFIG_HAVE_STATIC_CALL_INLINE) += static_call.o
+obj-$(CONFIG_HAVE_STATIC_CALL) += static_call.o
 
 obj-$(CONFIG_PERF_EVENTS) += events/
 
diff --git a/kernel/static_call.c b/kernel/static_call.c
index 84565c2a41b8..efee9f72dc06 100644
--- a/kernel/static_call.c
+++ b/kernel/static_call.c
@@ -10,6 +10,8 @@ 
 #include <linux/processor.h>
 #include <asm/sections.h>
 
+#ifdef CONFIG_HAVE_STATIC_CALL_INLINE
+
 extern struct static_call_site __start_static_call_sites[],
                               __stop_static_call_sites[];
 
@@ -437,6 +439,7 @@  int __init static_call_init(void)
        return 0;
 }
 early_initcall(static_call_init);
+#endif
 
 #ifdef CONFIG_STATIC_CALL_SELFTEST