[v2] arm64: implement support for static call trampolines

Message ID	20201028184114.6834-1-ardb@kernel.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=AM15=ED=lists.infradead.org=linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 4827C247CC From: Ard Biesheuvel <ardb@kernel.org> To: linux-arm-kernel@lists.infradead.org Subject: [PATCH v2] arm64: implement support for static call trampolines Date: Wed, 28 Oct 2020 19:41:14 +0100 Message-Id: <20201028184114.6834-1-ardb@kernel.org> Precedence: list Cc: mark.rutland@arm.com, Peter Zijlstra <peterz@infradead.org>, catalin.marinas@arm.com, james.morse@arm.com, will@kernel.org, Ard Biesheuvel <ardb@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	[v2] arm64: implement support for static call trampolines \| expand [v2] arm64: implement support for static call trampolines

Ard Biesheuvel Oct. 28, 2020, 6:41 p.m. UTC

Implement arm64 support for the 'unoptimized' static call variety,
which routes all calls through a single trampoline that is patched
to perform a tail call to the selected function.

Since static call targets may be located in modules loaded out of
direct branching range, we need to be able to fall back to issuing
a ADRP/ADD pair to load the branch target into R16 and use a BR
instruction. As this involves patching more than a single B or NOP
instruction (for which the architecture makes special provisions
in terms of the synchronization needed), we may need to run the
full blown instruction patching logic that uses stop_machine(). It
also means that once we've patched in a ADRP/ADD pair once, we are
quite restricted in the patching we can code subsequently, and we
may end up using an indirect call after all (note that 

Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
---
v2:
This turned nasty really quickly when I realized that any sleeping task
could have been interrupted right in the middle of the ADRP/ADD pair
that we emit for static call targets that are out of immediate branching
range.

 arch/arm64/Kconfig                   |   1 +
 arch/arm64/include/asm/static_call.h |  32 +++++
 arch/arm64/kernel/Makefile           |   2 +-
 arch/arm64/kernel/static_call.c      | 129 ++++++++++++++++++++
 arch/arm64/kernel/vmlinux.lds.S      |   1 +
 5 files changed, 164 insertions(+), 1 deletion(-)

Peter Zijlstra Oct. 29, 2020, 10:28 a.m. UTC | #1

On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> This turned nasty really quickly when I realized that any sleeping task
> could have been interrupted right in the middle of the ADRP/ADD pair
> that we emit for static call targets that are out of immediate branching
> range.

Ha! Yes, that's a 'problem' allright. On x86 we normally only patch a
single instruction because of this, however we have some kprobe code
that has this issue and there we use tasks_rcu when PREEMPT=y.

Now, let me try and read this patch properly..

Peter Zijlstra Oct. 29, 2020, 10:40 a.m. UTC | #2

On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> +/*
> + * The static call trampoline consists of one of the following sequences:
> + *
> + *      (A)           (B)           (C)           (D)           (E)
> + * 00: BTI  C        BTI  C        BTI  C        BTI  C        BTI  C
> + * 04: B    fn       NOP           NOP           NOP           NOP
> + * 08: RET           RET           ADRP X16, fn  ADRP X16, fn  ADRP X16, fn
> + * 0c: NOP           NOP           ADD  X16, fn  ADD  X16, fn  ADD  X16, fn
> + * 10:                             BR   X16      RET           NOP
> + * 14:                                                         ADRP X16, &fn
> + * 18:                                                         LDR  X16, [X16, &fn]
> + * 1c:                                                         BR   X16
> + *
> + * The architecture permits us to patch B instructions into NOPs or vice versa
> + * directly, but patching any other instruction sequence requires careful
> + * synchronization. Since branch targets may be out of range for ordinary
> + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> + * sequences in some cases, which complicates things considerably; since any
> + * sleeping tasks may have been preempted right in the middle of any of these
> + * sequences, we have to carefully transform one into the other, and ensure
> + * that it is safe to resume execution at any point in the sequence for tasks
> + * that have already executed part of it.
> + *
> + * So the rules are:
> + * - we start out with (A) or (B)
> + * - a branch within immediate range can always be patched in at offset 0x4;
> + * - sequence (A) can be turned into (B) for NULL branch targets;
> + * - a branch outside of immediate range can be patched using (C), but only if
> + *   . the sequence being updated is (A) or (B), or
> + *   . the branch target address modulo 4k results in the same ADD opcode
> + *     (which could occur when patching the same far target a second time)
> + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> + *   in a NULL target now requires sequence (D);
> + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> + *   which loads the function pointer from memory.
> + *
> + * If we abide by these rules, then the following must hold for tasks that were
> + * interrupted halfway through execution of the trampoline:
> + * - when resuming at offset 0x8, we can only encounter a RET if (B) or (D)
> + *   was patched in at any point, and therefore a NULL target is valid;
> + * - when resuming at offset 0xc, we are executing the ADD opcode that is only
> + *   reachable via the preceding ADRP, and which is patched in only a single
> + *   time, and is therefore guaranteed to be consistent with the ADRP target;
> + * - when resuming at offset 0x10, X16 must refer to a valid target, since it
> + *   is only reachable via a ADRP/ADD pair that is guaranteed to be consistent.
> + *
> + * Note that sequence (E) is only used when switching between multiple far
> + * targets, and that it is not a terminal degraded state.
> + */

Would it make things easier if your trampoline consisted of two complete
slots, between which you can flip?

Something like:

	0x00	B 0x24 / NOP
	0x04    < slot 1 >
		....
	0x20
	0x24	< slot 2 >
		....
	0x40

Then each (20 byte) slot can contain any of the variants above and you
can write the unused slot without stop-machine. Then, when the unused
slot is populated, flip the initial instruction (like a static-branch),
issue synchronize_rcu_tasks() and flip to using the other slot for next
time.

Alternatively, you can patch the call-sites to point to the alternative
trampoline slot, but that might be pushing things a bit.

Ard Biesheuvel Oct. 29, 2020, 10:58 a.m. UTC | #3

On Thu, 29 Oct 2020 at 11:40, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> > +/*
> > + * The static call trampoline consists of one of the following sequences:
> > + *
> > + *      (A)           (B)           (C)           (D)           (E)
> > + * 00: BTI  C        BTI  C        BTI  C        BTI  C        BTI  C
> > + * 04: B    fn       NOP           NOP           NOP           NOP
> > + * 08: RET           RET           ADRP X16, fn  ADRP X16, fn  ADRP X16, fn
> > + * 0c: NOP           NOP           ADD  X16, fn  ADD  X16, fn  ADD  X16, fn
> > + * 10:                             BR   X16      RET           NOP
> > + * 14:                                                         ADRP X16, &fn
> > + * 18:                                                         LDR  X16, [X16, &fn]
> > + * 1c:                                                         BR   X16
> > + *
> > + * The architecture permits us to patch B instructions into NOPs or vice versa
> > + * directly, but patching any other instruction sequence requires careful
> > + * synchronization. Since branch targets may be out of range for ordinary
> > + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> > + * sequences in some cases, which complicates things considerably; since any
> > + * sleeping tasks may have been preempted right in the middle of any of these
> > + * sequences, we have to carefully transform one into the other, and ensure
> > + * that it is safe to resume execution at any point in the sequence for tasks
> > + * that have already executed part of it.
> > + *
> > + * So the rules are:
> > + * - we start out with (A) or (B)
> > + * - a branch within immediate range can always be patched in at offset 0x4;
> > + * - sequence (A) can be turned into (B) for NULL branch targets;
> > + * - a branch outside of immediate range can be patched using (C), but only if
> > + *   . the sequence being updated is (A) or (B), or
> > + *   . the branch target address modulo 4k results in the same ADD opcode
> > + *     (which could occur when patching the same far target a second time)
> > + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> > + *   in a NULL target now requires sequence (D);
> > + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> > + *   which loads the function pointer from memory.
> > + *
> > + * If we abide by these rules, then the following must hold for tasks that were
> > + * interrupted halfway through execution of the trampoline:
> > + * - when resuming at offset 0x8, we can only encounter a RET if (B) or (D)
> > + *   was patched in at any point, and therefore a NULL target is valid;
> > + * - when resuming at offset 0xc, we are executing the ADD opcode that is only
> > + *   reachable via the preceding ADRP, and which is patched in only a single
> > + *   time, and is therefore guaranteed to be consistent with the ADRP target;
> > + * - when resuming at offset 0x10, X16 must refer to a valid target, since it
> > + *   is only reachable via a ADRP/ADD pair that is guaranteed to be consistent.
> > + *
> > + * Note that sequence (E) is only used when switching between multiple far
> > + * targets, and that it is not a terminal degraded state.
> > + */
>
> Would it make things easier if your trampoline consisted of two complete
> slots, between which you can flip?
>
> Something like:
>
>         0x00    B 0x24 / NOP
>         0x04    < slot 1 >
>                 ....
>         0x20
>         0x24    < slot 2 >
>                 ....
>         0x40
>
> Then each (20 byte) slot can contain any of the variants above and you
> can write the unused slot without stop-machine. Then, when the unused
> slot is populated, flip the initial instruction (like a static-branch),
> issue synchronize_rcu_tasks() and flip to using the other slot for next
> time.
>

Once we've populated a slot and activated it, we have to assume that
it is live and we can no longer modify it freely.

> Alternatively, you can patch the call-sites to point to the alternative
> trampoline slot, but that might be pushing things a bit.

Quentin Perret Oct. 29, 2020, 11:27 a.m. UTC | #4

Hi Ard,

On Wednesday 28 Oct 2020 at 19:41:14 (+0100), Ard Biesheuvel wrote:
> diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
> new file mode 100644
> index 000000000000..7ddf939d57f5
> --- /dev/null
> +++ b/arch/arm64/include/asm/static_call.h
> @@ -0,0 +1,32 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ASM_STATIC_CALL_H
> +#define _ASM_STATIC_CALL_H
> +
> +/*
> + * We have to account for the possibility that the static call site may
> + * be updated to refer to a target that is out of range for an ordinary
> + * 'B' branch instruction, and so we need to pre-allocate some space for
> + * a ADRP/ADD/BR sequence.
> + */
> +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)			    \
> +	asm(".pushsection	.static_call.text, \"ax\"		\n" \
> +	    ".align		5					\n" \
> +	    ".globl		" STATIC_CALL_TRAMP_STR(name) "		\n" \
> +	    STATIC_CALL_TRAMP_STR(name) ":				\n" \
> +	    "hint 	34	/* BTI C */				\n" \
> +	    insn "							\n" \
> +	    "ret							\n" \
> +	    "nop							\n" \
> +	    "nop							\n" \
> +	    "adrp	x16, " STATIC_CALL_KEY(name) "			\n" \
> +	    "ldr	x16, [x16, :lo12:" STATIC_CALL_KEY(name) "]	\n" \
> +	    "br		x16						\n" \
> +	    ".popsection						\n")

Still trying to understand all this in details, so bear with me, but is
there any way this could be turned into the 'inline' static call
variant?

That is, could we have a piece of inline assembly at all static_call
locations that would do a branch-link, with basically x0-x18 and
x29,x30 in the clobber list (+ some have some magic to place the
parameters in the right register, like we do for SMCCC calls for
instance). That'd save a branch to the trampoline.

Ideally we'd need a way to tell the compile 'this inline assembly code
does a function call', but I'm not aware of any easy way to do that
(other that marking register clobbered + probably other things that I'm
missing).

In any case, I was looking forward to an arm64 static call port so
thanks for putting that together.

Quentin

Ard Biesheuvel Oct. 29, 2020, 11:32 a.m. UTC | #5

On Thu, 29 Oct 2020 at 12:27, Quentin Perret <qperret@google.com> wrote:
>
> Hi Ard,
>
> On Wednesday 28 Oct 2020 at 19:41:14 (+0100), Ard Biesheuvel wrote:
> > diff --git a/arch/arm64/include/asm/static_call.h b/arch/arm64/include/asm/static_call.h
> > new file mode 100644
> > index 000000000000..7ddf939d57f5
> > --- /dev/null
> > +++ b/arch/arm64/include/asm/static_call.h
> > @@ -0,0 +1,32 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_STATIC_CALL_H
> > +#define _ASM_STATIC_CALL_H
> > +
> > +/*
> > + * We have to account for the possibility that the static call site may
> > + * be updated to refer to a target that is out of range for an ordinary
> > + * 'B' branch instruction, and so we need to pre-allocate some space for
> > + * a ADRP/ADD/BR sequence.
> > + */
> > +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)                      \
> > +     asm(".pushsection       .static_call.text, \"ax\"               \n" \
> > +         ".align             5                                       \n" \
> > +         ".globl             " STATIC_CALL_TRAMP_STR(name) "         \n" \
> > +         STATIC_CALL_TRAMP_STR(name) ":                              \n" \
> > +         "hint       34      /* BTI C */                             \n" \
> > +         insn "                                                      \n" \
> > +         "ret                                                        \n" \
> > +         "nop                                                        \n" \
> > +         "nop                                                        \n" \
> > +         "adrp       x16, " STATIC_CALL_KEY(name) "                  \n" \
> > +         "ldr        x16, [x16, :lo12:" STATIC_CALL_KEY(name) "]     \n" \
> > +         "br         x16                                             \n" \
> > +         ".popsection                                                \n")
>
> Still trying to understand all this in details, so bear with me, but is
> there any way this could be turned into the 'inline' static call
> variant?
>
> That is, could we have a piece of inline assembly at all static_call
> locations that would do a branch-link, with basically x0-x18 and
> x29,x30 in the clobber list (+ some have some magic to place the
> parameters in the right register, like we do for SMCCC calls for
> instance). That'd save a branch to the trampoline.
>
> Ideally we'd need a way to tell the compile 'this inline assembly code
> does a function call', but I'm not aware of any easy way to do that
> (other that marking register clobbered + probably other things that I'm
> missing).
>
> In any case, I was looking forward to an arm64 static call port so
> thanks for putting that together.
>

Hello Quentin,

We'll need tooling along the lines of the GCC plugin I wrote [0] to
implement inline static calls - doing function calls from inline
assembly is too messy, too fragile and too error prone to rely on.

However, as I discussed with Will offline yesterday as well, the
question that got snowed under is whether we need any of this on arm64
in the first place. It seems highly unlikely that inline static calls
are worth it, and even out-of-line static calls are probably not worth
the hassle as we don't have the retpoline problem.

So this code should be considered an invitation for discussion, and
perhaps someone can invent a use case where benchmarks can show a
worthwhile improvement. But let's not get ahead of ourselves.


[0] https://git.kernel.org/pub/scm/linux/kernel/git/ardb/linux.git/log/?h=static-calls

Peter Zijlstra Oct. 29, 2020, 11:44 a.m. UTC | #6

On Thu, Oct 29, 2020 at 12:32:50PM +0100, Ard Biesheuvel wrote:
> However, as I discussed with Will offline yesterday as well, the
> question that got snowed under is whether we need any of this on arm64
> in the first place. It seems highly unlikely that inline static calls
> are worth it, and even out-of-line static calls are probably not worth
> the hassle as we don't have the retpoline problem.
> 
> So this code should be considered an invitation for discussion, and
> perhaps someone can invent a use case where benchmarks can show a
> worthwhile improvement. But let's not get ahead of ourselves.

So the obvious benefit is not having to do the extra load. Any indirect
call will have to do a load first, which can miss etc.. And yes,
retpoline is a horrible mess as well.

Doing the direct vs indirect saves one I$ miss I suppose, which can be
noticable.

IIRC Steve had benchmarks for the ftrace conversion, which is now
upstream, so that should be simple enough to run.

Steve, remember how to get numbers out of that?

Peter Zijlstra Oct. 29, 2020, 11:46 a.m. UTC | #7

On Thu, Oct 29, 2020 at 11:58:52AM +0100, Ard Biesheuvel wrote:
> On Thu, 29 Oct 2020 at 11:40, Peter Zijlstra <peterz@infradead.org> wrote:

> > Would it make things easier if your trampoline consisted of two complete
> > slots, between which you can flip?
> >
> > Something like:
> >
> >         0x00    B 0x24 / NOP
> >         0x04    < slot 1 >
> >                 ....
> >         0x20
> >         0x24    < slot 2 >
> >                 ....
> >         0x40
> >
> > Then each (20 byte) slot can contain any of the variants above and you
> > can write the unused slot without stop-machine. Then, when the unused
> > slot is populated, flip the initial instruction (like a static-branch),
> > issue synchronize_rcu_tasks() and flip to using the other slot for next
> > time.
> >
> 
> Once we've populated a slot and activated it, we have to assume that
> it is live and we can no longer modify it freely.

Urhm how so? Once you pass through synchronize_rcu_tasks() nobody should
still be using the old one.

Ard Biesheuvel Oct. 29, 2020, 11:49 a.m. UTC | #8

On Thu, 29 Oct 2020 at 12:46, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 29, 2020 at 11:58:52AM +0100, Ard Biesheuvel wrote:
> > On Thu, 29 Oct 2020 at 11:40, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Would it make things easier if your trampoline consisted of two complete
> > > slots, between which you can flip?
> > >
> > > Something like:
> > >
> > >         0x00    B 0x24 / NOP
> > >         0x04    < slot 1 >
> > >                 ....
> > >         0x20
> > >         0x24    < slot 2 >
> > >                 ....
> > >         0x40
> > >
> > > Then each (20 byte) slot can contain any of the variants above and you
> > > can write the unused slot without stop-machine. Then, when the unused
> > > slot is populated, flip the initial instruction (like a static-branch),
> > > issue synchronize_rcu_tasks() and flip to using the other slot for next
> > > time.
> > >
> >
> > Once we've populated a slot and activated it, we have to assume that
> > it is live and we can no longer modify it freely.
>
> Urhm how so? Once you pass through synchronize_rcu_tasks() nobody should
> still be using the old one.

But doesn't that require some RCU foo in the static call wrapper itself?

Mark Rutland Oct. 29, 2020, 11:50 a.m. UTC | #9

Hi Ard,

On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> Implement arm64 support for the 'unoptimized' static call variety,
> which routes all calls through a single trampoline that is patched
> to perform a tail call to the selected function.

Given the complexity and subtlety here, do we actually need this?

If there's some core feature that depends on this (or wants to), it'd be
nice to call that out in the commit message.

> Since static call targets may be located in modules loaded out of
> direct branching range, we need to be able to fall back to issuing
> a ADRP/ADD pair to load the branch target into R16 and use a BR
> instruction. As this involves patching more than a single B or NOP
> instruction (for which the architecture makes special provisions
> in terms of the synchronization needed), we may need to run the
> full blown instruction patching logic that uses stop_machine(). It
> also means that once we've patched in a ADRP/ADD pair once, we are
> quite restricted in the patching we can code subsequently, and we
> may end up using an indirect call after all (note that 

Noted. I guess we

[...]

> v2:
> This turned nasty really quickly when I realized that any sleeping task
> could have been interrupted right in the middle of the ADRP/ADD pair
> that we emit for static call targets that are out of immediate branching
> range.

> +/*
> + * The static call trampoline consists of one of the following sequences:
> + *
> + *      (A)           (B)           (C)           (D)           (E)
> + * 00: BTI  C        BTI  C        BTI  C        BTI  C        BTI  C
> + * 04: B    fn       NOP           NOP           NOP           NOP
> + * 08: RET           RET           ADRP X16, fn  ADRP X16, fn  ADRP X16, fn
> + * 0c: NOP           NOP           ADD  X16, fn  ADD  X16, fn  ADD  X16, fn
> + * 10:                             BR   X16      RET           NOP
> + * 14:                                                         ADRP X16, &fn
> + * 18:                                                         LDR  X16, [X16, &fn]
> + * 1c:                                                         BR   X16

Are these all padded with trailing NOPs or UDFs? I assume the same space is
statically reserved.

> + *
> + * The architecture permits us to patch B instructions into NOPs or vice versa
> + * directly, but patching any other instruction sequence requires careful
> + * synchronization. Since branch targets may be out of range for ordinary
> + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> + * sequences in some cases, which complicates things considerably; since any
> + * sleeping tasks may have been preempted right in the middle of any of these
> + * sequences, we have to carefully transform one into the other, and ensure
> + * that it is safe to resume execution at any point in the sequence for tasks
> + * that have already executed part of it.
> + *
> + * So the rules are:
> + * - we start out with (A) or (B)
> + * - a branch within immediate range can always be patched in at offset 0x4;
> + * - sequence (A) can be turned into (B) for NULL branch targets;
> + * - a branch outside of immediate range can be patched using (C), but only if
> + *   . the sequence being updated is (A) or (B), or
> + *   . the branch target address modulo 4k results in the same ADD opcode
> + *     (which could occur when patching the same far target a second time)
> + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> + *   in a NULL target now requires sequence (D);
> + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> + *   which loads the function pointer from memory.

Cases C-E all use an indirect branch, which goes against one of the
arguments for having static calls (the assumption that CPUs won't
mis-predict direct branches). Similarly case E is a literal pool with
more steps.

That means that for us, static calls would only be an opportunistic
optimization rather than a hardening feature. Do they actually save us
much, or could we get by with an inline literal pool in the trampoline?

It'd be much easier to reason consistently if the trampoline were
always:

| 	BTI C
| 	LDR X16, _literal // pc-relative load
| 	BR X16
| _literal:
| 	< patch a 64-bit value here atomically >

... and I would strongly prefer that to having multiple sequences that
could all be live -- I'm really not keen on the complexity and subtlety
that entails.

[...]

> + * Note that sequence (E) is only used when switching between multiple far
> + * targets, and that it is not a terminal degraded state.

I think what you're saying here is that you can go from (E) to (C) if
switching to a new far branch that's in ADRP+ADD range, but it can be
misread to mean (E) is a transient step while patching a branch rather
than some branches only being possible to encode with (E).

Thanks,
Mark.

Peter Zijlstra Oct. 29, 2020, 11:54 a.m. UTC | #10

On Thu, Oct 29, 2020 at 12:49:52PM +0100, Ard Biesheuvel wrote:
> On Thu, 29 Oct 2020 at 12:46, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Oct 29, 2020 at 11:58:52AM +0100, Ard Biesheuvel wrote:
> > > On Thu, 29 Oct 2020 at 11:40, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > > > Would it make things easier if your trampoline consisted of two complete
> > > > slots, between which you can flip?
> > > >
> > > > Something like:
> > > >
> > > >         0x00    B 0x24 / NOP
> > > >         0x04    < slot 1 >
> > > >                 ....
> > > >         0x20
> > > >         0x24    < slot 2 >
> > > >                 ....
> > > >         0x40
> > > >
> > > > Then each (20 byte) slot can contain any of the variants above and you
> > > > can write the unused slot without stop-machine. Then, when the unused
> > > > slot is populated, flip the initial instruction (like a static-branch),
> > > > issue synchronize_rcu_tasks() and flip to using the other slot for next
> > > > time.
> > > >
> > >
> > > Once we've populated a slot and activated it, we have to assume that
> > > it is live and we can no longer modify it freely.
> >
> > Urhm how so? Once you pass through synchronize_rcu_tasks() nobody should
> > still be using the old one.
> 
> But doesn't that require some RCU foo in the static call wrapper itself?

Nope, synchronize_rcu_tasks() ensures that every task will pass a safe
point (aka. voluntary schedule()) before returning. This then guarantees
that, no matter where they were preempted, they'll have left (and will
observe the new state).

Quentin Perret Oct. 29, 2020, 11:54 a.m. UTC | #11

On Thursday 29 Oct 2020 at 12:32:50 (+0100), Ard Biesheuvel wrote:
> We'll need tooling along the lines of the GCC plugin I wrote [0] to
> implement inline static calls - doing function calls from inline
> assembly is too messy, too fragile and too error prone to rely on.

Right, and that is the gut feeling I had too, but I can't quite put my
finger on what exactly can go wrong. Any pointers?

> However, as I discussed with Will offline yesterday as well, the
> question that got snowed under is whether we need any of this on arm64
> in the first place. It seems highly unlikely that inline static calls
> are worth it, and even out-of-line static calls are probably not worth
> the hassle as we don't have the retpoline problem.
> 
> So this code should be considered an invitation for discussion, and
> perhaps someone can invent a use case where benchmarks can show a
> worthwhile improvement. But let's not get ahead of ourselves.

The reason I'm interested in this is because Android makes heavy use of
trace points/hooks, so any potential improvement in this area would be
welcome. Now I agree we need numbers to show the benefit is real before
this can be considered for inclusion in the kernel. I'll try and see if
we can get something.

Thanks,
Quentin

Peter Zijlstra Oct. 29, 2020, 11:58 a.m. UTC | #12

On Thu, Oct 29, 2020 at 11:50:26AM +0000, Mark Rutland wrote:
> Hi Ard,
> 
> On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> > Implement arm64 support for the 'unoptimized' static call variety,
> > which routes all calls through a single trampoline that is patched
> > to perform a tail call to the selected function.
> 
> Given the complexity and subtlety here, do we actually need this?

Only if you can get a performance win. The obvious benefit is loosing
the load that's inherent in indirect function calls. The down-side of
the indirect static-call implementation is that it will incur an extra
I$ miss.

So it might be a wash, loose a data load miss, gain an I$ miss.

The direct method (patching the call-site, where possible) would
alleviate that (mostly) and be more of a win.

Ard Biesheuvel Oct. 29, 2020, 11:59 a.m. UTC | #13

Hi Mark,

On Thu, 29 Oct 2020 at 12:50, Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Ard,
>
> On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> > Implement arm64 support for the 'unoptimized' static call variety,
> > which routes all calls through a single trampoline that is patched
> > to perform a tail call to the selected function.
>
> Given the complexity and subtlety here, do we actually need this?
>

The more time I spend on this, the more I lean towards 'no' :-)

> If there's some core feature that depends on this (or wants to), it'd be
> nice to call that out in the commit message.
>

Agreed. Perhaps I should have put RFC in the subject, because this is
more intended as a call for discussion (and v1 had little success in
that regard)

> > Since static call targets may be located in modules loaded out of
> > direct branching range, we need to be able to fall back to issuing
> > a ADRP/ADD pair to load the branch target into R16 and use a BR
> > instruction. As this involves patching more than a single B or NOP
> > instruction (for which the architecture makes special provisions
> > in terms of the synchronization needed), we may need to run the
> > full blown instruction patching logic that uses stop_machine(). It
> > also means that once we've patched in a ADRP/ADD pair once, we are
> > quite restricted in the patching we can code subsequently, and we
> > may end up using an indirect call after all (note that
>
> Noted. I guess we
>
> [...]
>

?

> > v2:
> > This turned nasty really quickly when I realized that any sleeping task
> > could have been interrupted right in the middle of the ADRP/ADD pair
> > that we emit for static call targets that are out of immediate branching
> > range.
>
> > +/*
> > + * The static call trampoline consists of one of the following sequences:
> > + *
> > + *      (A)           (B)           (C)           (D)           (E)
> > + * 00: BTI  C        BTI  C        BTI  C        BTI  C        BTI  C
> > + * 04: B    fn       NOP           NOP           NOP           NOP
> > + * 08: RET           RET           ADRP X16, fn  ADRP X16, fn  ADRP X16, fn
> > + * 0c: NOP           NOP           ADD  X16, fn  ADD  X16, fn  ADD  X16, fn
> > + * 10:                             BR   X16      RET           NOP
> > + * 14:                                                         ADRP X16, &fn
> > + * 18:                                                         LDR  X16, [X16, &fn]
> > + * 1c:                                                         BR   X16
>
> Are these all padded with trailing NOPs or UDFs? I assume the same space is
> statically reserved.
>

Yes, it is a fixed size 32 byte area. I only included the relevant
ones here, the empty slots are D/C in the context of each individual
sequence.


> > + *
> > + * The architecture permits us to patch B instructions into NOPs or vice versa
> > + * directly, but patching any other instruction sequence requires careful
> > + * synchronization. Since branch targets may be out of range for ordinary
> > + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> > + * sequences in some cases, which complicates things considerably; since any
> > + * sleeping tasks may have been preempted right in the middle of any of these
> > + * sequences, we have to carefully transform one into the other, and ensure
> > + * that it is safe to resume execution at any point in the sequence for tasks
> > + * that have already executed part of it.
> > + *
> > + * So the rules are:
> > + * - we start out with (A) or (B)
> > + * - a branch within immediate range can always be patched in at offset 0x4;
> > + * - sequence (A) can be turned into (B) for NULL branch targets;
> > + * - a branch outside of immediate range can be patched using (C), but only if
> > + *   . the sequence being updated is (A) or (B), or
> > + *   . the branch target address modulo 4k results in the same ADD opcode
> > + *     (which could occur when patching the same far target a second time)
> > + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> > + *   in a NULL target now requires sequence (D);
> > + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> > + *   which loads the function pointer from memory.
>
> Cases C-E all use an indirect branch, which goes against one of the
> arguments for having static calls (the assumption that CPUs won't
> mis-predict direct branches). Similarly case E is a literal pool with
> more steps.
>
> That means that for us, static calls would only be an opportunistic
> optimization rather than a hardening feature. Do they actually save us
> much, or could we get by with an inline literal pool in the trampoline?
>

Another assumption this is based on is that a literal load is more
costly than a ADRP/ADD.

> It'd be much easier to reason consistently if the trampoline were
> always:
>
> |       BTI C
> |       LDR X16, _literal // pc-relative load
> |       BR X16
> | _literal:
> |       < patch a 64-bit value here atomically >
>
> ... and I would strongly prefer that to having multiple sequences that
> could all be live -- I'm really not keen on the complexity and subtlety
> that entails.
>

I don't see this having any benefit over a ADRP/LDR pair that accesses
the static call key struct directly.

> [...]
>
> > + * Note that sequence (E) is only used when switching between multiple far
> > + * targets, and that it is not a terminal degraded state.
>
> I think what you're saying here is that you can go from (E) to (C) if
> switching to a new far branch that's in ADRP+ADD range, but it can be
> misread to mean (E) is a transient step while patching a branch rather
> than some branches only being possible to encode with (E).
>

Indeed.

Ard Biesheuvel Oct. 29, 2020, 12:14 p.m. UTC | #14

On Thu, 29 Oct 2020 at 12:54, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Oct 29, 2020 at 12:49:52PM +0100, Ard Biesheuvel wrote:
> > On Thu, 29 Oct 2020 at 12:46, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Thu, Oct 29, 2020 at 11:58:52AM +0100, Ard Biesheuvel wrote:
> > > > On Thu, 29 Oct 2020 at 11:40, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > > > Would it make things easier if your trampoline consisted of two complete
> > > > > slots, between which you can flip?
> > > > >
> > > > > Something like:
> > > > >
> > > > >         0x00    B 0x24 / NOP
> > > > >         0x04    < slot 1 >
> > > > >                 ....
> > > > >         0x20
> > > > >         0x24    < slot 2 >
> > > > >                 ....
> > > > >         0x40
> > > > >
> > > > > Then each (20 byte) slot can contain any of the variants above and you
> > > > > can write the unused slot without stop-machine. Then, when the unused
> > > > > slot is populated, flip the initial instruction (like a static-branch),
> > > > > issue synchronize_rcu_tasks() and flip to using the other slot for next
> > > > > time.
> > > > >
> > > >
> > > > Once we've populated a slot and activated it, we have to assume that
> > > > it is live and we can no longer modify it freely.
> > >
> > > Urhm how so? Once you pass through synchronize_rcu_tasks() nobody should
> > > still be using the old one.
> >
> > But doesn't that require some RCU foo in the static call wrapper itself?
>
> Nope, synchronize_rcu_tasks() ensures that every task will pass a safe
> point (aka. voluntary schedule()) before returning. This then guarantees
> that, no matter where they were preempted, they'll have left (and will
> observe the new state).
>

Ah, wonderful. Yes, that does simplify things substantially.

Mark Rutland Oct. 29, 2020, 1:21 p.m. UTC | #15

On Thu, Oct 29, 2020 at 12:59:43PM +0100, Ard Biesheuvel wrote:
> On Thu, 29 Oct 2020 at 12:50, Mark Rutland <mark.rutland@arm.com> wrote:
> > On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> > > Implement arm64 support for the 'unoptimized' static call variety,
> > > which routes all calls through a single trampoline that is patched
> > > to perform a tail call to the selected function.
> >
> > Given the complexity and subtlety here, do we actually need this?
> 
> The more time I spend on this, the more I lean towards 'no' :-)

That would make this all simpler! :)

[...]
 
> > > Since static call targets may be located in modules loaded out of
> > > direct branching range, we need to be able to fall back to issuing
> > > a ADRP/ADD pair to load the branch target into R16 and use a BR
> > > instruction. As this involves patching more than a single B or NOP
> > > instruction (for which the architecture makes special provisions
> > > in terms of the synchronization needed), we may need to run the
> > > full blown instruction patching logic that uses stop_machine(). It
> > > also means that once we've patched in a ADRP/ADD pair once, we are
> > > quite restricted in the patching we can code subsequently, and we
> > > may end up using an indirect call after all (note that
> >
> > Noted. I guess we
> >
> > [...]
> >
> 
> ?

Sorry; I was playing on the commit message ending on "(note that", as I
wasn't sure where that was going.

> > > + *
> > > + * The architecture permits us to patch B instructions into NOPs or vice versa
> > > + * directly, but patching any other instruction sequence requires careful
> > > + * synchronization. Since branch targets may be out of range for ordinary
> > > + * immediate branch instructions, we may have to fall back to ADRP/ADD/BR
> > > + * sequences in some cases, which complicates things considerably; since any
> > > + * sleeping tasks may have been preempted right in the middle of any of these
> > > + * sequences, we have to carefully transform one into the other, and ensure
> > > + * that it is safe to resume execution at any point in the sequence for tasks
> > > + * that have already executed part of it.
> > > + *
> > > + * So the rules are:
> > > + * - we start out with (A) or (B)
> > > + * - a branch within immediate range can always be patched in at offset 0x4;
> > > + * - sequence (A) can be turned into (B) for NULL branch targets;
> > > + * - a branch outside of immediate range can be patched using (C), but only if
> > > + *   . the sequence being updated is (A) or (B), or
> > > + *   . the branch target address modulo 4k results in the same ADD opcode
> > > + *     (which could occur when patching the same far target a second time)
> > > + * - once we have patched in (C) we cannot go back to (A) or (B), so patching
> > > + *   in a NULL target now requires sequence (D);
> > > + * - if we cannot patch a far target using (C), we fall back to sequence (E),
> > > + *   which loads the function pointer from memory.
> >
> > Cases C-E all use an indirect branch, which goes against one of the
> > arguments for having static calls (the assumption that CPUs won't
> > mis-predict direct branches). Similarly case E is a literal pool with
> > more steps.
> >
> > That means that for us, static calls would only be an opportunistic
> > optimization rather than a hardening feature. Do they actually save us
> > much, or could we get by with an inline literal pool in the trampoline?
> 
> Another assumption this is based on is that a literal load is more
> costly than a ADRP/ADD.

Agreed. I think in practice it's going to depend on the surrounding
context and microarchitecture. If the result is being fed into a BR, I'd
expect no difference on a big OoO core, and even for a small in-order
core it'll depend on how/when the core can forward the result relative
to predicting the branch.

> > It'd be much easier to reason consistently if the trampoline were
> > always:
> >
> > |       BTI C
> > |       LDR X16, _literal // pc-relative load
> > |       BR X16
> > | _literal:
> > |       < patch a 64-bit value here atomically >
> >
> > ... and I would strongly prefer that to having multiple sequences that
> > could all be live -- I'm really not keen on the complexity and subtlety
> > that entails.
> 
> I don't see this having any benefit over a ADRP/LDR pair that accesses
> the static call key struct directly.

Even better!

Thanks,
Mark.

Ard Biesheuvel Oct. 29, 2020, 1:22 p.m. UTC | #16

On Thu, 29 Oct 2020 at 12:54, Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 29 Oct 2020 at 12:32:50 (+0100), Ard Biesheuvel wrote:
> > We'll need tooling along the lines of the GCC plugin I wrote [0] to
> > implement inline static calls - doing function calls from inline
> > assembly is too messy, too fragile and too error prone to rely on.
>
> Right, and that is the gut feeling I had too, but I can't quite put my
> finger on what exactly can go wrong. Any pointers?
>

The main problem is that the compiler is not aware that a function
call is being made. For instance, this code

int fred(int l)
{
asm("bl barney"
:"+r"(l)
::"x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9",
  "x10", "x11", "x12", "x13", "x14", "x15", "x16", "x17", "x30",
  "cc");

return l + 1;
}

works with GCC, but only if I omit the x29 clobber, or it gives an
error. With that omitted, Clang produces

.type fred,@function
fred:                                   // @fred
// %bb.0:
str x30, [sp, #-16]!        // 8-byte Folded Spill
//APP
bl barney
//NO_APP
add w0, w0, #1              // =1
ldr x30, [sp], #16          // 8-byte Folded Reload
ret

i.e., it does not set up the stack frame correctly, and so we lose the
ability to unwind the stack through this call.

There may be some other corner cases, although I agree it makes sense
to get to the bottom of this.


> > However, as I discussed with Will offline yesterday as well, the
> > question that got snowed under is whether we need any of this on arm64
> > in the first place. It seems highly unlikely that inline static calls
> > are worth it, and even out-of-line static calls are probably not worth
> > the hassle as we don't have the retpoline problem.
> >
> > So this code should be considered an invitation for discussion, and
> > perhaps someone can invent a use case where benchmarks can show a
> > worthwhile improvement. But let's not get ahead of ourselves.
>
> The reason I'm interested in this is because Android makes heavy use of
> trace points/hooks, so any potential improvement in this area would be
> welcome. Now I agree we need numbers to show the benefit is real before
> this can be considered for inclusion in the kernel. I'll try and see if
> we can get something.
>
> Thanks,
> Quentin

Mark Rutland Oct. 29, 2020, 1:30 p.m. UTC | #17

On Thu, Oct 29, 2020 at 12:58:32PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 29, 2020 at 11:50:26AM +0000, Mark Rutland wrote:
> > Hi Ard,
> > 
> > On Wed, Oct 28, 2020 at 07:41:14PM +0100, Ard Biesheuvel wrote:
> > > Implement arm64 support for the 'unoptimized' static call variety,
> > > which routes all calls through a single trampoline that is patched
> > > to perform a tail call to the selected function.
> > 
> > Given the complexity and subtlety here, do we actually need this?
> 
> Only if you can get a performance win. The obvious benefit is loosing
> the load that's inherent in indirect function calls. The down-side of
> the indirect static-call implementation is that it will incur an extra
> I$ miss.
> 
> So it might be a wash, loose a data load miss, gain an I$ miss.

I reckon it'll be highly dependent on microarchitecture since it'll also
depend on how indirect branches are handled (with prediction,
forwarding, speculation, etc). I don't think we can easily reason about
this in general.

> The direct method (patching the call-site, where possible) would
> alleviate that (mostly) and be more of a win.

I think that where the original callsite can be patched with a direct
branch, it's desireable that we do so. That's simple enough, and there
are places where that'd be useful from a functional pov (e.g. if we want
to patch branches in hyp text to other hyp text).

However, if the range of the branch requires a trampoline I'd rather the
trampoline (and the procedure for updating it) be as simple as possible.

Thanks,
Mark.

Steven Rostedt Oct. 29, 2020, 2:10 p.m. UTC | #18

On Thu, 29 Oct 2020 12:44:57 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> IIRC Steve had benchmarks for the ftrace conversion, which is now
> upstream, so that should be simple enough to run.
> 
> Steve, remember how to get numbers out of that?

IIRC, all I did was, before and after the patches run:

	# trace-cmd start -e all

(which is equivalent to "echo 1 > /sys/kernel/tracing/events/enable")

and do:

	# perf stat -r 10 -a ./hackbench 50

Basically running perf stat on hackbench with all events being traced. As
the tracing does indirect jumps without Peter's patches and direct jumps
with the static calls, it gave me a good idea on how much it changed, as
hackbench triggers a lot of trace events when tracing is enabled.

Note, the number of events can change depending on the config. There's a
a few config options that could really stress it (like enabling preempt
enabled/disable tracepoints).

-- Steve

Quentin Perret Nov. 16, 2020, 10:18 a.m. UTC | #19

On Thursday 29 Oct 2020 at 11:54:42 (+0000), Quentin Perret wrote:
> The reason I'm interested in this is because Android makes heavy use of
> trace points/hooks, so any potential improvement in this area would be
> welcome. Now I agree we need numbers to show the benefit is real before
> this can be considered for inclusion in the kernel. I'll try and see if
> we can get something.

Following up on this as we've just figured out what was causing
performance issues in our use-case. Basically, we have a setup where
some modules attach to trace hooks for a few things (e.g. the pelt
scheduler hooks + other Android-specific hooks), and that appeared to
cause up ~6% perf regression on the Androbench benchmark.

The bulk of the regression came from a feature that is currently
Android-specific but should hopefully make it upstream (soon?): Control
Flow Integrity (CFI) -- see [1] for more details. In essence CFI is a
software-based cousin of BTI, which is basically about ensuring the
target of an indirect function call has a compatible prototype. This can
be relatively easily checked for potential targets that are known at
compile-time, but is a little harder when the targets are dynamically
loaded, hence causing extra overhead when the target is in a module.

Anyway, I don't think any of the above is particularly relevant to
upstream just yet, but I figured this would interesting to share. The
short-term fix for Android was to locally disable CFI checks around the
trace hooks that cause the perf regression, but I think static-calls
would be a preferable alternative to that (I'll try to confirm that
experimentally). And when/if CFI makes it upstream, then that may become
relevant to upstream as well, though the integration of CFI and
static-calls is not very clear yet.

Thanks,
Quentin

[1] https://www.youtube.com/watch?v=0Bj6W7qrOOI

Ard Biesheuvel Nov. 16, 2020, 10:31 a.m. UTC | #20

On Mon, 16 Nov 2020 at 11:18, Quentin Perret <qperret@google.com> wrote:
>
> On Thursday 29 Oct 2020 at 11:54:42 (+0000), Quentin Perret wrote:
> > The reason I'm interested in this is because Android makes heavy use of
> > trace points/hooks, so any potential improvement in this area would be
> > welcome. Now I agree we need numbers to show the benefit is real before
> > this can be considered for inclusion in the kernel. I'll try and see if
> > we can get something.
>
> Following up on this as we've just figured out what was causing
> performance issues in our use-case. Basically, we have a setup where
> some modules attach to trace hooks for a few things (e.g. the pelt
> scheduler hooks + other Android-specific hooks), and that appeared to
> cause up ~6% perf regression on the Androbench benchmark.
>
> The bulk of the regression came from a feature that is currently
> Android-specific but should hopefully make it upstream (soon?): Control
> Flow Integrity (CFI) -- see [1] for more details. In essence CFI is a
> software-based cousin of BTI, which is basically about ensuring the
> target of an indirect function call has a compatible prototype. This can
> be relatively easily checked for potential targets that are known at
> compile-time, but is a little harder when the targets are dynamically
> loaded, hence causing extra overhead when the target is in a module.
>
> Anyway, I don't think any of the above is particularly relevant to
> upstream just yet, but I figured this would interesting to share. The
> short-term fix for Android was to locally disable CFI checks around the
> trace hooks that cause the perf regression, but I think static-calls
> would be a preferable alternative to that (I'll try to confirm that
> experimentally). And when/if CFI makes it upstream, then that may become
> relevant to upstream as well, though the integration of CFI and
> static-calls is not very clear yet.
>

OK, so that would suggest that having at least the out-of-line
trampoline would help with CFI, but only because the indirect call is
decorated with CFI checks, not because the indirect call itself is any
slower.

So that suggests that something like

  bti    c
  ldr    x16, 0f
  br     x16
0:.quad  <target>

may well be sufficient in the arm64 case - it is hidden from the
assembler, so we don't get the CFI overhead, and since it is emitted
as .text (and therefore requires code patching to be updated), it does
not need the same level of protection that CFI offers elsewhere when
it comes to indirect calls.

Quentin Perret Nov. 16, 2020, 12:05 p.m. UTC | #21

On Monday 16 Nov 2020 at 11:31:10 (+0100), Ard Biesheuvel wrote:
> OK, so that would suggest that having at least the out-of-line
> trampoline would help with CFI, but only because the indirect call is
> decorated with CFI checks, not because the indirect call itself is any
> slower.

Right. By disabling CFI checks in Android we get something that is more
comparable to the inline static-call implementation as we get a 'raw'
indirect call. But yes, it's very likely that even an out-of-line static
call is going be much faster than a CFI-enabled indirect call, so
definitely worth a try.

> So that suggests that something like
> 
>   bti    c
>   ldr    x16, 0f
>   br     x16
> 0:.quad  <target>
> 
> may well be sufficient in the arm64 case - it is hidden from the
> assembler, so we don't get the CFI overhead, and since it is emitted
> as .text (and therefore requires code patching to be updated), it does
> not need the same level of protection that CFI offers elsewhere when
> it comes to indirect calls.

Agreed. I'm thinking the static-call infrastructure itself could perhaps
do the CFI target validation before actually patching the text. But I
suppose we probably have bigger problems if we can't trust whoever
initiated the static-call patching, so ...

Thanks,
Quentin

[v2] arm64: implement support for static call trampolines

Commit Message

Comments

Patch