
static_call,x86: Robustify trampoline patching

Message ID 20211030074758.GT174703@worktop.programming.kicks-ass.net (mailing list archive)
State Mainlined
Commit 2105a92748e83e2e3ee6be539da959706bbb3898
Delegated to: Kees Cook
Series: static_call,x86: Robustify trampoline patching

Commit Message

Peter Zijlstra Oct. 30, 2021, 7:47 a.m. UTC
On Fri, Oct 29, 2021 at 10:03:24PM +0200, Peter Zijlstra wrote:

> So I had a bit of a peek at what clang generates:
> 
>     3fa4:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi        3fa7: R_X86_64_32S      __SCK__x86_pmu_handle_irq
>     3fab:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi        3fae: R_X86_64_32S      __SCT__x86_pmu_handle_irq.cfi_jt
>     3fb2:       e8 00 00 00 00          call   3fb7 <init_hw_perf_events+0x1dc> 3fb3: R_X86_64_PLT32    __static_call_update-0x4
> 
> So this then gives the trampoline jump table entry to
> __static_call_update(), with the result that it will rewrite the
> jump-table entry, not the trampoline!
> 
> Now it so happens that the trampoline looks *exactly* like the
> jump-table entry (one jmp.d32 instruction), so in that regards it'll
> again 'work'.
> 
> But this is all really, as in *really*, wrong. And I'm really sad I'm
> the one to have to discover this, even though I've mentioned
> static_call()s being tricky in previous reviews.

The below makes the clang-cfi build properly sick:

[    0.000000] trampoline signature fail
[    0.000000] ------------[ cut here ]------------
[    0.000000] kernel BUG at arch/x86/kernel/static_call.c:65!

---
Subject: static_call,x86: Robustify trampoline patching

Add a few signature bytes after the static call trampoline and verify
those bytes match before patching the trampoline. This avoids patching
random other JMPs (such as CFI jump-table entries) instead.

These bytes decode as:

   d:   53                      push   %rbx
   e:   43 54                   rex.XB push %r12

And happen to spell "SCT".

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/include/asm/static_call.h |  1 +
 arch/x86/kernel/static_call.c      | 14 ++++++++++----
 tools/objtool/check.c              |  3 +++
 3 files changed, 14 insertions(+), 4 deletions(-)

Comments

Peter Zijlstra Oct. 30, 2021, 8:16 a.m. UTC | #1
On Sat, Oct 30, 2021 at 09:47:58AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 29, 2021 at 10:03:24PM +0200, Peter Zijlstra wrote:
> 
> > So I had a bit of a peek at what clang generates:
> > 
> >     3fa4:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi        3fa7: R_X86_64_32S      __SCK__x86_pmu_handle_irq
> >     3fab:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi        3fae: R_X86_64_32S      __SCT__x86_pmu_handle_irq.cfi_jt
> >     3fb2:       e8 00 00 00 00          call   3fb7 <init_hw_perf_events+0x1dc> 3fb3: R_X86_64_PLT32    __static_call_update-0x4
> > 
> > So this then gives the trampoline jump table entry to
> > __static_call_update(), with the result that it will rewrite the
> > jump-table entry, not the trampoline!
> > 
> > Now it so happens that the trampoline looks *exactly* like the
> > jump-table entry (one jmp.d32 instruction), so in that regards it'll
> > again 'work'.
> > 
> > But this is all really, as in *really*, wrong. And I'm really sad I'm
> > the one to have to discover this, even though I've mentioned
> > static_call()s being tricky in previous reviews.
> 
> The below makes the clang-cfi build properly sick:
> 
> [    0.000000] trampoline signature fail
> [    0.000000] ------------[ cut here ]------------
> [    0.000000] kernel BUG at arch/x86/kernel/static_call.c:65!

So fundamentally I think the whole notion that the address of a function
is something different than 'the address of that function' is an *utter*
fail.

So things like FineIBT use a scheme where they pass a hash value along
with the indirect call, which is verified at the other end. Can't we
adjust it like:


foo.cfi:
	endbr
	xorl $0xdeadbeef, %r10d
	jz foo
	ud2
	nop	# make it an even 16 bytes
foo:
	# actual function text


Then have the address of foo, be the address of foo, like any normal
sane person would expect. Have direct calls to foo, go to foo, again, as
expected.

When doing an indirect call (to r11, as clang does), then, and only
then, do:

	movl $0xdeadbeef, %r10d
	subq $0x10, %r11
	call *%r11

	# if the r11 lives, add:
	addq $0x10, %r11


Then only when caller and callee agree 0xdeadbeef is the password, does
the indirect call go through.

Why isn't this a suitable CFI scheme even without IBT?
Ard Biesheuvel Oct. 30, 2021, 5:19 p.m. UTC | #2
On Sat, 30 Oct 2021 at 09:49, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Oct 29, 2021 at 10:03:24PM +0200, Peter Zijlstra wrote:
>
> > So I had a bit of a peek at what clang generates:
> >
> >     3fa4:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi        3fa7: R_X86_64_32S      __SCK__x86_pmu_handle_irq
> >     3fab:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi        3fae: R_X86_64_32S      __SCT__x86_pmu_handle_irq.cfi_jt
> >     3fb2:       e8 00 00 00 00          call   3fb7 <init_hw_perf_events+0x1dc> 3fb3: R_X86_64_PLT32    __static_call_update-0x4
> >
> > So this then gives the trampoline jump table entry to
> > __static_call_update(), with the result that it will rewrite the
> > jump-table entry, not the trampoline!
> >
> > Now it so happens that the trampoline looks *exactly* like the
> > jump-table entry (one jmp.d32 instruction), so in that regards it'll
> > again 'work'.
> >
> > But this is all really, as in *really*, wrong. And I'm really sad I'm
> > the one to have to discover this, even though I've mentioned
> > static_call()s being tricky in previous reviews.
>
> The below makes the clang-cfi build properly sick:
>
> [    0.000000] trampoline signature fail
> [    0.000000] ------------[ cut here ]------------
> [    0.000000] kernel BUG at arch/x86/kernel/static_call.c:65!
>
> ---
> Subject: static_call,x86: Robustify trampoline patching
>
> Add a few signature bytes after the static call trampoline and verify
> those bytes match before patching the trampoline. This avoids patching
> random other JMPs (such as CFI jump-table entries) instead.
>
> These bytes decode as:
>
>    d:   53                      push   %rbx
>    e:   43 54                   rex.XB push %r12
>
> And happen to spell "SCT".
>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

I just realized that arm64 has the exact same problem, which is not
being addressed by my v5 of the static call support patch.

As it turns out, the v11 Clang that I have been testing with is broken
wrt BTI landing pads, and omits them from the jump table entries.
Clang 12+ adds them properly, which means that both the jump table
entry and the static call trampoline may start with BTI C + direct
branch, and we also need additional checks to disambiguate.
Peter Zijlstra Oct. 30, 2021, 6:02 p.m. UTC | #3
On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> I just realized that arm64 has the exact same problem, which is not
> being addressed by my v5 of the static call support patch.

Yeah, it would.

> As it turns out, the v11 Clang that I have been testing with is broken
> wrt BTI landing pads, and omits them from the jump table entries.
> Clang 12+ adds them properly, which means that both the jump table
> entry and the static call trampoline may start with BTI C + direct
> branch, and we also need additional checks to disambiguate.

I'm not sure, why would the static_call trampoline need a BTI C ? The
whole point of static_call() is to be a direct call, we should never
have an indirect call to the trampoline, that would defeat the whole
purpose.
Ard Biesheuvel Oct. 30, 2021, 6:55 p.m. UTC | #4
On Sat, 30 Oct 2021 at 20:03, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> > I just realized that arm64 has the exact same problem, which is not
> > being addressed by my v5 of the static call support patch.
>
> Yeah, it would.
>
> > As it turns out, the v11 Clang that I have been testing with is broken
> > wrt BTI landing pads, and omits them from the jump table entries.
> > Clang 12+ adds them properly, which means that both the jump table
> > entry and the static call trampoline may start with BTI C + direct
> > branch, and we also need additional checks to disambiguate.
>
> I'm not sure, why would the static_call trampoline need a BTI C ? The
> whole point of static_call() is to be a direct call, we should never
> have an indirect call to the trampoline, that would defeat the whole
> purpose.

This might happen when the distance between the caller and the
trampoline is more than 128 MB, in which case we emit a veneer that
uses an indirect call as well. So we definitely need the landing pad
in the trampoline.
Ard Biesheuvel Oct. 31, 2021, 4:24 p.m. UTC | #5
On Sat, 30 Oct 2021 at 20:55, Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Sat, 30 Oct 2021 at 20:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> > > I just realized that arm64 has the exact same problem, which is not
> > > being addressed by my v5 of the static call support patch.
> >
> > Yeah, it would.
> >
> > > As it turns out, the v11 Clang that I have been testing with is broken
> > > wrt BTI landing pads, and omits them from the jump table entries.
> > > Clang 12+ adds them properly, which means that both the jump table
> > > entry and the static call trampoline may start with BTI C + direct
> > > branch, and we also need additional checks to disambiguate.
> >
> > I'm not sure, why would the static_call trampoline need a BTI C ? The
> > whole point of static_call() is to be a direct call, we should never
> > have an indirect call to the trampoline, that would defeat the whole
> > purpose.
>
> This might happen when the distance between the caller and the
> trampoline is more than 128 MB, in which case we emit a veneer that
> uses an indirect call as well. So we definitely need the landing pad
> in the trampoline.

Something like the below seems to work to prevent getting the wrong
trampoline address into arch_static_call_transform:

diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index cbb67b6030f9..c3704ea21bee 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -25,7 +25,9 @@
        asm(".pushsection .static_call.text, \"ax\"             \n"     \
            ".align 4                                           \n"     \
            ".globl " STATIC_CALL_TRAMP_STR(name) "             \n"     \
+           ".globl " STATIC_CALL_TRAMP_P_STR(name) "           \n"     \
            STATIC_CALL_TRAMP_STR(name) ":                      \n"     \
+           STATIC_CALL_TRAMP_P_STR(name) ":                    \n"     \
            insns "                                             \n"     \
            ".type " STATIC_CALL_TRAMP_STR(name) ", @function   \n"     \
            ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
diff --git a/include/linux/static_call.h b/include/linux/static_call.h
index 19dc210214c0..46777a3395d3 100644
--- a/include/linux/static_call.h
+++ b/include/linux/static_call.h
@@ -143,7 +143,7 @@
  */
 extern void arch_static_call_transform(void *site, void *tramp, void *func, bool tail);
 
-#define STATIC_CALL_TRAMP_ADDR(name) &STATIC_CALL_TRAMP(name)
+#define STATIC_CALL_TRAMP_ADDR(name) &STATIC_CALL_TRAMP_P(name)
 
 #else
 #define STATIC_CALL_TRAMP_ADDR(name) NULL
diff --git a/include/linux/static_call_types.h b/include/linux/static_call_types.h
index 5a00b8b2cf9f..98a448f5ae45 100644
--- a/include/linux/static_call_types.h
+++ b/include/linux/static_call_types.h
@@ -18,6 +18,8 @@
 #define STATIC_CALL_TRAMP(name)        __PASTE(STATIC_CALL_TRAMP_PREFIX, name)
 #define STATIC_CALL_TRAMP_STR(name)    __stringify(STATIC_CALL_TRAMP(name))
 
+#define STATIC_CALL_TRAMP_P(name)      __PASTE(STATIC_CALL_TRAMP(name), _p)
+#define STATIC_CALL_TRAMP_P_STR(name)  __stringify(STATIC_CALL_TRAMP_P(name))
 /*
  * Flags in the low bits of static_call_site::key.
  */
@@ -36,7 +38,8 @@ struct static_call_site {
 
 #define DECLARE_STATIC_CALL(name, func)                                 \
        extern struct static_call_key STATIC_CALL_KEY(name);            \
-       extern typeof(func) STATIC_CALL_TRAMP(name);
+       extern typeof(func) STATIC_CALL_TRAMP(name);                    \
+       extern u8 STATIC_CALL_TRAMP_P(name);
 
 #ifdef CONFIG_HAVE_STATIC_CALL

That leaves the 'func' argument, which ideally should not go through
the jump table either, but at least it is not terminally broken there.
Peter Zijlstra Oct. 31, 2021, 4:39 p.m. UTC | #6
On Sun, Oct 31, 2021 at 05:24:13PM +0100, Ard Biesheuvel wrote:
> On Sat, 30 Oct 2021 at 20:55, Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Sat, 30 Oct 2021 at 20:03, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> > > > I just realized that arm64 has the exact same problem, which is not
> > > > being addressed by my v5 of the static call support patch.
> > >
> > > Yeah, it would.
> > >
> > > > As it turns out, the v11 Clang that I have been testing with is broken
> > > > wrt BTI landing pads, and omits them from the jump table entries.
> > > > Clang 12+ adds them properly, which means that both the jump table
> > > > entry and the static call trampoline may start with BTI C + direct
> > > > branch, and we also need additional checks to disambiguate.
> > >
> > > I'm not sure, why would the static_call trampoline need a BTI C ? The
> > > whole point of static_call() is to be a direct call, we should never
> > > have an indirect call to the trampoline, that would defeat the whole
> > > purpose.
> >
> > This might happen when the distance between the caller and the
> > trampoline is more than 128 MB, in which case we emit a veneer that
> > uses an indirect call as well. So we definitely need the landing pad
> > in the trampoline.
> 
> Something like the below seems to work to prevent getting the wrong
> trampoline address into arch_static_call_transform:

It is also a terribly gross hack. I really want the clang-cfi stuff to
improve, not add layers of hacks on top of it.
Ard Biesheuvel Oct. 31, 2021, 4:44 p.m. UTC | #7
On Sun, 31 Oct 2021 at 17:39, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 31, 2021 at 05:24:13PM +0100, Ard Biesheuvel wrote:
> > On Sat, 30 Oct 2021 at 20:55, Ard Biesheuvel <ardb@kernel.org> wrote:
> > >
> > > On Sat, 30 Oct 2021 at 20:03, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Sat, Oct 30, 2021 at 07:19:53PM +0200, Ard Biesheuvel wrote:
> > > > > I just realized that arm64 has the exact same problem, which is not
> > > > > being addressed by my v5 of the static call support patch.
> > > >
> > > > Yeah, it would.
> > > >
> > > > > As it turns out, the v11 Clang that I have been testing with is broken
> > > > > wrt BTI landing pads, and omits them from the jump table entries.
> > > > > Clang 12+ adds them properly, which means that both the jump table
> > > > > entry and the static call trampoline may start with BTI C + direct
> > > > > branch, and we also need additional checks to disambiguate.
> > > >
> > > > I'm not sure, why would the static_call trampoline need a BTI C ? The
> > > > whole point of static_call() is to be a direct call, we should never
> > > > have an indirect call to the trampoline, that would defeat the whole
> > > > purpose.
> > >
> > > This might happen when the distance between the caller and the
> > > trampoline is more than 128 MB, in which case we emit a veneer that
> > > uses an indirect call as well. So we definitely need the landing pad
> > > in the trampoline.
> >
> > Something like the below seems to work to prevent getting the wrong
> > trampoline address into arch_static_call_transform:
>
> It is also a terribly gross hack. I really want the clang-cfi stuff to
> improve, not add layers of hacks on top of it.

I'm just as annoyed as you are about the apparent need for this.
However, emitting an alias at build time is far better IMHO than
adding a magic byte sequence and having to check it at runtime.
Peter Zijlstra Oct. 31, 2021, 8:09 p.m. UTC | #8
On Sun, Oct 31, 2021 at 05:44:04PM +0100, Ard Biesheuvel wrote:

> > It is also a terribly gross hack. I really want the clang-cfi stuff to
> > improve, not add layers of hacks on top of it.
> 
> I'm just as annoyed as you are about the apparent need for this.
> However, emitting an alias at build time is far better IMHO than
> adding a magic byte sequence and having to check it at runtime.

Oh, I'm keeping that magic sequence :-) That's hardening in general, and
I don't ever want to debug a wrong poke like that again.

Adding an extra label fixes this thing, but there's still the other
cases where we need/want/desire a *real* function pointer.

I'm very close to saying that anything that mucks up function pointers
like this is a complete non-starter. Let's re-start this whole CFI
endeavour from the start.
Ard Biesheuvel Oct. 31, 2021, 8:21 p.m. UTC | #9
On Sun, 31 Oct 2021 at 21:11, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 31, 2021 at 05:44:04PM +0100, Ard Biesheuvel wrote:
>
> > > It is also a terribly gross hack. I really want the clang-cfi stuff to
> > > improve, not add layers of hacks on top of it.
> >
> > I'm just as annoyed as you are about the apparent need for this.
> > However, emitting an alias at build time is far better IMHO than
> > adding a magic byte sequence and having to check it at runtime.
>
> Oh, I'm keeping that magic sequence :-) That's hardening in general, and
> I don't ever want to debug a wrong poke like that again.
>
> Adding an extra label fixes this thing, but there's still the other
> cases where we need/want/desire a *real* function pointer.
>
> I'm very close to saying that anything that mucks up function pointers
> like this is a complete non-starter. Let's re-start this whole CFI
> endeavour from the start.

Well, CFI is already in mainline for arm64, whereas static call
support is not. So we have to deal with it one way or the other.

So for the static call targets, I agree that we want to support any
expression that produces a function pointer, but that part is not
actually broken, it is just sub-optimal iff you are using Clang CFI.
For taking the address of the trampoline, I think the solutions we
have are sufficient (although I am not inclined to add the magic sig
to arm64 if the label is sufficient).

That means we can support static calls on arm64 now without breaking
Clang CFI, and work on a solution for the redundant jumps on a more
relaxed schedule.
Peter Zijlstra Oct. 31, 2021, 8:44 p.m. UTC | #10
On Sun, Oct 31, 2021 at 09:21:56PM +0100, Ard Biesheuvel wrote:

> That means we can support static calls on arm64 now without breaking
> Clang CFI, and work on a solution for the redundant jumps on a more
> relaxed schedule.

Yes, arm64 has a 'problem' with having already merged the clang-cfi
stuff :/

I'm hoping the x86 solution can be an alternative CFI scheme, I'm
starting to really hate this one. And I'm not at all convinced the
proposed scheme is the best possible scheme given the constraints of
kernel code. AFAICT it's a compromise made in userspace.
Ard Biesheuvel Oct. 31, 2021, 11:36 p.m. UTC | #11
On Sun, 31 Oct 2021 at 21:45, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Sun, Oct 31, 2021 at 09:21:56PM +0100, Ard Biesheuvel wrote:
>
> > That means we can support static calls on arm64 now without breaking
> > Clang CFI, and work on a solution for the redundant jumps on a more
> > relaxed schedule.
>
> Yes, arm64 has a 'problem' with having already merged the clang-cfi
> stuff :/
>
> I'm hoping the x86 solution can be an alternative CFI scheme, I'm
> starting to really hate this one. And I'm not at all convinced the
> proposed scheme is the best possible scheme given the constraints of
> kernel code. AFAICT it's a compromise made in userspace.

Your scheme only works with IBT: the value of %r11 is under the
adversary's control so it could just point it at 'foo+0x10' if it
wants to call foo indirectly, and circumvent the check. So without IBT
(or BTI), I think the check fundamentally belongs in the caller, not
in the callee.
Peter Zijlstra Nov. 1, 2021, 9:01 a.m. UTC | #12
On Mon, Nov 01, 2021 at 12:36:18AM +0100, Ard Biesheuvel wrote:
> On Sun, 31 Oct 2021 at 21:45, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Sun, Oct 31, 2021 at 09:21:56PM +0100, Ard Biesheuvel wrote:
> >
> > > That means we can support static calls on arm64 now without breaking
> > > Clang CFI, and work on a solution for the redundant jumps on a more
> > > relaxed schedule.
> >
> > Yes, arm64 has a 'problem' with having already merged the clang-cfi
> > stuff :/
> >
> > I'm hoping the x86 solution can be an alternative CFI scheme, I'm
> > starting to really hate this one. And I'm not at all convinced the
> > proposed scheme is the best possible scheme given the constraints of
> > kernel code. AFAICT it's a compromise made in userspace.
> 
> Your scheme only works with IBT: the value of %r11 is under the
> adversary's control so it could just point it at 'foo+0x10' if it
> wants to call foo indirectly, and circumvent the check. So without IBT
> (or BTI), I think the check fundamentally belongs in the caller, not
> in the callee.

How is that not true for the jump table approach? Like I showed earlier,
it is *trivial* to reconstruct the actual function pointer from a
jump-table entry pointer.

In any case, I really want the discussion to start at square one, and
show/explain why any chosen CFI scheme is actually good for the kernel.
Just because clang happened to have implemented it, doesn't make it the
most suitable scheme for the kernel.
David Laight Nov. 1, 2021, 9:36 a.m. UTC | #13
From: Peter Zijlstra
> Sent: 01 November 2021 09:02
..
> In any case, I really want the discussion to start at square one, and
> show/explain why any chosen CFI scheme is actually good for the kernel.
> Just because clang happened to have implemented it, doesn't make it the
> most suitable scheme for the kernel.

How much overhead does it add to write("/dev/null", "", 1) ?
You've two large jump tables.
One for the syscall entry (all the syscalls have the
same prototype), and a second for selecting the correct
device driver's 'write' entry point.

You really don't want to be doing any kind of search.

Hardware that supported a (say) 16-bit constant in both the
'landing pad' and call indirect instruction and trapped if
they differed would be useful - but I doubt any hardware
that checks landing pads is anywhere near that useful.

	David.

Ard Biesheuvel Nov. 1, 2021, 2:14 p.m. UTC | #14
On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Nov 01, 2021 at 12:36:18AM +0100, Ard Biesheuvel wrote:
> > On Sun, 31 Oct 2021 at 21:45, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Sun, Oct 31, 2021 at 09:21:56PM +0100, Ard Biesheuvel wrote:
> > >
> > > > That means we can support static calls on arm64 now without breaking
> > > > Clang CFI, and work on a solution for the redundant jumps on a more
> > > > relaxed schedule.
> > >
> > > Yes, arm64 has a 'problem' with having already merged the clang-cfi
> > > stuff :/
> > >
> > > I'm hoping the x86 solution can be an alternative CFI scheme, I'm
> > > starting to really hate this one. And I'm not at all convinced the
> > > proposed scheme is the best possible scheme given the constraints of
> > > kernel code. AFAICT it's a compromise made in userspace.
> >
> > Your scheme only works with IBT: the value of %r11 is under the
> > adversary's control so it could just point it at 'foo+0x10' if it
> > wants to call foo indirectly, and circumvent the check. So without IBT
> > (or BTI), I think the check fundamentally belongs in the caller, not
> > in the callee.
>
> How is that not true for the jump table approach? Like I showed earlier,
> it is *trivial* to reconstruct the actual function pointer from a
> jump-table entry pointer.
>

That is not the point. The point is that Clang instruments every
indirect call that it emits, to check whether the type of the jump
table entry it is about to call matches the type of the caller. IOW,
the indirect calls can only branch into jump tables, and all jump
table entries in a table each branch to the start of some function of
the same type.

So the only thing you could achieve by adding or subtracting a
constant value from the indirect call address is either calling
another function of the same type (if you are hitting another entry in
the same table), or failing the CFI type check.

Instrumenting the callee only needs something like BTI, and a
consistent use of the landing pads to ensure that you cannot trivially
omit the check by landing right after it.

> In any case, I really want the discussion to start at square one, and
> show/explain why any chosen CFI scheme is actually good for the kernel.
> Just because clang happened to have implemented it, doesn't make it the
> most suitable scheme for the kernel.

Not disagreeing with that for x86, but again, we're already past that for arm64.
Peter Zijlstra Nov. 2, 2021, 12:57 p.m. UTC | #15
On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote:
> On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:

> > How is that not true for the jump table approach? Like I showed earlier,
> > it is *trivial* to reconstruct the actual function pointer from a
> > jump-table entry pointer.
> >
> 
> That is not the point. The point is that Clang instruments every
> indirect call that it emits, to check whether the type of the jump
> table entry it is about to call matches the type of the caller. IOW,
> the indirect calls can only branch into jump tables, and all jump
> table entries in a table each branch to the start of some function of
> the same type.
> 
> So the only thing you could achieve by adding or subtracting a
> constant value from the indirect call address is either calling
> another function of the same type (if you are hitting another entry in
> the same table), or failing the CFI type check.

Ah, I see, so the call-site needs to have a branch around the indirect
call instruction.

> Instrumenting the callee only needs something like BTI, and a
> consistent use of the landing pads to ensure that you cannot trivially
> omit the check by landing right after it.

That does bring up another point tho; how are we going to do a kernel
that's optimal for both software CFI and hardware aided CFI?

All questions that need answering I think.


So how insane is something like this, have each function:

foo.cfi:
	endbr64
	xorl $0xdeadbeef, %r10d
	jz foo
	ud2
	nop	# make it 16 bytes
foo:
	# actual function text goes here


And for each hash have two thunks:


	# arg: r11
	# clobbers: r10, r11
__x86_indirect_cfi_deadbeef:
	movl -9(%r11), %r10		# immediate in foo.cfi
	xorl $0xdeadbeef, %r10		# our immediate
	jz 1f
	ud2
1:	ALTERNATIVE_2	"jmp *%r11",
			"jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
			"lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD



	# arg: r11
	# clobbers: r10, r11
__x86_indirect_ibt_deadbeef:
	movl $0xdeadbeef, %r10
	subq $0x10, %r11
	ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
	jmp *%r11



And have the actual indirect callsite look like:

	# r11 - &foo
	ALTERNATIVE_2	"cs call __x86_indirect_thunk_r11",
			"cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
			"cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT

Although if the compiler were to emit:

	cs call __x86_indirect_cfi_deadbeef

we could probably fix it up from there.


Then we can at runtime decide between:

  {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}
Peter Zijlstra Nov. 2, 2021, 3:15 p.m. UTC | #16
On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:

> So how insane is something like this, have each function:
> 
> foo.cfi:
> 	endbr64
> 	xorl $0xdeadbeef, %r10d
> 	jz foo
> 	ud2
> 	nop	# make it 16 bytes
> foo:
> 	# actual function text goes here
> 
> 
> And for each hash have two thunks:
> 
> 
> 	# arg: r11
> 	# clobbers: r10, r11
> __x86_indirect_cfi_deadbeef:
> 	movl -9(%r11), %r10		# immediate in foo.cfi
> 	xorl $0xdeadbeef, %r10		# our immediate
> 	jz 1f
> 	ud2
> 1:	ALTERNATIVE_2	"jmp *%r11",
> 			"jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> 			"lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
> 
> 
> 
> 	# arg: r11
> 	# clobbers: r10, r11
> __x86_indirect_ibt_deadbeef:
> 	movl $0xdeadbeef, %r10
> 	subq $0x10, %r11
> 	ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
> 	jmp *%r11
> 

These two thunks could of course be one big alternative.

> And have the actual indirect callsite look like:
> 
> 	# r11 - &foo
> 	ALTERNATIVE_2	"cs call __x86_indirect_thunk_r11",
> 			"cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
> 			"cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT

Also simplifying this.

> Although if the compiler were to emit:
> 
> 	cs call __x86_indirect_cfi_deadbeef
> 
> we could probably fix it up from there.
> 
> 
> Then we can at runtime decide between:
> 
>   {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}
>
Kees Cook Nov. 2, 2021, 5:35 p.m. UTC | #17
On Sat, Oct 30, 2021 at 10:16:31AM +0200, Peter Zijlstra wrote:
> foo.cfi:
> 	endbr
> 	xorl $0xdeadbeef, %r10d
> 	jz foo
> 	ud2
> 	nop	# make it an even 16 bytes
> foo:
> 	# actual function text
> 
> 
> Then have the address of foo, be the address of foo, like any normal
> sane person would expect. Have direct calls to foo, go to foo, again, as
> expected.
> 
> When doing an indirect call (to r11, as clang does), then, and only
> then, do:
> 
> 	movl $0xdeadbeef, %r10d
> 	subq $0x10, %r11
> 	call *%r11
> 
> 	# if the r11 lives, add:
> 	addq $0x10, %r11
> 
> 
> Then only when caller and callee agree 0xdeadbeef is the password, does
> the indirect call go through.
> 
> Why isn't this a suitable CFI scheme even without IBT?

The trouble is that the callee is doing the verification. There's no
protection against calling into a callee that doesn't perform a check
(e.g. BPF JIT, or otherwise constructed executable memory, etc). The
caller needs to do the verification that what they're calling into is
safe before it makes the call.
Ard Biesheuvel Nov. 2, 2021, 5:44 p.m. UTC | #18
On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
>
> > So how insane is something like this, have each function:
> >
> > foo.cfi:
> >       endbr64
> >       xorl $0xdeadbeef, %r10d
> >       jz foo
> >       ud2
> >       nop     # make it 16 bytes
> > foo:
> >       # actual function text goes here
> >
> >
> > And for each hash have two thunks:
> >
> >
> >       # arg: r11
> >       # clobbers: r10, r11
> > __x86_indirect_cfi_deadbeef:
> >       movl -9(%r11), %r10             # immediate in foo.cfi
> >       xorl $0xdeadbeef, %r10          # our immediate
> >       jz 1f
> >       ud2
> > 1:    ALTERNATIVE_2   "jmp *%r11",
> >                       "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> >                       "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
> >

So are these supposed to go into the jump tables? If so, there still
needs to be a check against the boundary of the table at the call
site, to ensure that we are not calling something that we shouldn't.

If they are not going into the jump tables, I don't see the point of
having them, as only happy-flow/uncompromised code would bother to use
them.


> >
> >
> >       # arg: r11
> >       # clobbers: r10, r11
> > __x86_indirect_ibt_deadbeef:
> >       movl $0xdeadbeef, %r10
> >       subq $0x10, %r11
> >       ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
> >       jmp *%r11
> >
>
> These two thunks could of course be one big alternative.
>
> > And have the actual indirect callsite look like:
> >
> >       # r11 - &foo
> >       ALTERNATIVE_2   "cs call __x86_indirect_thunk_r11",
> >                       "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
> >                       "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT
>
> Also simplifying this.
>
> > Although if the compiler were to emit:
> >
> >       cs call __x86_indirect_cfi_deadbeef
> >
> > we could probably fix it up from there.
> >
> >
> > Then we can at runtime decide between:
> >
> >   {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}
> >
Kees Cook Nov. 2, 2021, 6:10 p.m. UTC | #19
On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote:
> > On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > How is that not true for the jump table approach? Like I showed earlier,
> > > it is *trivial* to reconstruct the actual function pointer from a
> > > jump-table entry pointer.
> > >
> > 
> > That is not the point. The point is that Clang instruments every
> > indirect call that it emits, to check whether the type of the jump
> > table entry it is about to call matches the type of the caller. IOW,
> > the indirect calls can only branch into jump tables, and all jump
> > table entries in a table each branch to the start of some function of
> > the same type.
> > 
> > So the only thing you could achieve by adding or subtracting a
> > constant value from the indirect call address is either calling
> > another function of the same type (if you are hitting another entry in
> > the same table), or failing the CFI type check.
> 
> Ah, I see, so the call-site needs to have a branch around the indirect
> call instruction.
> 
> > Instrumenting the callee only needs something like BTI, and a
> > consistent use of the landing pads to ensure that you cannot trivially
> > omit the check by landing right after it.
> 
> That does bring up another point tho; how are we going to do a kernel
> that's optimal for both software CFI and hardware aided CFI?
> 
> All questions that need answering I think.

I'm totally fine with designing a new CFI for a future option,
but blocking the existing (working) one does not best serve our end
users. There are already people waiting on x86 CFI because having the
extra layer of defense is valuable for them. No, it's not perfect,
but it works right now, and evidence from Android shows that it has
significant real-world defensive value. Some of the more adventurous
are actually patching their kernels with the CFI support already, and
happily running their workloads, etc.

Supporting Clang CFI means we actually have something to evolve
from, whereas starting completely over means (likely significant)
delays, leaving folks without the option available at all. I think the
various compiler and kernel tweaks needed to improve kernel support
are reasonable, but building a totally new CFI implementation is not:
it _does_ work today on x86. Yes, it's weird, but not outrageously so.
(And just to state the obvious, CFI is an _optional_ CONFIG: not
everyone wants CFI, so it's okay if there are some sharp edges under
some CONFIG combinations.)

Regardless, speaking to a new CFI design below:

> So how insane is something like this, have each function:
> 
> foo.cfi:
> 	endbr64
> 	xorl $0xdeadbeef, %r10d
> 	jz foo
> 	ud2
> 	nop	# make it 16 bytes
> foo:
> 	# actual function text goes here
> 
> 
> And for each hash have two thunks:
> 
> 
> 	# arg: r11
> 	# clobbers: r10, r11
> __x86_indirect_cfi_deadbeef:
> 	movl -9(%r11), %r10		# immediate in foo.cfi

This requires the text be readable. I have been hoping to avoid this for
a CFI implementation so we could gain the benefit of execute-only
memory (available soon on arm64, and possible today on x86 under a
hypervisor).

But, yes, FWIW, this is very similar to what PaX RAP CFI does: the
caller reads "$dest - offset" for a hash, and compares it against the
hard-coded hash at the call site, before "call $dest".

> 	xorl $0xdeadbeef, %r10		# our immediate
> 	jz 1f
> 	ud2
> 1:	ALTERNATIVE_2	"jmp *%r11",
> 			"jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> 			"lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
> 
> 
> 
> 	# arg: r11
> 	# clobbers: r10, r11
> __x86_indirect_ibt_deadbeef:
> 	movl $0xdeadbeef, %r10
> 	subq $0x10, %r11
> 	ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
> 	jmp *%r11
> 
> 
> 
> And have the actual indirect callsite look like:
> 
> 	# r11 - &foo
> 	ALTERNATIVE_2	"cs call __x86_indirect_thunk_r11",
> 			"cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
> 			"cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT
> 
> Although if the compiler were to emit:
> 
> 	cs call __x86_indirect_cfi_deadbeef
> 
> we could probably fix it up from there.

It seems like this could work for any CFI implementation, though, if
the CFI implementation always performed a call, or if the bounds of the
inline checking were known? i.e. objtool could also find the inline
checks just as well as the call, though?

> Then we can at runtime decide between:
> 
>   {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}

This does look pretty powerful, but I still don't think it precludes
using the existing Clang CFI. I don't want perfect to be the enemy of
good. :)
Peter Zijlstra Nov. 2, 2021, 6:14 p.m. UTC | #20
On Tue, Nov 02, 2021 at 06:44:56PM +0100, Ard Biesheuvel wrote:
> On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
> >
> > > So how insane is something like this, have each function:
> > >
> > > foo.cfi:
> > >       endbr64
> > >       xorl $0xdeadbeef, %r10d
> > >       jz foo
> > >       ud2
> > >       nop     # make it 16 bytes
> > > foo:
> > >       # actual function text goes here
> > >
> > >
> > > And for each hash have two thunks:
> > >
> > >
> > >       # arg: r11
> > >       # clobbers: r10, r11
> > > __x86_indirect_cfi_deadbeef:
> > >       movl -9(%r11), %r10             # immediate in foo.cfi
> > >       xorl $0xdeadbeef, %r10          # our immediate
> > >       jz 1f
> > >       ud2
> > > 1:    ALTERNATIVE_2   "jmp *%r11",
> > >                       "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> > >                       "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
> > >
> 
> So are these supposed to go into the jump tables? If so, there still
> needs to be a check against the boundary of the table at the call
> site, to ensure that we are not calling something that we shouldn't.
> 
> If they are not going into the jump tables, I don't see the point of
> having them, as only happy flow/uncompromised code would bother to use
> them.

I don't understand. If you can scribble your own code, you can
circumvent pretty much any range check anyway. But if you can't scribble
your own code, you get to use the branch here and that checks the
callsite and callee signature.

The range check isn't fundamental to CFI, having a check is the
important thing AFAIU.
Peter Zijlstra Nov. 2, 2021, 6:15 p.m. UTC | #21
On Tue, Nov 02, 2021 at 10:35:30AM -0700, Kees Cook wrote:
> On Sat, Oct 30, 2021 at 10:16:31AM +0200, Peter Zijlstra wrote:
> > foo.cfi:
> > 	endbr
> > 	xorl $0xdeadbeef, %r10d
> > 	jz foo
> > 	ud2
> > 	nop	# make it an even 16 bytes
> > foo:
> > 	# actual function text
> > 
> > 
> > Then have the address of foo, be the address of foo, like any normal
> > sane person would expect. Have direct calls to foo, go to foo, again, as
> > expected.
> > 
> > When doing an indirect call (to r11, as clang does), then, and only
> > then, do:
> > 
> > 	movl $0xdeadbeef, %r10d
> > 	subq $0x10, %r11
> > 	call *%r11
> > 
> > 	# if the r11 lives, add:
> > 	addq $0x10, %r11
> > 
> > 
> > Then only when caller and callee agree 0xdeadbeef is the password, does
> > the indirect call go through.
> > 
> > Why isn't this a suitable CFI scheme even without IBT?
> 
> The trouble is that the callee is doing the verification. There's no
> protection against calling into a callee that doesn't perform a check
> (e.g. BPF JIT, or otherwise constructed executable memory, etc). The
> caller needs to do the verification that what they're calling into is
> safe before it makes the call.

Right, Ard said the same, see new crackpot scheme here:

  https://lkml.kernel.org/r/YYE1yPClPMHvyvIt@hirez.programming.kicks-ass.net
Peter Zijlstra Nov. 2, 2021, 6:17 p.m. UTC | #22
On Tue, Nov 02, 2021 at 07:14:25PM +0100, Peter Zijlstra wrote:
> On Tue, Nov 02, 2021 at 06:44:56PM +0100, Ard Biesheuvel wrote:
> > On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
> > >
> > > > So how insane is something like this, have each function:
> > > >
> > > > foo.cfi:
> > > >       endbr64
> > > >       xorl $0xdeadbeef, %r10d
> > > >       jz foo
> > > >       ud2
> > > >       nop     # make it 16 bytes
> > > > foo:
> > > >       # actual function text goes here
> > > >
> > > >
> > > > And for each hash have two thunks:
> > > >
> > > >
> > > >       # arg: r11
> > > >       # clobbers: r10, r11
> > > > __x86_indirect_cfi_deadbeef:
> > > >       movl -9(%r11), %r10             # immediate in foo.cfi
> > > >       xorl $0xdeadbeef, %r10          # our immediate
> > > >       jz 1f
> > > >       ud2
> > > > 1:    ALTERNATIVE_2   "jmp *%r11",
> > > >                       "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> > > >                       "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
> > > >
> > 
> > So are these supposed to go into the jump tables? If so, there still
> > needs to be a check against the boundary of the table at the call
> > site, to ensure that we are not calling something that we shouldn't.
> > 
> > If they are not going into the jump tables, I don't see the point of
> > having them, as only happy flow/uncompromised code would bother to use
> > them.
> 
> I don't understand. If you can scribble your own code, you can
> circumvent pretty much any range check anyway. But if you can't scribble
> your own code, you get to use the branch here and that checks the
> callsite and callee signature.
> 
> The range check isn't fundamental to CFI, having a check is the
> important thing AFAIU.

That is, how is a jump-table/range-check better than a hash-value match
check?
Ard Biesheuvel Nov. 2, 2021, 6:18 p.m. UTC | #23
On Tue, 2 Nov 2021 at 19:14, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Nov 02, 2021 at 06:44:56PM +0100, Ard Biesheuvel wrote:
> > On Tue, 2 Nov 2021 at 16:15, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
> > >
> > > > So how insane is something like this, have each function:
> > > >
> > > > foo.cfi:
> > > >       endbr64
> > > >       xorl $0xdeadbeef, %r10d
> > > >       jz foo
> > > >       ud2
> > > >       nop     # make it 16 bytes
> > > > foo:
> > > >       # actual function text goes here
> > > >
> > > >
> > > > And for each hash have two thunks:
> > > >
> > > >
> > > >       # arg: r11
> > > >       # clobbers: r10, r11
> > > > __x86_indirect_cfi_deadbeef:
> > > >       movl -9(%r11), %r10             # immediate in foo.cfi
> > > >       xorl $0xdeadbeef, %r10          # our immediate
> > > >       jz 1f
> > > >       ud2
> > > > 1:    ALTERNATIVE_2   "jmp *%r11",
> > > >                       "jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> > > >                       "lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
> > > >
> >
> > So are these supposed to go into the jump tables? If so, there still
> > needs to be a check against the boundary of the table at the call
> > site, to ensure that we are not calling something that we shouldn't.
> >
> > If they are not going into the jump tables, I don't see the point of
> > having them, as only happy flow/uncompromised code would bother to use
> > them.
>
> I don't understand. If you can scribble your own code, you can
> circumvent pretty much any range check anyway.

A function pointer is data, not code.

> But if you can't scribble
> your own code, you get to use the branch here and that checks the
> callsite and callee signature.
>

OK, so the call site has a direct branch to this trampoline then? That
wasn't clear to me.

> The range check isn't fundamental to CFI, having a check is the
> important thing AFAIU.

Agreed. If the call site has a direct branch, it doesn't need the range check.
Andy Lutomirski Nov. 2, 2021, 9:02 p.m. UTC | #24
On Tue, Nov 2, 2021, at 11:10 AM, Kees Cook wrote:
> On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
>> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote:
>> > On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:
>> 
>> > > How is that not true for the jump table approach? Like I showed earlier,
>> > > it is *trivial* to reconstruct the actual function pointer from a
>> > > jump-table entry pointer.
>> > >
>> > 
>> > That is not the point. The point is that Clang instruments every
>> > indirect call that it emits, to check whether the type of the jump
>> > table entry it is about to call matches the type of the caller. IOW,
>> > the indirect calls can only branch into jump tables, and all jump
>> > table entries in a table each branch to the start of some function of
>> > the same type.
>> > 
>> > So the only thing you could achieve by adding or subtracting a
>> > constant value from the indirect call address is either calling
>> > another function of the same type (if you are hitting another entry in
>> > the same table), or failing the CFI type check.
>> 
>> Ah, I see, so the call-site needs to have a branch around the indirect
>> call instruction.
>> 
>> > Instrumenting the callee only needs something like BTI, and a
>> > consistent use of the landing pads to ensure that you cannot trivially
>> > omit the check by landing right after it.
>> 
>> That does bring up another point tho; how are we going to do a kernel
>> that's optimal for both software CFI and hardware aided CFI?
>> 
>> All questions that need answering I think.
>
> I'm totally fine with designing a new CFI for a future option,
> but blocking the existing (working) one does not best serve our end
> users. 

I like security, but I also like building working systems, and I think I disagree with you. There are a whole bunch of CFI schemes out there, with varying hardware requirements, and they provide varying degrees of fine grained protection and varying degrees of protection against improper speculation.  We do not want to merge clang CFI just because it’s “ready” and end up with a mess that makes it harder to support other schemes in the kernel.

So, yes, a good CFI scheme needs caller-side protection, especially if IBT isn’t in use.  But a good CFI scheme also needs to interoperate with the rest of the kernel, and this whole “canonical” and symbol-based lookup and static_call thing is nonsense.  I think we need a better implementation, whether it uses intrinsics or little C helpers or whatever.

I’m not saying this needs to be incompatible with current clang releases, but I do think we need a clear story for how operations like static call patching are supposed to work.

FYI, Ard, many years ago we merged kernel support for the original gcc stack protector. We have since *removed* it on x86_32 in favor of a nicer implementation that requires a newer toolchain.
Peter Zijlstra Nov. 2, 2021, 9:19 p.m. UTC | #25
On Tue, Nov 02, 2021 at 11:10:10AM -0700, Kees Cook wrote:
> On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:

> > All questions that need answering I think.
> 
> I'm totally fine with designing a new CFI for a future option,
> but blocking the existing (working) one does not best serve our end
> users.

Accepting a half-arsed CFI now, just because it is what we have, will
only make it ever so much more difficult to get new/better
things implemented in the toolchains, because the pressure is off.

Also, it will make enabling IBT more difficult, or we'll end up having
CFI and IBT mutually exclusive, which is a crap position to be in.

> There are already people waiting on x86 CFI because having the
> extra layer of defense is valuable for them. No, it's not perfect,
> but it works right now, and evidence from Android shows that it has
> significant real-world defensive value. Some of the more adventurous
> are actually patching their kernels with the CFI support already, and
> happily running their workloads, etc.

It works by accident, not design. Which is a giant red flag in my book.

> Supporting Clang CFI means we actually have something to evolve
> from,

I don't want to evolve from a place that's crazy and we shouldn't have
been in to begin with. Redefining what a function pointer is is insane,
and the required work-arounds are ugly at best.

The ARM64 folks have expressed regret from having merged it. Why should
x86 do something that's known to cause regret?

> where as starting completely over means (likely significant)
> delays leaving folks without the option available at all.

See above. Maybe some of those folks will get motivated to make it
happen faster.

> I think the various compiler and kernel tweaks needed to improve
> kernel support are reasonable, but building a totally new CFI
> implementation is not: it _does_ work today on x86. 

By sodding accident; see that one static call patch that makes it burn.

If someone were to use the jump-table entry after we'd scribbled it,
things will go sideways, badly.

> Yes, it's weird, but not outrageously so.

Clearly we disagree.

> Regardless, speaking to a new CFI design below:
> 
> > So how insane is something like this, have each function:
> > 
> > foo.cfi:
> > 	endbr64
> > 	xorl $0xdeadbeef, %r10d
> > 	jz foo
> > 	ud2
> > 	nop	# make it 16 bytes
> > foo:
> > 	# actual function text goes here
> > 
> > 
> > And for each hash have two thunks:
> > 
> > 
> > 	# arg: r11
> > 	# clobbers: r10, r11
> > __x86_indirect_cfi_deadbeef:
> > 	movl -9(%r11), %r10		# immediate in foo.cfi
> 
> This requires the text be readable. I have been hoping to avoid this for
> a CFI implementation so we could gain the benefit of execute-only
> memory (available soon on arm64, and possible today on x86 under a
> hypervisor).

It's only needed for the 'legacy' case of software-only CFI if that
makes you feel better. BTI/IBT based CFI doesn't need this.

(also, I'm very much a bare metal kinda guy)

> > 	xorl $0xdeadbeef, %r10		# our immediate
> > 	jz 1f
> > 	ud2
> > 1:	ALTERNATIVE_2	"jmp *%r11",
> > 			"jmp __x86_indirect_thunk_r11", X86_FEATURE_RETPOLINE
> > 			"lfence; jmp *%r11", X86_FEATURE_RETPOLINE_AMD
> > 
> > 
> > 
> > 	# arg: r11
> > 	# clobbers: r10, r11
> > __x86_indirect_ibt_deadbeef:
> > 	movl $0xdeadbeef, %r10
> > 	subq $0x10, %r11
> > 	ALTERNATIVE "", "lfence", X86_FEATURE_RETPOLINE
> > 	jmp *%r11
> > 

IBT case ^ doesn't actually read from the code. It simply jumps in front
of the real function by a set amount to get the extra cruft /
landing-pad required for indirect calls.

Also note that direct calls never make use of that pad, which means it
can be stripped from all symbols that never have their address taken,
which means fewer indirect targets.

(Or poison the thing with UD2 and write 0 in the immediate)

> > And have the actual indirect callsite look like:
> > 
> > 	# r11 - &foo
> > 	ALTERNATIVE_2	"cs call __x86_indirect_thunk_r11",
> > 			"cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
> > 			"cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT
> > 
> > Although if the compiler were to emit:
> > 
> > 	cs call __x86_indirect_cfi_deadbeef
> > 
> > we could probably fix it up from there.
> 
> It seems like this could work for any CFI implementation, though, if

The strong suggestion is that function pointers are sane, and there are a
few other considerations, but yes.

> the CFI implementation always performed a call, or if the bounds of the
> inline checking were known? i.e. objtool could also find the inline
> checks just as well as the call, though?

Somewhat hard; it's much easier to find a single instruction than a
pattern. Also, footprint. Smaller really is better.

> > Then we can at runtime decide between:
> > 
> >   {!cfi, cfi, ibt} x {!retpoline, retpoline, retpoline-amd}
> 
> This does look pretty powerful, but I still don't think it precludes
> using the existing Clang CFI. I don't want perfect to be the enemy of
> good. :)

As stated above, we're in violent disagreement that the proposed scheme
comes anywhere near good. I utterly hate the redefinition of function
pointers and I also think the range compare goes sideways in the face of
modules. That's two marks against jump-tables.

(Also, in the !CFI case, you can't actually free the jump-tables, since
you're unconditionally s(t)uck with them because of how &func is
wrecked.)
Peter Zijlstra Nov. 2, 2021, 9:48 p.m. UTC | #26
On Tue, Nov 02, 2021 at 07:18:53PM +0100, Ard Biesheuvel wrote:

> > The range check isn't fundamental to CFI, having a check is the
> > important thing AFAIU.
> 
> Agreed. If the call site has a direct branch, it doesn't need the range check.

That, from the earlier email:

 | And have the actual indirect callsite look like:
 |
 |        # r11 - &foo
 |        ALTERNATIVE_2   "cs call __x86_indirect_thunk_r11",
 |                        "cs call __x86_indirect_cfi_deadbeef", X86_FEATURE_CFI
 |                        "cs call __x86_indirect_ibt_deadbeef", X86_FEATURE_IBT

So the callsite has a direct call to the hash-specific and cfi-type
specific thunk, which then does an (indirect) tail-call.

The CFI one does the hash check in the thunk and jumps to the function
proper, while the IBT one does it in the landing-pad.

The !CFI one ignores it all and simply does an indirect call (retpoline
aided or otherwise) to the function proper -- in which case we can free
all the thunks.
Kees Cook Nov. 2, 2021, 11:13 p.m. UTC | #27
On Tue, Nov 02, 2021 at 02:02:38PM -0700, Andy Lutomirski wrote:
> 
> 
> On Tue, Nov 2, 2021, at 11:10 AM, Kees Cook wrote:
> > On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
> >> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote:
> >> > On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:
> >> 
> >> > > How is that not true for the jump table approach? Like I showed earlier,
> >> > > it is *trivial* to reconstruct the actual function pointer from a
> >> > > jump-table entry pointer.
> >> > >
> >> > 
> >> > That is not the point. The point is that Clang instruments every
> >> > indirect call that it emits, to check whether the type of the jump
> >> > table entry it is about to call matches the type of the caller. IOW,
> >> > the indirect calls can only branch into jump tables, and all jump
> >> > table entries in a table each branch to the start of some function of
> >> > the same type.
> >> > 
> >> > So the only thing you could achieve by adding or subtracting a
> >> > constant value from the indirect call address is either calling
> >> > another function of the same type (if you are hitting another entry in
> >> > the same table), or failing the CFI type check.
> >> 
> >> Ah, I see, so the call-site needs to have a branch around the indirect
> >> call instruction.
> >> 
> >> > Instrumenting the callee only needs something like BTI, and a
> >> > consistent use of the landing pads to ensure that you cannot trivially
> >> > omit the check by landing right after it.
> >> 
> >> That does bring up another point tho; how are we going to do a kernel
> >> that's optimal for both software CFI and hardware aided CFI?
> >> 
> >> All questions that need answering I think.
> >
> > I'm totally fine with designing a new CFI for a future option,
> > but blocking the existing (working) one does not best serve our end
> > users. 
> 
> I like security, but I also like building working systems, and I think
> I disagree with you. There are a whole bunch of CFI schemes out there,
> with varying hardware requirements, and they provide varying degrees
> of fine grained protection and varying degrees of protection against
> improper speculation.  We do not want to merge clang CFI just because
> it’s “ready” and end up with a mess that makes it harder to support
> other schemes in the kernel.

Right, and I see the difficulties here. And speaking to Peter's
observation that CFI "accidentally" worked with static_calls, I don't
see it that way: it worked because it was designed to be as "invisible"
as possible. It's just that at a certain point of extreme binary output
control, it becomes an issue and I think that's going to be true for
*any* CFI system: they each will have different design characteristics.

One of the goals of the Clang CFI use in Linux was to make it as
minimally invasive as possible (and you can see this guiding Sami's
choices: e.g. he didn't go change all the opaque address uses to need a
"&" prefix added, etc). I think we're always going to have some
push/pull between the compiler's "general"ness and the kernel's
"specific"ness.

> So, yes, a good CFI scheme needs caller-side protection, especially if
> IBT isn’t in use.  But a good CFI scheme also needs to interoperate with
> the rest of the kernel, and this whole “canonical” and symbol-based
> lookup and static_call thing is nonsense.  I think we need a better
> implementation, whether it uses intrinsics or little C helpers or
> whatever.

I think we're very close already. Like I said, I think it's fine to nail
down some of these interoperability requirements; we've been doing it
all along. We got there with arm64, and it looks to me like we're almost
there on x86. There is this particular case with static_calls now, but I
don't think it's insurmountable.

> I’m not saying this needs to be incompatible with current clang
> releases, but I do think we need a clear story for how operations like
> static call patching are supposed to work.

Right -- and until very recently, it did all Just Work. I think we'll
always have these kinds of issues with whatever CFI implementations are
made available, and since we've already cleaned up so much of what's
needed to make the kernel work with *any* CFI, it makes sense to
continue with the one we've already got working for arm64.

> FYI, Ard, many years ago we merged kernel support for the original
> gcc stack protector. We have since *removed* it on x86_32 in favor of
> a nicer implementation that requires a newer toolchain.

Sure, and this is the kind of thing I mean: we had an awkward
implementation of a meaningful defense, and we improved on it. I think
it's important to be able to make these kinds of concessions to gain the
defensive features they provide. And yes, we can continue to improve it,
but in the meantime, we can stop entire classes of problems from
happening to our user base.
Andy Lutomirski Nov. 3, 2021, 12:20 a.m. UTC | #28
On 11/2/21 16:13, Kees Cook wrote:
> On Tue, Nov 02, 2021 at 02:02:38PM -0700, Andy Lutomirski wrote:
>>
>>
>> On Tue, Nov 2, 2021, at 11:10 AM, Kees Cook wrote:
>>> On Tue, Nov 02, 2021 at 01:57:44PM +0100, Peter Zijlstra wrote:
>>>> On Mon, Nov 01, 2021 at 03:14:41PM +0100, Ard Biesheuvel wrote:
>>>>> On Mon, 1 Nov 2021 at 10:05, Peter Zijlstra <peterz@infradead.org> wrote:
>>>>
>>>>>> How is that not true for the jump table approach? Like I showed earlier,
>>>>>> it is *trivial* to reconstruct the actual function pointer from a
>>>>>> jump-table entry pointer.
>>>>>>
>>>>>
>>>>> That is not the point. The point is that Clang instruments every
>>>>> indirect call that it emits, to check whether the type of the jump
>>>>> table entry it is about to call matches the type of the caller. IOW,
>>>>> the indirect calls can only branch into jump tables, and all jump
>>>>> table entries in a table each branch to the start of some function of
>>>>> the same type.
>>>>>
>>>>> So the only thing you could achieve by adding or subtracting a
>>>>> constant value from the indirect call address is either calling
>>>>> another function of the same type (if you are hitting another entry in
>>>>> the same table), or failing the CFI type check.
>>>>
>>>> Ah, I see, so the call-site needs to have a branch around the indirect
>>>> call instruction.
>>>>
>>>>> Instrumenting the callee only needs something like BTI, and a
>>>>> consistent use of the landing pads to ensure that you cannot trivially
>>>>> omit the check by landing right after it.
>>>>
>>>> That does bring up another point tho; how are we going to do a kernel
>>>> that's optimal for both software CFI and hardware aided CFI?
>>>>
>>>> All questions that need answering I think.
>>>
>>> I'm totally fine with designing a new CFI for a future option,
>>> but blocking the existing (working) one does not best serve our end
>>> users.
>>
>> I like security, but I also like building working systems, and I think
>> I disagree with you. There are a whole bunch of CFI schemes out there,
>> with varying hardware requirements, and they provide varying degrees
>> of fine grained protection and varying degrees of protection against
>> improper speculation.  We do not want to merge clang CFI just because
>> it’s “ready” and end up with a mess that makes it harder to support
>> other schemes in the kernel.
> 
> Right, and I see the difficulties here. And speaking to Peter's
> observation that CFI "accidentally" worked with static_calls, I don't
> see it that way: it worked because it was designed to be as "invisible"
> as possible. It's just that at a certain point of extreme binary output
> control, it becomes an issue and I think that's going to be true for
> *any* CFI system: they each will have different design characteristics.
> 
> One of the goals of the Clang CFI use in Linux was to make it as
> minimally invasive as possible (and you can see this guiding Sami's
> choices: e.g. he didn't go change all the opaque address uses to need a
> "&" prefix added, etc). I think we're always going to have some
> push/pull between the compiler's "general"ness and the kernel's
> "specific"ness.

The "&" thing was the wrong way around.  That part of CFI was Linux 
being sloppy, and the & part is a straight-up improvement.  Improvements 
can be invasive.  The problem with invasive code is when it invades 
places it doesn't belong.

> 
>> So, yes, a good CFI scheme needs caller-side protection, especially if
>> IBT isn’t in use.  But a good CFI scheme also needs to interoperate with
>> the rest of the kernel, and this whole “canonical” and symbol-based
>> lookup and static_call thing is nonsense.  I think we need a better
>> implementation, whether it uses intrinsics or little C helpers or
>> whatever.
> 
> I think we're very close already. Like I said, I think it's fine to nail
> down some of these interoperability requirements; we've been doing it
> all along. We got there with arm64, and it looks to me like we're almost
> there on x86. There is this particular case with static_calls now, but I
> don't think it's insurmountable.

So fundamentally, I think there's a sort of type system issue here, and 
it's more or less the same issue with IBT and with various fine-grained 
refinements to IBT.  Using syntax more or less like what's been floating 
around this thread, the asm works like this:

[possible magic here depending on the precise scheme]
func.indirect:
  ENDBR (if it's an IBT build)
  [magic here?]

  [in a jump table scheme, there's a jmp here and possibly a large or 
negative gap.]

func:
  actual body

The point being that there's the actual function body (where one should 
*direct* jump to) and there's the address that goes in a C function 
pointer.  Indirect calls/jumps go to (or pretend to go to, depending on 
the scheme) func.indirect.

While one can plausibly kludge up the static call patching (and who 
knows what else -- eBPF, kprobes, mcount stuff, etc) using symbol-based 
hackery, I think we actually just want a way to convert from a function 
pointer to a function address.  It doesn't need to be fast, but it needs 
to work.  And I think this can probably be mostly done in plain C based 
on clang's implementation.  Ideally it would have a typecheck:

unsigned long __cfi_decode_funcpointer(funcptr type);

but that's not happening in plain C because I don't think there's a way 
to get the magic metadata.  But I bet we could do:

unsigned long __cfi_decode_funcpointer(funcptr val, funcptr ref)
{
   BUG_ON(ref and val aren't the same type);
   read insn bytes at val;
   return the jump target;
}

If this won't work with current clang, let's ask to get whatever basic 
functionality is needed to enable it.
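To make the idea concrete, here is a minimal user-space sketch of such a helper, assuming the jump-table entry it is handed is a single jmp.d32 (opcode 0xE9 followed by a rel32).  The name and signature are made up for illustration; this is not an existing kernel or clang interface:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Recover the real jump target from a CFI jump-table entry,
 * assuming the entry is exactly one jmp.d32: E9 <rel32>. */
static unsigned long cfi_decode_funcpointer(unsigned long val)
{
	const uint8_t *insn = (const uint8_t *)val;
	int32_t rel;

	assert(insn[0] == 0xE9);		/* jmp.d32 opcode */
	memcpy(&rel, insn + 1, sizeof(rel));	/* little-endian rel32 */
	return val + 5 + rel;			/* target = end of insn + rel32 */
}
```

The rel32 is relative to the end of the 5-byte instruction, hence the `val + 5`; a negative rel32 (a backward jump) falls out of the signed `int32_t` for free.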

<rambling>

Sadly, in clang, this is wrapped up in the incomprehensible "canonical" 
stuff.  I think that's a big mistake -- any sane ENDBR-using scheme 
would really prefer that ENDBR to be right next to the actual function 
body, and really any scheme would benefit due to better cache locality. 
But, more importantly, IMO any sane ENDBR-using scheme wants to generate 
the indirect stub as part of code gen for the actual function.  In an 
IBT build, this really doesn't deserve to work:

non-IBT C file generating:

func:
   ret

IBT-aware C file:

extern void func(void);
ptr = func;

clang has horrible magic to generate the indirect stub in the caller 
translation unit, and I think that's solving a problem that doesn't 
really need solving.  Sure, it's nifty, but it IMO should be opt-in, at 
least in a world where everyone agrees to recompile for CFI.  (Sure, 
using that hack for userspace IBT would be cute, and it would work.)

There's another tradeoff, though: errors like this are possible:

translation unit A:
void func(int)
{
   [body here]
}

translation unit B:
extern void func(char); /* wrong signature! */
ptr = func;
ptr();

The callee-generates-the-stub model will not catch this.  The 
caller-generates-the-stub model will.

> 
> Sure, and this is the kind of thing I mean: we had an awkward
> implementation of a meaningful defense, and we improved on it. I think
> it's important to be able to make these kinds of concessions to gain the
> defensive features they provide. And yes, we can continue to improve it,
> but in the meantime, we can stop entire classes of problems from
> happening to our user base.
> 

In retrospect, we should have put our feet down with stack protector, 
though.  The problem with it was gcc hardcoding things that shouldn't 
have been hardcoded, and the gcc fixes were trivial.
Peter Zijlstra Nov. 3, 2021, 8:35 a.m. UTC | #29
On Tue, Nov 02, 2021 at 05:20:05PM -0700, Andy Lutomirski wrote:
> I think that's a big mistake -- any sane ENDBR-using scheme would
> really prefer that ENDBR to be right next to the actual function body,
> and really any scheme would benefit due to better cache locality. 

Agreed, IBT/BTI want the landing pad in front of the actual function.

> But, more importantly, IMO any sane ENDBR-using scheme wants to
> generate the indirect stub as part of code gen for the actual
> function.

Sorta, I really want to be able to not have a landing pad for functions
whose address is never taken. At that point it doesn't matter if it gets
generated along with the function and then stripped/poisoned later, or
generated later.

As such, the landing pad should not be part of the function proper,
direct calls should never observe it.

Less landing pads is more better.
David Laight Nov. 3, 2021, 10:01 a.m. UTC | #30
From: Peter Zijlstra
> Sent: 03 November 2021 08:36
> 
> On Tue, Nov 02, 2021 at 05:20:05PM -0700, Andy Lutomirski wrote:
> > I think that's a big mistake -- any sane ENDBR-using scheme would
> > really prefer that ENDBR to be right next to the actual function body,
> > and really any scheme would benefit due to better cache locality.
> 
> Agreed, IBT/BTI want the landing pad in front of the actual function.
> 
> > But, more importantly, IMO any sane ENDBR-using scheme wants to
> > generate the indirect stub as part of code gen for the actual
> > function.
> 
> Sorta, I really want to be able to not have a landing pad for functions
> whose address is never taken. At that point it doesn't matter if it gets
> generated along with the function and then stripped/poisoned later, or
> generated later.
> 
> As such, the landing pad should not be part of the function proper,
> direct calls should never observe it.
> 
> Less landing pads is more better.

One problem is when a direct call is 'too far' for a call instruction.
IIRC this can happen in arm64 with modules (all 64bit except x86?).
So an indirect call has to be used instead - which needs the landing pad.
Although it may actually be better to put a trampoline (landing pad
+ near jump) elsewhere and have the module loader do the correct fixup.
(Is the loader already generating a trampoline in the module code?)
The function body can then be cache-line aligned - with its benefits.

Can't anything that can write instructions always use a retpoline
to implement a jump indirect to an arbitrary address?
(Not to mention just generating the required code rather than a call.)

AFAICT CFI is all about detecting invalid values in function pointer tables.
It doesn't really protect in any way from JIT code doing incorrect things.
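Conceptually (this is not clang's actual codegen, and the names are made up), the caller-side check amounts to a range-and-alignment test against the jump table built for the pointer's type:

```c
#include <stdint.h>

/* Conceptual sketch of a caller-side CFI check: a function pointer of
 * a given type is valid only if it points at one of the jump-table
 * entries generated for that type. */
struct cfi_table {
	uintptr_t base;			/* first jump-table entry */
	unsigned long count;		/* number of entries */
	unsigned long entry_size;	/* stride between entries */
};

static int cfi_check(const struct cfi_table *t, uintptr_t p)
{
	if (p < t->base)
		return 0;
	if ((p - t->base) % t->entry_size)
		return 0;			/* not entry-aligned */
	return (p - t->base) / t->entry_size < t->count;
}
```

Anything the check accepts is a table entry, never the function body itself, which is exactly why address comparison and text patching go wrong downstream.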

	David

Andy Lutomirski Nov. 3, 2021, 7:32 p.m. UTC | #31
On 11/3/21 01:35, Peter Zijlstra wrote:
> On Tue, Nov 02, 2021 at 05:20:05PM -0700, Andy Lutomirski wrote:
>> I think that's a big mistake -- any sane ENDBR-using scheme would
>> really prefer that ENDBR to be right next to the actual function body,
>> and really any scheme would benefit due to better cache locality.
> 
> Agreed, IBT/BTI want the landing pad in front of the actual function.
> 
>> But, more importantly, IMO any sane ENDBR-using scheme wants to
>> generate the indirect stub as part of code gen for the actual
>> function.
> 
> Sorta, I really want to be able to not have a landing pad for functions
> whose address is never taken. At that point it doesn't matter if it gets
> generated along with the function and then stripped/poisoned later, or
> generated later.

Stripping is conceptually straightforward even without LTO.

foo.indirect:
  ENDBR
foo:
  ...

and the linker learns (using magic function sections?) that, if 
foo.indirect is not referenced, then it should not be generated.  Or a 
very straightforward walk over all the relocations in an object to 
poison the unused .indirect entries could be done.  Making this work 
with DSOs, EXPORT_SYMBOL, etc will be somewhat nontrivial, but not 
impossible.

--Andy
Rasmus Villemoes Nov. 15, 2021, 1:09 p.m. UTC | #32
On 30/10/2021 10.16, Peter Zijlstra wrote:
> On Sat, Oct 30, 2021 at 09:47:58AM +0200, Peter Zijlstra wrote:

> So fundamentally I think the whole notion that the address of a function
> is something different than 'the address of that function' is an *utter*
> fail.

Agreed. IMHO, commits like 0a5b412891df, 981731129e0f should have been a
big red flag saying "ok, perhaps we shall not mess with the semantics of
function pointer comparisons". Yes yes, the commit log in cf68fffb66d6
did dutifully say "This may break code that relies on comparing
addresses passed from other components.".

But it did not spell out that this means, for example, that i2c bus
recovery now doesn't recover the bus but instead causes a NULL
pointer deref and kernel panic: Some i2c drivers set ->recover_bus =
i2c_generic_scl_recovery and provide ->sda_gpiod, ->scl_gpiod, then rely on

if (bri->scl_gpiod && bri->recover_bus == i2c_generic_scl_recovery)

in i2c-core-base.c causing ->get_scl/->set_scl to be filled in. That's
of course broken with ->recover_bus pointing to the module's jump table,
so when ->recover_bus is actually called, the bri->set_scl() call in
i2c_generic_scl_recovery() is a NULL pointer deref. [I've tested that
this actually happens on an imx8mp_evk board.]

And who knows how much other code happens to rely on function pointer
comparison working as specified in the C standard for correctness.
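The i2c case reduces to this pattern (a simplified sketch; the struct and function names stand in for struct i2c_bus_recovery_info and friends):

```c
#include <stddef.h>

typedef int (*recover_fn)(void *adap);

struct recovery_info {
	recover_fn recover_bus;
	void (*set_scl)(void *adap, int val);
};

static void default_set_scl(void *adap, int val) { (void)adap; (void)val; }
static int generic_scl_recovery(void *adap) { (void)adap; return 0; }

/* Mirrors the i2c-core check: the helpers are only filled in when the
 * core recognizes the generic recovery function by address. */
static void init_recovery(struct recovery_info *bri)
{
	if (bri->recover_bus == generic_scl_recovery)
		bri->set_scl = default_set_scl;
}
```

With per-module CFI jump tables, the driver's `->recover_bus` points at the module's table entry rather than the address the core compares against, so the `==` is false, `->set_scl` stays NULL, and the later `bri->set_scl()` call in the recovery path is the NULL deref Rasmus hit.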

Rasmus
diff mbox series

Patch

diff --git a/arch/x86/include/asm/static_call.h b/arch/x86/include/asm/static_call.h
index cbb67b6030f9..39ebe0511869 100644
--- a/arch/x86/include/asm/static_call.h
+++ b/arch/x86/include/asm/static_call.h
@@ -27,6 +27,7 @@ 
 	    ".globl " STATIC_CALL_TRAMP_STR(name) "		\n"	\
 	    STATIC_CALL_TRAMP_STR(name) ":			\n"	\
 	    insns "						\n"	\
+	    ".byte 0x53, 0x43, 0x54				\n"	\
 	    ".type " STATIC_CALL_TRAMP_STR(name) ", @function	\n"	\
 	    ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
 	    ".popsection					\n")
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index ea028e736831..9c407a33a774 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -56,10 +56,15 @@  static void __ref __static_call_transform(void *insn, enum insn_type type, void
 	text_poke_bp(insn, code, size, emulate);
 }
 
-static void __static_call_validate(void *insn, bool tail)
+static void __static_call_validate(void *insn, bool tail, bool tramp)
 {
 	u8 opcode = *(u8 *)insn;
 
+	if (tramp && memcmp(insn+5, "SCT", 3)) {
+		pr_err("trampoline signature fail");
+		BUG();
+	}
+
 	if (tail) {
 		if (opcode == JMP32_INSN_OPCODE ||
 		    opcode == RET_INSN_OPCODE)
@@ -74,7 +79,8 @@  static void __static_call_validate(void *insn, bool tail)
 	/*
 	 * If we ever trigger this, our text is corrupt, we'll probably not live long.
 	 */
-	WARN_ONCE(1, "unexpected static_call insn opcode 0x%x at %pS\n", opcode, insn);
+	pr_err("unexpected static_call insn opcode 0x%x at %pS\n", opcode, insn);
+	BUG();
 }
 
 static inline enum insn_type __sc_insn(bool null, bool tail)
@@ -97,12 +103,12 @@  void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
 	mutex_lock(&text_mutex);
 
 	if (tramp) {
-		__static_call_validate(tramp, true);
+		__static_call_validate(tramp, true, true);
 		__static_call_transform(tramp, __sc_insn(!func, true), func);
 	}
 
 	if (IS_ENABLED(CONFIG_HAVE_STATIC_CALL_INLINE) && site) {
-		__static_call_validate(site, tail);
+		__static_call_validate(site, tail, false);
 		__static_call_transform(site, __sc_insn(!func, tail), func);
 	}
 
diff --git a/tools/objtool/check.c b/tools/objtool/check.c
index fb3f251ea021..6d5951113d53 100644
--- a/tools/objtool/check.c
+++ b/tools/objtool/check.c
@@ -3310,6 +3310,9 @@  static bool ignore_unreachable_insn(struct objtool_file *file, struct instructio
 	if (!insn->func)
 		return false;
 
+	if (insn->func->static_call_tramp)
+		return true;
+
 	/*
 	 * CONFIG_UBSAN_TRAP inserts a UD2 when it sees
 	 * __builtin_unreachable().  The BUG() macro has an unreachable() after