
[v5,00/15] x86: Add support for Clang CFI

Message ID 20211013181658.1020262-1-samitolvanen@google.com (mailing list archive)

Message

Sami Tolvanen Oct. 13, 2021, 6:16 p.m. UTC
This series adds support for Clang's Control-Flow Integrity (CFI)
checking to x86_64. With CFI, the compiler injects a runtime
check before each indirect function call to ensure the target is
a valid function with the correct static type. This restricts
possible call targets and makes it more difficult for an attacker
to exploit bugs that allow the modification of stored function
pointers. For more details, see:

  https://clang.llvm.org/docs/ControlFlowIntegrity.html

Note that v5 is based on tip/master. The first two patches contain
objtool support for CFI, the remaining patches change function
declarations to use opaque types, fix type mismatch issues that
confuse the compiler, and disable CFI where it can't be used.

You can also pull this series from

  https://github.com/samitolvanen/linux.git x86-cfi-v5

---
Changes in v5:
- Renamed DECLARE_ASM_FUNC_SYMBOL() to DECLARE_NOT_CALLED_FROM_C()
  after some discussion about naming.

- Added an explicit include for <linux/cfi.h> to tracepoint.c.

- Updated commit messages based on feedback.

Changes in v4:
- Dropped the extable patch after the code was refactored in -tip.

- Switched to __section() instead of open-coding the attribute.

- Added an explicit ifdef for filtering out CC_FLAGS_CFI in
  purgatory for readability.

- Added a comment to arch_cfi_jump_reloc_offset() in objtool.

Changes in v3:
- Lowered the Clang version requirement to >= 13 after the missing
  compiler fix was backported there.

- Added DEFINE_CFI_IMMEDIATE_RETURN_STUB to address the issue
  with tp_stub_func in kernel/tracepoint.c.

- Renamed asm_func_t to asm_func_ptr.

- Changed extable handlers to use __cficanonical instead of
  disabling CFI for fixup_exception.

Changes in v2:
- Dropped the first objtool patch as the warnings were fixed in
  separate patches.

- Changed fix_cfi_relocs() in objtool to not rely on jump table
  symbols, and to return an error if it can't find a relocation.

- Fixed a build issue with ASM_STACK_FRAME_NON_STANDARD().

- Dropped workarounds for inline assembly references to
  address-taken static functions with CFI as this was fixed in
  the compiler.

- Changed the C declarations of non-callable functions to use
  opaque types and dropped the function_nocfi() patches.

- Changed ARCH_SUPPORTS_CFI_CLANG to depend on Clang >=14 for
  the compiler fixes.


Kees Cook (1):
  x86, relocs: Ignore __typeid__ relocations

Sami Tolvanen (14):
  objtool: Add CONFIG_CFI_CLANG support
  objtool: Add ASM_STACK_FRAME_NON_STANDARD
  linkage: Add DECLARE_NOT_CALLED_FROM_C
  cfi: Add DEFINE_CFI_IMMEDIATE_RETURN_STUB
  tracepoint: Exclude tp_stub_func from CFI checking
  ftrace: Use an opaque type for functions not callable from C
  lkdtm: Disable UNSET_SMEP with CFI
  lkdtm: Use an opaque type for lkdtm_rodata_do_nothing
  x86: Use an opaque type for functions not callable from C
  x86/purgatory: Disable CFI
  x86, module: Ignore __typeid__ relocations
  x86, cpu: Use LTO for cpu.c with CFI
  x86, kprobes: Fix optprobe_template_func type mismatch
  x86, build: Allow CONFIG_CFI_CLANG to be selected

 arch/x86/Kconfig                      |  1 +
 arch/x86/include/asm/ftrace.h         |  2 +-
 arch/x86/include/asm/idtentry.h       | 10 +++---
 arch/x86/include/asm/page_64.h        |  7 ++--
 arch/x86/include/asm/paravirt_types.h |  3 +-
 arch/x86/include/asm/processor.h      |  2 +-
 arch/x86/include/asm/proto.h          | 25 ++++++-------
 arch/x86/include/asm/uaccess_64.h     |  9 ++---
 arch/x86/kernel/alternative.c         |  2 +-
 arch/x86/kernel/ftrace.c              |  2 +-
 arch/x86/kernel/kprobes/opt.c         |  4 +--
 arch/x86/kernel/module.c              |  4 +++
 arch/x86/kernel/paravirt.c            |  4 +--
 arch/x86/kvm/emulate.c                |  4 +--
 arch/x86/kvm/kvm_emulate.h            |  9 ++---
 arch/x86/power/Makefile               |  2 ++
 arch/x86/purgatory/Makefile           |  4 +++
 arch/x86/tools/relocs.c               |  7 ++++
 arch/x86/xen/enlighten_pv.c           |  6 ++--
 arch/x86/xen/xen-ops.h                | 10 +++---
 drivers/misc/lkdtm/bugs.c             |  2 +-
 drivers/misc/lkdtm/lkdtm.h            |  2 +-
 drivers/misc/lkdtm/perms.c            |  2 +-
 drivers/misc/lkdtm/rodata.c           |  2 +-
 include/asm-generic/vmlinux.lds.h     | 11 ++++++
 include/linux/cfi.h                   | 13 +++++++
 include/linux/ftrace.h                |  7 ++--
 include/linux/linkage.h               | 13 +++++++
 include/linux/objtool.h               |  6 ++++
 kernel/cfi.c                          | 24 ++++++++++++-
 kernel/tracepoint.c                   |  6 ++--
 tools/include/linux/objtool.h         |  6 ++++
 tools/objtool/arch/x86/decode.c       | 17 +++++++++
 tools/objtool/elf.c                   | 51 +++++++++++++++++++++++++++
 tools/objtool/include/objtool/arch.h  |  3 ++
 tools/objtool/include/objtool/elf.h   |  2 +-
 36 files changed, 218 insertions(+), 66 deletions(-)


base-commit: 880e2b8e3151574b9e3419d1fbb06726ddee8b03

Comments

Kees Cook Oct. 13, 2021, 7:07 p.m. UTC | #1
On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote:
> This series adds support for Clang's Control-Flow Integrity (CFI)
> checking to x86_64. With CFI, the compiler injects a runtime
> check before each indirect function call to ensure the target is
> a valid function with the correct static type. This restricts
> possible call targets and makes it more difficult for an attacker
> to exploit bugs that allow the modification of stored function
> pointers. For more details, see:
> 
>   https://clang.llvm.org/docs/ControlFlowIntegrity.html
> 
> Note that v5 is based on tip/master. The first two patches contain
> objtool support for CFI, the remaining patches change function
> declarations to use opaque types, fix type mismatch issues that
> confuse the compiler, and disable CFI where it can't be used.

x86 folks: I'd prefer this series went via -tip, but I can carry it for
-next as well. What would you like to do here? I think it's ready.

Thanks!

-Kees
Alexander Lobakin Oct. 19, 2021, 10:06 a.m. UTC | #2
From: Sami Tolvanen <samitolvanen@google.com>
Date: Wed, 13 Oct 2021 11:16:43 -0700

> This series adds support for Clang's Control-Flow Integrity (CFI)
> checking to x86_64. With CFI, the compiler injects a runtime
> check before each indirect function call to ensure the target is
> a valid function with the correct static type. This restricts
> possible call targets and makes it more difficult for an attacker
> to exploit bugs that allow the modification of stored function
> pointers. For more details, see:
>
>   https://clang.llvm.org/docs/ControlFlowIntegrity.html
>
> Note that v5 is based on tip/master. The first two patches contain
> objtool support for CFI, the remaining patches change function
> declarations to use opaque types, fix type mismatch issues that
> confuse the compiler, and disable CFI where it can't be used.
>
> You can also pull this series from
>
>   https://github.com/samitolvanen/linux.git x86-cfi-v5

Hi,

I found [0] while I was testing Peter's retpoline series, and wanted
to ask / double-check whether that is because I'm using Clang CFI for
x86 on the unsupported Clang 12. I suppose it is fixed in 13?

[0] https://lore.kernel.org/all/20211019094038.80569-1-alobakin@pm.me

Thanks,
Al
Sami Tolvanen Oct. 19, 2021, 3:40 p.m. UTC | #3
On Tue, Oct 19, 2021 at 3:06 AM Alexander Lobakin <alobakin@pm.me> wrote:
>
> From: Sami Tolvanen <samitolvanen@google.com>
> Date: Wed, 13 Oct 2021 11:16:43 -0700
>
> > This series adds support for Clang's Control-Flow Integrity (CFI)
> > checking to x86_64. With CFI, the compiler injects a runtime
> > check before each indirect function call to ensure the target is
> > a valid function with the correct static type. This restricts
> > possible call targets and makes it more difficult for an attacker
> > to exploit bugs that allow the modification of stored function
> > pointers. For more details, see:
> >
> >   https://clang.llvm.org/docs/ControlFlowIntegrity.html
> >
> > Note that v5 is based on tip/master. The first two patches contain
> > objtool support for CFI, the remaining patches change function
> > declarations to use opaque types, fix type mismatch issues that
> > confuse the compiler, and disable CFI where it can't be used.
> >
> > You can also pull this series from
> >
> >   https://github.com/samitolvanen/linux.git x86-cfi-v5
>
> Hi,
>
> I found [0] while I was testing Peter's retpoline series, and wanted
> to ask / double-check whether that is because I'm using Clang CFI for
> x86 on the unsupported Clang 12. I suppose it is fixed in 13?
>
> [0] https://lore.kernel.org/all/20211019094038.80569-1-alobakin@pm.me

No, it works exactly the same in later compiler versions. I also
replied to that thread, but this looks like another instance where
using an opaque type instead of a function declaration fixes the
issue, and probably makes sense as the thunks are not directly
callable from C code.

Sami
Alexander Lobakin Oct. 21, 2021, 10:27 a.m. UTC | #4
From: Sami Tolvanen <samitolvanen@google.com>
Date: Wed, 13 Oct 2021 11:16:43 -0700

> This series adds support for Clang's Control-Flow Integrity (CFI)
> checking to x86_64. With CFI, the compiler injects a runtime
> check before each indirect function call to ensure the target is
> a valid function with the correct static type. This restricts
> possible call targets and makes it more difficult for an attacker
> to exploit bugs that allow the modification of stored function
> pointers. For more details, see:
>
>   https://clang.llvm.org/docs/ControlFlowIntegrity.html
>
> Note that v5 is based on tip/master. The first two patches contain
> objtool support for CFI, the remaining patches change function
> declarations to use opaque types, fix type mismatch issues that
> confuse the compiler, and disable CFI where it can't be used.
>
> You can also pull this series from
>
>   https://github.com/samitolvanen/linux.git x86-cfi-v5

[ snip ]

I've been using it since the end of May on my x86_64, so for v5
(with the retpoline thunk prototypes changed to opaque types):

Reviewed-by: Alexander Lobakin <alobakin@pm.me>
Tested-by: Alexander Lobakin <alobakin@pm.me>

Thanks!
Al
Peter Zijlstra Oct. 26, 2021, 8:16 p.m. UTC | #5
On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote:
> This series adds support for Clang's Control-Flow Integrity (CFI)
> checking to x86_64. With CFI, the compiler injects a runtime
> check before each indirect function call to ensure the target is
> a valid function with the correct static type. This restricts
> possible call targets and makes it more difficult for an attacker
> to exploit bugs that allow the modification of stored function
> pointers. For more details, see:
> 
>   https://clang.llvm.org/docs/ControlFlowIntegrity.html

So, if I understand this right, the compiler emits, for every function
two things: 1) the actual function and 2) a jump-table entry.

Then, every time the address of a function is taken, 2) is given instead
of the expected 1), right?

But how does this work with things like static_call(), which we give a
function address (now a jump-table entry) and use that to write direct
call instructions?

Should not this jump-table thingy get converted to an actual function
address somewhere around arch_static_call_transform() ? This also seems
relevant for arm64 (which already has CLANG_CFI supported) given:

  https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org

Or am I still not understanding this CFI thing?
David Laight Oct. 27, 2021, 10:02 a.m. UTC | #6
From: Peter Zijlstra
> Sent: 26 October 2021 21:16
> 
> On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote:
> > This series adds support for Clang's Control-Flow Integrity (CFI)
> > checking to x86_64. With CFI, the compiler injects a runtime
> > check before each indirect function call to ensure the target is
> > a valid function with the correct static type. This restricts
> > possible call targets and makes it more difficult for an attacker
> > to exploit bugs that allow the modification of stored function
> > pointers. For more details, see:
> >
> >   https://clang.llvm.org/docs/ControlFlowIntegrity.html
> 
> So, if I understand this right, the compiler emits, for every function
> two things: 1) the actual function and 2) a jump-table entry.
> 
> Then, every time the address of a function is taken, 2) is given instead
> of the expected 1), right?
> 
> But how does this work with things like static_call(), which we give a
> function address (now a jump-table entry) and use that to write direct
> call instructions?
> 
> Should not this jump-table thingy get converted to an actual function
> address somewhere around arch_static_call_transform() ? This also seems
> relevant for arm64 (which already has CLANG_CFI supported) given:
> 
>   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
> 
> Or am I still not understanding this CFI thing?

From what I remember the compiler adds code prior to every indirect
jump to check that the function address is in the list of valid
functions (with a suitable prototype - or some similar check).

So it adds a run-time search to every indirect call.

What would be more sensible would be a hidden extra parameter that is
a hash of the prototype that is saved just before the entry point.
Then the caller and called function could both check.
That is still a moderate cost for an indirect call.

Anything that can write asm can get around any check - it just gets a
bit harder.
But overwritten function pointers would be detected - which I assume
is the main threat.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Peter Zijlstra Oct. 27, 2021, 10:17 a.m. UTC | #7
On Wed, Oct 27, 2021 at 10:02:56AM +0000, David Laight wrote:
> From: Peter Zijlstra
> > Sent: 26 October 2021 21:16
> > 
> > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote:
> > > This series adds support for Clang's Control-Flow Integrity (CFI)
> > > checking to x86_64. With CFI, the compiler injects a runtime
> > > check before each indirect function call to ensure the target is
> > > a valid function with the correct static type. This restricts
> > > possible call targets and makes it more difficult for an attacker
> > > to exploit bugs that allow the modification of stored function
> > > pointers. For more details, see:
> > >
> > >   https://clang.llvm.org/docs/ControlFlowIntegrity.html
> > 
> > So, if I understand this right, the compiler emits, for every function
> > two things: 1) the actual function and 2) a jump-table entry.
> > 
> > Then, every time the address of a function is taken, 2) is given instead
> > of the expected 1), right?
> > 
> > But how does this work with things like static_call(), which we give a
> > function address (now a jump-table entry) and use that to write direct
> > call instructions?
> > 
> > Should not this jump-table thingy get converted to an actual function
> > address somewhere around arch_static_call_transform() ? This also seems
> > relevant for arm64 (which already has CLANG_CFI supported) given:
> > 
> >   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
> > 
> > Or am I still not understanding this CFI thing?
> 
> From what I remember the compiler adds code prior to every indirect jump
> to check that the function address is in the list of valid functions
> (with a suitable prototype - or some similar check).

It definitely mucks about with the address too; see here:

  https://lkml.kernel.org/r/YW6a67fGzM2AyHot@hirez.programming.kicks-ass.net

I'm thinking static_call() wants the real actual function address before
it writes instructions.
Mark Rutland Oct. 27, 2021, 12:05 p.m. UTC | #8
On Tue, Oct 26, 2021 at 10:16:22PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote:
> > This series adds support for Clang's Control-Flow Integrity (CFI)
> > checking to x86_64. With CFI, the compiler injects a runtime
> > check before each indirect function call to ensure the target is
> > a valid function with the correct static type. This restricts
> > possible call targets and makes it more difficult for an attacker
> > to exploit bugs that allow the modification of stored function
> > pointers. For more details, see:
> > 
> >   https://clang.llvm.org/docs/ControlFlowIntegrity.html
> 
> So, if I understand this right, the compiler emits, for every function
> two things: 1) the actual function and 2) a jump-table entry.
> 
> Then, every time the address of a function is taken, 2) is given instead
> of the expected 1), right?

Yes, and we had to bodge around this with function_nocfi() to get the
actual function address.

Really there should be a compiler intrinsic or attribute for this, given
the compiler has all the relevant information available. On arm64 we
had to use inline asm to generate the address...

Taking a step back, it'd be nicer if we didn't have the jump-table shim
at all, and had some SW landing pad (e.g. a NOP with some magic bytes)
in the callees that the caller could check for. Then function pointers
would remain callable in call cases, and we could explcitly add landing
pads to asm to protect those. I *think* that's what the grsecurity folk
do, but I could be mistaken.

> But how does this work with things like static_call(), which we give a
> function address (now a jump-table entry) and use that to write direct
> call instructions?
> 
> Should not this jump-table thingy get converted to an actual function
> address somewhere around arch_static_call_transform() ? This also seems
> relevant for arm64 (which already has CLANG_CFI supported) given:
> 
>   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org

Ugh, yeah, we'll need to do the function_nocfi() dance somewhere...

Thanks,
Mark.
Ard Biesheuvel Oct. 27, 2021, 12:22 p.m. UTC | #9
On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote:
>
> On Tue, Oct 26, 2021 at 10:16:22PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote:
> > > This series adds support for Clang's Control-Flow Integrity (CFI)
> > > checking to x86_64. With CFI, the compiler injects a runtime
> > > check before each indirect function call to ensure the target is
> > > a valid function with the correct static type. This restricts
> > > possible call targets and makes it more difficult for an attacker
> > > to exploit bugs that allow the modification of stored function
> > > pointers. For more details, see:
> > >
> > >   https://clang.llvm.org/docs/ControlFlowIntegrity.html
> >
> > So, if I understand this right, the compiler emits, for every function
> > two things: 1) the actual function and 2) a jump-table entry.
> >
> > Then, every time the address of a function is taken, 2) is given instead
> > of the expected 1), right?
>
> Yes, and we had to bodge around this with function_nocfi() to get the
> actual function address.
>
> Really there should be a compiler intrinsic or attribute for this, given
> the compiler has all the relevant information available. On arm64 we
> had to use inline asm to generate the address...
>
> Taking a step back, it'd be nicer if we didn't have the jump-table shim
> at all, and had some SW landing pad (e.g. a NOP with some magic bytes)
> in the callees that the caller could check for. Then function pointers
> would remain callable in all cases, and we could explicitly add landing
> pads to asm to protect those. I *think* that's what the grsecurity folk
> do, but I could be mistaken.
>
> > But how does this work with things like static_call(), which we give a
> > function address (now a jump-table entry) and use that to write direct
> > call instructions?
> >
> > Should not this jump-table thingy get converted to an actual function
> > address somewhere around arch_static_call_transform() ? This also seems
> > relevant for arm64 (which already has CLANG_CFI supported) given:
> >
> >   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
>
> Ugh, yeah, we'll need to do the function_nocfi() dance somewhere...
>

Sadly, that only works on symbol names, so we cannot use it to strip
CFI-ness from void *func arguments passed into the static call API,
unfortunately.

Also, function_nocfi() seems broken in the sense that it relies on the
symbol existing in the global namespace, which may not be true for
function symbols with static linkage, as they can be optimized away
entirely. I think the same might apply to function symbols with
external linkage under LTO.
Peter Zijlstra Oct. 27, 2021, 12:46 p.m. UTC | #10
On Wed, Oct 27, 2021 at 01:05:15PM +0100, Mark Rutland wrote:
> On Tue, Oct 26, 2021 at 10:16:22PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote:
> > > This series adds support for Clang's Control-Flow Integrity (CFI)
> > > checking to x86_64. With CFI, the compiler injects a runtime
> > > check before each indirect function call to ensure the target is
> > > a valid function with the correct static type. This restricts
> > > possible call targets and makes it more difficult for an attacker
> > > to exploit bugs that allow the modification of stored function
> > > pointers. For more details, see:
> > > 
> > >   https://clang.llvm.org/docs/ControlFlowIntegrity.html
> > 
> > So, if I understand this right, the compiler emits, for every function
> > two things: 1) the actual function and 2) a jump-table entry.
> > 
> > Then, every time the address of a function is taken, 2) is given instead
> > of the expected 1), right?
> 
> Yes, and we had to bodge around this with function_nocfi() to get the
> actual function address.

The patch set under consideration seems to have forgotten to provide one
for x86 :/

> Really there should be a compiler intrinsic or attribute for this, given
> the compiler has all the relevant information available. On arm64 we
> had to use inline asm to generate the address...

Agreed, this *really* shouldn't be an arch asm hack trying to undo
something the compiler did.
Peter Zijlstra Oct. 27, 2021, 12:48 p.m. UTC | #11
On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote:
> On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote:

> > > Should not this jump-table thingy get converted to an actual function
> > > address somewhere around arch_static_call_transform() ? This also seems
> > > relevant for arm64 (which already has CLANG_CFI supported) given:
> > >
> > >   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
> >
> > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere...
> >
> 
> Sadly, that only works on symbol names, so we cannot use it to strip
> CFI-ness from void *func arguments passed into the static call API,
> unfortunately.

Right, and while mostly static_call_update() is used, which is a macro
and could possibly be used to wrap this, we very much rely on
__static_call_update() also working without that wrapper and then we're
up a creek without no paddles.
David Laight Oct. 27, 2021, 12:55 p.m. UTC | #12
From: Mark Rutland
> Sent: 27 October 2021 13:05
...
> Taking a step back, it'd be nicer if we didn't have the jump-table shim
> at all, and had some SW landing pad (e.g. a NOP with some magic bytes)
> in the callees that the caller could check for. Then function pointers
> would remain callable in all cases, and we could explicitly add landing
> pads to asm to protect those. I *think* that's what the grsecurity folk
> do, but I could be mistaken.

It doesn't need to be a 'landing pad'.
The 'magic value' could be at 'label - 8'.

Provided you can generate the required value it could be added
to asm functions.
(Or you could patch it at startup by stealing the value from
a C function.)

Depending on the threat model, you may even want the called function
to do some sanity checks on the caller.

I suspect that anything you do is easy to subvert by anything that
can actually write asm.
So if the real threat is overwritten function tables then something
relatively simple is adequate.

	David

Peter Zijlstra Oct. 27, 2021, 1:04 p.m. UTC | #13
On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote:
> > On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote:
> 
> > > > Should not this jump-table thingy get converted to an actual function
> > > > address somewhere around arch_static_call_transform() ? This also seems
> > > > relevant for arm64 (which already has CLANG_CFI supported) given:
> > > >
> > > >   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
> > >
> > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere...
> > >
> > 
> > Sadly, that only works on symbol names, so we cannot use it to strip
> > CFI-ness from void *func arguments passed into the static call API,
> > unfortunately.
> 
> Right, and while mostly static_call_update() is used, which is a macro
> and could possibly be used to wrap this, we very much rely on
> __static_call_update() also working without that wrapper and then we're
> up a creek without no paddles.

Specifically, we support code like:

struct foo {
	void (*func1)(args1);
	void (*func2)(args2);
}

struct foo global_foo;

...

DEFINE_STATIC_CALL_NULL(func1, *global_foo.func1);

...

__init foo_init()
{
	// whatever code that fills out foo

	static_call_update(func1, global_foo.func1);
}

...

hot_function()
{
	...
	static_call(func1)(args1);
	...
}

cold_function()
{
	...
	global_foo.func1(args1);
	...
}

And there is no way I can see this working sanely with CFI as presented.

Even though the above uses static_call_update(), we can no longer use
function_nocfi() on the @func argument, because it's not a symbol, it's
already a function pointer.

Nor can we fill global_foo.func1 with function_nocfi() because then
cold_function() goes sideways.
Mark Rutland Oct. 27, 2021, 1:17 p.m. UTC | #14
On Wed, Oct 27, 2021 at 12:55:17PM +0000, David Laight wrote:
> From: Mark Rutland
> > Sent: 27 October 2021 13:05
> ...
> > Taking a step back, it'd be nicer if we didn't have the jump-table shim
> > at all, and had some SW landing pad (e.g. a NOP with some magic bytes)
> > in the callees that the caller could check for. Then function pointers
> > would remain callable in all cases, and we could explicitly add landing
> > pads to asm to protect those. I *think* that's what the grsecurity folk
> > do, but I could be mistaken.
> 
> It doesn't need to be a 'landing pad'.
> The 'magic value' could be at 'label - 8'.

Sure; I'd intended to mean the general case of something at some fixed
offset from the entrypoint, either before or after, potentially but not
necessarily inline in the executed instruction stream.

Mark.
Ard Biesheuvel Oct. 27, 2021, 1:30 p.m. UTC | #15
On Wed, 27 Oct 2021 at 15:05, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote:
> > > On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote:
> >
> > > > > Should not this jump-table thingy get converted to an actual function
> > > > > address somewhere around arch_static_call_transform() ? This also seems
> > > > > relevant for arm64 (which already has CLANG_CFI supported) given:
> > > > >
> > > > >   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
> > > >
> > > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere...
> > > >
> > >
> > > Sadly, that only works on symbol names, so we cannot use it to strip
> > > CFI-ness from void *func arguments passed into the static call API,
> > > unfortunately.
> >
> > Right, and while mostly static_call_update() is used, which is a macro
> > and could possibly be used to wrap this, we very much rely on
> > __static_call_update() also working without that wrapper and then we're
> > up a creek without no paddles.
>
> Specifically, we support code like:
>
> struct foo {
>         void (*func1)(args1);
>         void (*func2)(args2);
> }
>
> struct foo global_foo;
>
> ...
>
> DEFINE_STATIC_CALL_NULL(func1, *global_foo.func1);
>
> ...
>
> __init foo_init()
> {
>         // whatever code that fills out foo
>
>         static_call_update(func1, global_foo.func1);
> }
>
> ...
>
> hot_function()
> {
>         ...
>         static_call(func1)(args1);
>         ...
> }
>
> cold_function()
> {
>         ...
>         global_foo.func1(args1);
>         ...
> }
>
> And there is no way I can see this working sanely with CFI as presented.
>
> Even though the above uses static_call_update(), we can no longer use
> function_nocfi() on the @func argument, because it's not a symbol, it's
> already a function pointer.
>
> Nor can we fill global_foo.func1 with function_nocfi() because then
> cold_function() goes sideways.
>

As far as I can tell from playing around with Clang, the stubs can
actually be executed directly; they just jump to the actual function.
The compiler simply generates a jump table for each prototype that
appears in the code as the target of an indirect jump, and checks
whether the target appears in the list.

E.g., the code below

void foo(void) {}
void bar(int) {}
void baz(int) {}
void (* volatile fn1)(void) = foo;
void (* volatile fn2)(int) = bar;

int main(int argc, char *argv[])
{
  fn1();
  fn2 = baz;
  fn2(-1);
}

produces

0000000000400594 <foo.cfi>:
  400594: d65f03c0 ret

0000000000400598 <bar.cfi>:
  400598: d65f03c0 ret

000000000040059c <baz.cfi>:
  40059c: d65f03c0 ret

00000000004005a0 <main>:
  4005a0: a9bf7bfd stp x29, x30, [sp, #-16]!

// First indirect call
  4005a4: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
  4005a8: f9401508 ldr x8, [x8, #40]
  4005ac: 90000009 adrp x9, 400000 <__abi_tag-0x278>
  4005b0: 91182129 add x9, x9, #0x608
  4005b4: 910003fd mov x29, sp
  4005b8: eb09011f cmp x8, x9
  4005bc: 54000241 b.ne 400604 <main+0x64>  // b.any
  4005c0: d63f0100 blr x8

// Assignment of fn2
  4005c4: 90000009 adrp x9, 400000 <__abi_tag-0x278>
  4005c8: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
  4005cc: 91184129 add x9, x9, #0x610
  4005d0: f9001909 str x9, [x8, #48]

// Second indirect call
  4005d4: f9401908 ldr x8, [x8, #48]
  4005d8: 90000009 adrp x9, 400000 <__abi_tag-0x278>
  4005dc: 91183129 add x9, x9, #0x60c
  4005e0: cb090109 sub x9, x8, x9
  4005e4: 93c90929 ror x9, x9, #2
  4005e8: f100053f cmp x9, #0x1
  4005ec: 540000c8 b.hi 400604 <main+0x64>  // b.pmore
  4005f0: 12800000 mov w0, #0xffffffff            // #-1
  4005f4: d63f0100 blr x8


  4005f8: 2a1f03e0 mov w0, wzr
  4005fc: a8c17bfd ldp x29, x30, [sp], #16
  400600: d65f03c0 ret
  400604: d4200020 brk #0x1

0000000000400608 <__typeid__ZTSFvvE_global_addr>:
  400608: 17ffffe3 b 400594 <foo.cfi>

000000000040060c <__typeid__ZTSFviE_global_addr>:
  40060c: 17ffffe3 b 400598 <bar.cfi>
  400610: 17ffffe3 b 40059c <baz.cfi>

So it looks like taking the address is fine, although not optimal due
to the additional jump. We could fudge around that by checking the
opcode at the target of the call, or token-pasting ".cfi" after the
symbol name in the static_call_update() macro, but it doesn't look
like anything is terminally broken tbh.
Peter Zijlstra Oct. 27, 2021, 2:03 p.m. UTC | #16
On Wed, Oct 27, 2021 at 03:30:11PM +0200, Ard Biesheuvel wrote:

> As far as I can tell from playing around with Clang, the stubs can
> actually be executed directly, 

I had just finished reading the clang docs which suggest as much and was
about to try what the compiler actually generates.

> they just jump to the actual function.
> The compiler simply generates a jump table for each prototype that
> appears in the code as the target of an indirect jump, and checks
> whether the target appears in the list.
> 
> E.g., the code below
> 
> void foo(void) {}
> void bar(int) {}
> void baz(int) {}
> void (* volatile fn1)(void) = foo;
> void (* volatile fn2)(int) = bar;
> 
> int main(int argc, char *argv[])
> {
>   fn1();
>   fn2 = baz;
>   fn2(-1);
> }
> 
> produces
> 
> 0000000000400594 <foo.cfi>:
>   400594: d65f03c0 ret
> 
> 0000000000400598 <bar.cfi>:
>   400598: d65f03c0 ret
> 
> 000000000040059c <baz.cfi>:
>   40059c: d65f03c0 ret

Right, so these are the actual functions ^.

> 00000000004005a0 <main>:
>   4005a0: a9bf7bfd stp x29, x30, [sp, #-16]!
> 
> // First indirect call
>   4005a4: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
>   4005a8: f9401508 ldr x8, [x8, #40]
>   4005ac: 90000009 adrp x9, 400000 <__abi_tag-0x278>
>   4005b0: 91182129 add x9, x9, #0x608
>   4005b4: 910003fd mov x29, sp
>   4005b8: eb09011f cmp x8, x9
>   4005bc: 54000241 b.ne 400604 <main+0x64>  // b.any
>   4005c0: d63f0100 blr x8

That's impenetrable to me, sorry.

> // Assignment of fn2
>   4005c4: 90000009 adrp x9, 400000 <__abi_tag-0x278>
>   4005c8: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
>   4005cc: 91184129 add x9, x9, #0x610
>   4005d0: f9001909 str x9, [x8, #48]

I'm struggling here, x9 points to the branch at 400610, but then what?

x8 is in .data somewhere?

> // Second indirect call
>   4005d4: f9401908 ldr x8, [x8, #48]
>   4005d8: 90000009 adrp x9, 400000 <__abi_tag-0x278>
>   4005dc: 91183129 add x9, x9, #0x60c
>   4005e0: cb090109 sub x9, x8, x9
>   4005e4: 93c90929 ror x9, x9, #2
>   4005e8: f100053f cmp x9, #0x1
>   4005ec: 540000c8 b.hi 400604 <main+0x64>  // b.pmore
>   4005f0: 12800000 mov w0, #0xffffffff            // #-1
>   4005f4: d63f0100 blr x8
> 
> 
>   4005f8: 2a1f03e0 mov w0, wzr
>   4005fc: a8c17bfd ldp x29, x30, [sp], #16
>   400600: d65f03c0 ret
>   400604: d4200020 brk #0x1


> 0000000000400608 <__typeid__ZTSFvvE_global_addr>:
>   400608: 17ffffe3 b 400594 <foo.cfi>
> 
> 000000000040060c <__typeid__ZTSFviE_global_addr>:
>   40060c: 17ffffe3 b 400598 <bar.cfi>
>   400610: 17ffffe3 b 40059c <baz.cfi>

And these are the stubs per type.

> So it looks like taking the address is fine, although not optimal due
> to the additional jump.

Right.

> We could fudge around that by checking the
> opcode at the target of the call, or token paste ".cfi" after the
> symbol name in the static_call_update() macro, but it doesn't look
> like anything is terminally broken tbh.

Agreed, since the jump table entries are actually executable it 'works'.

I really don't like that extra jump though, so I think I really do want
that nocfi_ptr() thing. And going by:

  https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls

and the above, that might be possible (on x86) with something like:

/*
 * Turns a Clang CFI jump-table entry into an actual function pointer.
 * These jump-table entries are simply jmp.d32 instruction with their
 * relative offset pointing to the actual function, therefore decode the
 * instruction to find the real function.
 */
static __always_inline void *nocfi_ptr(void *func)
{
	union text_poke_insn insn = *(union text_poke_insn *)func;

	return func + sizeof(insn) + insn.disp;
}

But really, that wants to be a compiler intrinsic.
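For what it's worth, the decode arithmetic is easy to sanity-check in
userspace. Below is a hedged stand-in: struct jmp_d32 merely mirrors the
5-byte "jmp rel32" layout that the kernel's union text_poke_insn covers
(which isn't available outside the kernel), plus the opcode guard:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Userspace sketch of the nocfi_ptr() idea: a CFI jump-table entry on
 * x86-64 is a 5-byte "jmp rel32" (opcode 0xe9), so the real function
 * lives at entry + 5 + disp. struct jmp_d32 is an invented stand-in
 * for the kernel's union text_poke_insn.
 */
struct jmp_d32 {
	uint8_t opcode;		/* 0xe9 for jmp.d32 */
	int32_t disp;		/* signed 32-bit relative displacement */
} __attribute__((packed));

static void *nocfi_ptr(void *func)
{
	struct jmp_d32 insn;

	memcpy(&insn, func, sizeof(insn));
	if (insn.opcode != 0xe9)	/* not a jump-table entry */
		return func;
	return (char *)func + sizeof(insn) + insn.disp;
}
```

Feeding it a synthetic `e9 10 00 00 00` buffer yields buffer + 5 + 0x10,
matching the arithmetic above (little-endian assumed).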
Ard Biesheuvel Oct. 27, 2021, 2:18 p.m. UTC | #17
On Wed, 27 Oct 2021 at 16:03, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 27, 2021 at 03:30:11PM +0200, Ard Biesheuvel wrote:
>
> > As far as I can tell from playing around with Clang, the stubs can
> > actually be executed directly,
>
> I had just finished reading the clang docs which suggest as much and was
> about to try what the compiler actually generates.
>
> > they just jump to the actual function.
> > The compiler simply generates a jump table for each prototype that
> > appears in the code as the target of an indirect jump, and checks
> > whether the target appears in the list.
> >
> > E.g., the code below
> >
> > void foo(void) {}
> > void bar(int) {}
> > void baz(int) {}
> > void (* volatile fn1)(void) = foo;
> > void (* volatile fn2)(int) = bar;
> >
> > int main(int argc, char *argv[])
> > {
> >   fn1();
> >   fn2 = baz;
> >   fn2(-1);
> > }
> >
> > produces
> >
> > 0000000000400594 <foo.cfi>:
> >   400594: d65f03c0 ret
> >
> > 0000000000400598 <bar.cfi>:
> >   400598: d65f03c0 ret
> >
> > 000000000040059c <baz.cfi>:
> >   40059c: d65f03c0 ret
>
> Right, so these are the actual functions ^.
>
> > 00000000004005a0 <main>:
> >   4005a0: a9bf7bfd stp x29, x30, [sp, #-16]!
> >
> > // First indirect call
> >   4005a4: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
> >   4005a8: f9401508 ldr x8, [x8, #40]
> >   4005ac: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> >   4005b0: 91182129 add x9, x9, #0x608
> >   4005b4: 910003fd mov x29, sp
> >   4005b8: eb09011f cmp x8, x9
> >   4005bc: 54000241 b.ne 400604 <main+0x64>  // b.any
> >   4005c0: d63f0100 blr x8
>
> That's impenetrable to me, sorry.
>

This loads the value of fn1 in x8, and takes the address of the jump
table in x9. Since it is only one entry long, it does a simple compare
to check whether x8 appears in the jump table, and branches to the BRK
at the end if they are different.
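The same pattern scales to tables with more than one entry, as in the
second call below: a subtract, a rotate-right by 2, and one unsigned
compare fold the range check and the 4-byte alignment check together. A
hedged C model of that instruction sequence (invented names, not kernel
code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the AArch64 CFI check: sub (offset from table base),
 * ror #2 (rotate the 4-byte-alignment bits into the top bits), then
 * an unsigned compare against n-1. Unaligned or out-of-range targets
 * produce a huge rotated value and fail the compare (b.hi -> brk).
 */
static bool cfi_in_jump_table(uintptr_t target, uintptr_t table, unsigned n)
{
	uintptr_t off = target - table;
	uintptr_t rot = (off >> 2) | (off << (sizeof(off) * 8 - 2));

	return rot <= n - 1;
}
```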

> > // Assignment of fn2
> >   4005c4: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> >   4005c8: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
> >   4005cc: 91184129 add x9, x9, #0x610
> >   4005d0: f9001909 str x9, [x8, #48]
>
> I'm struggling here, x9 points to the branch at 400610, but then what?
>
> x8 is in .data somewhere?
>

This takes the address of the jump table entry of 'baz' in x9, and
stores it in fn2 whose address is taken in x8.


> > // Second indirect call
> >   4005d4: f9401908 ldr x8, [x8, #48]
> >   4005d8: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> >   4005dc: 91183129 add x9, x9, #0x60c
> >   4005e0: cb090109 sub x9, x8, x9
> >   4005e4: 93c90929 ror x9, x9, #2
> >   4005e8: f100053f cmp x9, #0x1
> >   4005ec: 540000c8 b.hi 400604 <main+0x64>  // b.pmore
> >   4005f0: 12800000 mov w0, #0xffffffff            // #-1
> >   4005f4: d63f0100 blr x8
> >
> >
> >   4005f8: 2a1f03e0 mov w0, wzr
> >   4005fc: a8c17bfd ldp x29, x30, [sp], #16
> >   400600: d65f03c0 ret
> >   400604: d4200020 brk #0x1
>
>
> > 0000000000400608 <__typeid__ZTSFvvE_global_addr>:
> >   400608: 17ffffe3 b 400594 <foo.cfi>
> >
> > 000000000040060c <__typeid__ZTSFviE_global_addr>:
> >   40060c: 17ffffe3 b 400598 <bar.cfi>
> >   400610: 17ffffe3 b 40059c <baz.cfi>
>
> And these are the stubs per type.
>
> > So it looks like taking the address is fine, although not optimal due
> > to the additional jump.
>
> Right.
>

... although it does seem that function_nocfi() doesn't actually work
as expected, given that we want the address of <func>.cfi and not the
address of the stub.

> > We could fudge around that by checking the
> > opcode at the target of the call, or token paste ".cfi" after the
> > symbol name in the static_call_update() macro, but it doesn't look
> > like anything is terminally broken tbh.
>
> Agreed, since the jump table entries are actually executable it 'works'.
>
> I really don't like that extra jump though, so I think I really do want
> that nocfi_ptr() thing. And going by:
>
>   https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls
>
> and the above, that might be possible (on x86) with something like:
>
> /*
>  * Turns a Clang CFI jump-table entry into an actual function pointer.
>  * These jump-table entries are simply jmp.d32 instruction with their
>  * relative offset pointing to the actual function, therefore decode the
>  * instruction to find the real function.
>  */
> static __always_inline void *nocfi_ptr(void *func)
> {
>         union text_poke_insn insn = *(union text_poke_insn *)func;
>
>         return func + sizeof(insn) + insn.disp;
> }
>
> But really, that wants to be a compiler intrinsic.

Agreed. We could easily do something similar on arm64, but I'd prefer
to avoid that too.
Peter Zijlstra Oct. 27, 2021, 2:36 p.m. UTC | #18
On Wed, Oct 27, 2021 at 04:18:17PM +0200, Ard Biesheuvel wrote:
> On Wed, 27 Oct 2021 at 16:03, Peter Zijlstra <peterz@infradead.org> wrote:

> > /*
> >  * Turns a Clang CFI jump-table entry into an actual function pointer.
> >  * These jump-table entries are simply jmp.d32 instruction with their
> >  * relative offset pointing to the actual function, therefore decode the
> >  * instruction to find the real function.
> >  */
> > static __always_inline void *nocfi_ptr(void *func)
> > {
> >         union text_poke_insn insn = *(union text_poke_insn *)func;

also, probably, for the paranoid amongst us:

	if (WARN_ON_ONCE(insn.opcode != JMP32_INSN_OPCODE))
		return func;

> >         return func + sizeof(insn) + insn.disp;
> > }
> >
> > But really, that wants to be a compiler intrinsic.
> 
> Agreed. We could easily do something similar on arm64, but I'd prefer
> to avoid that too.

Right, because on x86 CET-IBT will force that entry to have a different
form (and size), similar on arm64 with BTI.

I was thinking the compiler really should implicitly do this conversion
when a function pointer is cast to an integer type. But barring that, we
really need an intrinsic to perform this.

Also, perhaps the compiler should admit it's doing dodgy crap and
introduce the notion of address spaces and use the type system to
separate these two forms.
Sami Tolvanen Oct. 27, 2021, 3:50 p.m. UTC | #19
On Wed, Oct 27, 2021 at 7:18 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> On Wed, 27 Oct 2021 at 16:03, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Wed, Oct 27, 2021 at 03:30:11PM +0200, Ard Biesheuvel wrote:
> >
> > > As far as I can tell from playing around with Clang, the stubs can
> > > actually be executed directly,
> >
> > I had just finished reading the clang docs which suggest as much and was
> > about to try what the compiler actually generates.
> >
> > > they just jump to the actual function.
> > > The compiler simply generates a jump table for each prototype that
> > > appears in the code as the target of an indirect jump, and checks
> > > whether the target appears in the list.
> > >
> > > E.g., the code below
> > >
> > > void foo(void) {}
> > > void bar(int) {}
> > > void baz(int) {}
> > > void (* volatile fn1)(void) = foo;
> > > void (* volatile fn2)(int) = bar;
> > >
> > > int main(int argc, char *argv[])
> > > {
> > >   fn1();
> > >   fn2 = baz;
> > >   fn2(-1);
> > > }
> > >
> > > produces
> > >
> > > 0000000000400594 <foo.cfi>:
> > >   400594: d65f03c0 ret
> > >
> > > 0000000000400598 <bar.cfi>:
> > >   400598: d65f03c0 ret
> > >
> > > 000000000040059c <baz.cfi>:
> > >   40059c: d65f03c0 ret
> >
> > Right, so these are the actual functions ^.
> >
> > > 00000000004005a0 <main>:
> > >   4005a0: a9bf7bfd stp x29, x30, [sp, #-16]!
> > >
> > > // First indirect call
> > >   4005a4: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
> > >   4005a8: f9401508 ldr x8, [x8, #40]
> > >   4005ac: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> > >   4005b0: 91182129 add x9, x9, #0x608
> > >   4005b4: 910003fd mov x29, sp
> > >   4005b8: eb09011f cmp x8, x9
> > >   4005bc: 54000241 b.ne 400604 <main+0x64>  // b.any
> > >   4005c0: d63f0100 blr x8
> >
> > That's impenetrable to me, sorry.
> >
>
> This loads the value of fn1 in x8, and takes the address of the jump
> table in x9. Since it is only one entry long, it does a simple compare
> to check whether x8 appears in the jump table, and branches to the BRK
> at the end if they are different.
>
> > > // Assignment of fn2
> > >   4005c4: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> > >   4005c8: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
> > >   4005cc: 91184129 add x9, x9, #0x610
> > >   4005d0: f9001909 str x9, [x8, #48]
> >
> > I'm struggling here, x9 points to the branch at 400610, but then what?
> >
> > x8 is in .data somewhere?
> >
>
> This takes the address of the jump table entry of 'baz' in x9, and
> stores it in fn2 whose address is taken in x8.
>
>
> > > // Second indirect call
> > >   4005d4: f9401908 ldr x8, [x8, #48]
> > >   4005d8: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> > >   4005dc: 91183129 add x9, x9, #0x60c
> > >   4005e0: cb090109 sub x9, x8, x9
> > >   4005e4: 93c90929 ror x9, x9, #2
> > >   4005e8: f100053f cmp x9, #0x1
> > >   4005ec: 540000c8 b.hi 400604 <main+0x64>  // b.pmore
> > >   4005f0: 12800000 mov w0, #0xffffffff            // #-1
> > >   4005f4: d63f0100 blr x8
> > >
> > >
> > >   4005f8: 2a1f03e0 mov w0, wzr
> > >   4005fc: a8c17bfd ldp x29, x30, [sp], #16
> > >   400600: d65f03c0 ret
> > >   400604: d4200020 brk #0x1
> >
> >
> > > 0000000000400608 <__typeid__ZTSFvvE_global_addr>:
> > >   400608: 17ffffe3 b 400594 <foo.cfi>
> > >
> > > 000000000040060c <__typeid__ZTSFviE_global_addr>:
> > >   40060c: 17ffffe3 b 400598 <bar.cfi>
> > >   400610: 17ffffe3 b 40059c <baz.cfi>
> >
> > And these are the stubs per type.
> >
> > > So it looks like taking the address is fine, although not optimal due
> > > to the additional jump.
> >
> > Right.
> >
>
> ... although it does seem that function_nocfi() doesn't actually work
> as expected, given that we want the address of <func>.cfi and not the
> address of the stub.

This is because the example wasn't compiled with
-fno-sanitize-cfi-canonical-jump-tables, which we use in the kernel.
With non-canonical jump tables, <func> continues to point to the
function and <func>.cfi_jt points to the jump table, and therefore,
function_nocfi() returns the raw function address.
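For anyone skimming, the difference in symbol layout is roughly this
(illustrative only, not an actual disassembly):

```
Canonical jump tables (Clang default, as in Ard's example):
  foo.cfi:      <actual function body>
  foo:          b foo.cfi      ; address-taken "foo" points at the stub

Non-canonical (-fno-sanitize-cfi-canonical-jump-tables, as in the kernel):
  foo:          <actual function body>
  foo.cfi_jt:   b foo          ; only CFI checks reference the stub
```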

> > > We could fudge around that by checking the
> > > opcode at the target of the call, or token paste ".cfi" after the
> > > symbol name in the static_call_update() macro, but it doesn't look
> > > like anything is terminally broken tbh.
> >
> > Agreed, since the jump table entries are actually executable it 'works'.
> >
> > I really don't like that extra jump though, so I think I really do want
> > that nocfi_ptr() thing. And going by:
> >
> >   https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls
> >
> > and the above, that might be possible (on x86) with something like:
> >
> > /*
> >  * Turns a Clang CFI jump-table entry into an actual function pointer.
> >  * These jump-table entries are simply jmp.d32 instruction with their
> >  * relative offset pointing to the actual function, therefore decode the
> >  * instruction to find the real function.
> >  */
> > static __always_inline void *nocfi_ptr(void *func)
> > {
> >         union text_poke_insn insn = *(union text_poke_insn *)func;
> >
> >         return func + sizeof(insn) + insn.disp;
> > }
> >
> > But really, that wants to be a compiler intrinsic.
>
> Agreed. We could easily do something similar on arm64, but I'd prefer
> to avoid that too.

I'll see what we can do. Note that the compiler built-in we previously
discussed would have semantics similar to function_nocfi(). It would
return the raw function address from a symbol name, but it wouldn't
decode the address from an arbitrary pointer, so this would require
something different.

Sami
Ard Biesheuvel Oct. 27, 2021, 3:55 p.m. UTC | #20
On Wed, 27 Oct 2021 at 17:50, Sami Tolvanen <samitolvanen@google.com> wrote:
>
> On Wed, Oct 27, 2021 at 7:18 AM Ard Biesheuvel <ardb@kernel.org> wrote:
> >
> > On Wed, 27 Oct 2021 at 16:03, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Wed, Oct 27, 2021 at 03:30:11PM +0200, Ard Biesheuvel wrote:
> > >
> > > > As far as I can tell from playing around with Clang, the stubs can
> > > > actually be executed directly,
> > >
> > > I had just finished reading the clang docs which suggest as much and was
> > > about to try what the compiler actually generates.
> > >
> > > > they just jump to the actual function.
> > > > The compiler simply generates a jump table for each prototype that
> > > > appears in the code as the target of an indirect jump, and checks
> > > > whether the target appears in the list.
> > > >
> > > > E.g., the code below
> > > >
> > > > void foo(void) {}
> > > > void bar(int) {}
> > > > void baz(int) {}
> > > > void (* volatile fn1)(void) = foo;
> > > > void (* volatile fn2)(int) = bar;
> > > >
> > > > int main(int argc, char *argv[])
> > > > {
> > > >   fn1();
> > > >   fn2 = baz;
> > > >   fn2(-1);
> > > > }
> > > >
> > > > produces
> > > >
> > > > 0000000000400594 <foo.cfi>:
> > > >   400594: d65f03c0 ret
> > > >
> > > > 0000000000400598 <bar.cfi>:
> > > >   400598: d65f03c0 ret
> > > >
> > > > 000000000040059c <baz.cfi>:
> > > >   40059c: d65f03c0 ret
> > >
> > > Right, so these are the actual functions ^.
> > >
> > > > 00000000004005a0 <main>:
> > > >   4005a0: a9bf7bfd stp x29, x30, [sp, #-16]!
> > > >
> > > > // First indirect call
> > > >   4005a4: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
> > > >   4005a8: f9401508 ldr x8, [x8, #40]
> > > >   4005ac: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> > > >   4005b0: 91182129 add x9, x9, #0x608
> > > >   4005b4: 910003fd mov x29, sp
> > > >   4005b8: eb09011f cmp x8, x9
> > > >   4005bc: 54000241 b.ne 400604 <main+0x64>  // b.any
> > > >   4005c0: d63f0100 blr x8
> > >
> > > That's impenetrable to me, sorry.
> > >
> >
> > This loads the value of fn1 in x8, and takes the address of the jump
> > table in x9. Since it is only one entry long, it does a simple compare
> > to check whether x8 appears in the jump table, and branches to the BRK
> > at the end if they are different.
> >
> > > > // Assignment of fn2
> > > >   4005c4: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> > > >   4005c8: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
> > > >   4005cc: 91184129 add x9, x9, #0x610
> > > >   4005d0: f9001909 str x9, [x8, #48]
> > >
> > > I'm struggling here, x9 points to the branch at 400610, but then what?
> > >
> > > x8 is in .data somewhere?
> > >
> >
> > This takes the address of the jump table entry of 'baz' in x9, and
> > stores it in fn2 whose address is taken in x8.
> >
> >
> > > > // Second indirect call
> > > >   4005d4: f9401908 ldr x8, [x8, #48]
> > > >   4005d8: 90000009 adrp x9, 400000 <__abi_tag-0x278>
> > > >   4005dc: 91183129 add x9, x9, #0x60c
> > > >   4005e0: cb090109 sub x9, x8, x9
> > > >   4005e4: 93c90929 ror x9, x9, #2
> > > >   4005e8: f100053f cmp x9, #0x1
> > > >   4005ec: 540000c8 b.hi 400604 <main+0x64>  // b.pmore
> > > >   4005f0: 12800000 mov w0, #0xffffffff            // #-1
> > > >   4005f4: d63f0100 blr x8
> > > >
> > > >
> > > >   4005f8: 2a1f03e0 mov w0, wzr
> > > >   4005fc: a8c17bfd ldp x29, x30, [sp], #16
> > > >   400600: d65f03c0 ret
> > > >   400604: d4200020 brk #0x1
> > >
> > >
> > > > 0000000000400608 <__typeid__ZTSFvvE_global_addr>:
> > > >   400608: 17ffffe3 b 400594 <foo.cfi>
> > > >
> > > > 000000000040060c <__typeid__ZTSFviE_global_addr>:
> > > >   40060c: 17ffffe3 b 400598 <bar.cfi>
> > > >   400610: 17ffffe3 b 40059c <baz.cfi>
> > >
> > > And these are the stubs per type.
> > >
> > > > So it looks like taking the address is fine, although not optimal due
> > > > to the additional jump.
> > >
> > > Right.
> > >
> >
> > ... although it does seem that function_nocfi() doesn't actually work
> > as expected, given that we want the address of <func>.cfi and not the
> > address of the stub.
>
> This is because the example wasn't compiled with
> -fno-sanitize-cfi-canonical-jump-tables, which we use in the kernel.
> With non-canonical jump tables, <func> continues to point to the
> function and <func>.cfi_jt points to the jump table, and therefore,
> function_nocfi() returns the raw function address.
>

Ah excellent. So that means that we don't need function_nocfi() at
all, given that
- statically allocated references (i.e., DEFINE_STATIC_CALL()) will
refer to the function directly;
- runtime assignments can decode the target of the *func pointer and
strip off the initial branch.

It would still be nice to have an intrinsic for that, or some variable
attribute that signifies that assigning the address of a function to
it will produce the actual function rather than the jump table entry.

> > > > We could fudge around that by checking the
> > > > opcode at the target of the call, or token paste ".cfi" after the
> > > > symbol name in the static_call_update() macro, but it doesn't look
> > > > like anything is terminally broken tbh.
> > >
> > > Agreed, since the jump table entries are actually executable it 'works'.
> > >
> > > I really don't like that extra jump though, so I think I really do want
> > > that nocfi_ptr() thing. And going by:
> > >
> > >   https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls
> > >
> > > and the above, that might be possible (on x86) with something like:
> > >
> > > /*
> > >  * Turns a Clang CFI jump-table entry into an actual function pointer.
> > >  * These jump-table entries are simply jmp.d32 instruction with their
> > >  * relative offset pointing to the actual function, therefore decode the
> > >  * instruction to find the real function.
> > >  */
> > > static __always_inline void *nocfi_ptr(void *func)
> > > {
> > >         union text_poke_insn insn = *(union text_poke_insn *)func;
> > >
> > >         return func + sizeof(insn) + insn.disp;
> > > }
> > >
> > > But really, that wants to be a compiler intrinsic.
> >
> > Agreed. We could easily do something similar on arm64, but I'd prefer
> > to avoid that too.
>
> I'll see what we can do. Note that the compiler built-in we previously
> discussed would have semantics similar to function_nocfi(). It would
> return the raw function address from a symbol name, but it wouldn't
> decode the address from an arbitrary pointer, so this would require
> something different.
>
> Sami
Kees Cook Oct. 27, 2021, 5:11 p.m. UTC | #21
On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote:
> > > On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote:
> > 
> > > > > Should not this jump-table thingy get converted to an actual function
> > > > > address somewhere around arch_static_call_transform() ? This also seems
> > > > > relevant for arm64 (which already has CLANG_CFI supported) given:
> > > > >
> > > > >   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
> > > >
> > > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere...
> > > >
> > > 
> > > Sadly, that only works on symbol names, so we cannot use it to strip
> > > CFI-ness from void *func arguments passed into the static call API,
> > > unfortunately.
> > 
> > Right, and while mostly static_call_update() is used, which is a macro
> > and could possibly be used to wrap this, we very much rely on
> > __static_call_update() also working without that wrapper and then we're
> > up a creek without no paddles.
> 
> Specifically, we support code like:
> 
> struct foo {
> 	void (*func1)(args1);
> 	void (*func2)(args2);
> }
> 
> struct foo global_foo;

And global_foo is intentionally non-const?

> 
> ...
> 
> DEFINE_STATIC_CALL_NULL(func1, *global_foo.func1);
> 
> ...
> 
> __init foo_init()
> {
> 	// whatever code that fills out foo
> 
> 	static_call_update(func1, global_foo.func1);
> }
> 
> ...
> 
> hot_function()
> {
> 	...
> 	static_cal(func1)(args1);
> 	...
> }
> 
> cold_function()
> {
> 	...
> 	global_foo->func1(args1);
> 	...
> }

If global_foo is non-const, then the static call stuff is just an
obfuscated indirect call. The attack CFI attempts to block is having
a controlled write flaw turn into controlled execution. For example,
in the above, an attacker would use a flaw that could aim a write to
global_foo->func1, and then get the kernel to take an execution path
that executes global_foo->func1 (i.e. through cold_function).

If static_call_update() accepts an arbitrary function argument, then it
needs to perform the same validation that is currently being injected
at indirect call sites to avoid having static calls weaken CFI. If
things are left alone, static calls will just insert a direct call to
the jump table, and things "work" (with an extra jump), but nothing
is actually protected (it just requires the attacker to locate a
narrower set of kernel call flows that includes static_call_update). Any
attacker write to global_foo->func1, followed by a static_call_update(),
will create a direct call (with no CFI checking) to the attacker's
destination. (It's likely harder to find the needed execution path to
induce a static_call_update(), but that shouldn't stop us from trying
to protect it.)

Getting a "jump table to actual function" primitive only avoids the added
jump -- all the CFI checking remains bypassed. If static call function
address storage isn't const, for CFI to work as expected the update()
routine will need to do the same checking that is done at indirect call
sites when extracting the "real" function for writing into a direct call.
(i.e. it will need to be function prototype aware, be able to find the
real matching jump table and size, and verify the given function is
actually in the table. Just dereferencing the jump table to find the
address isn't a protection: the attacker just has to write to func1 with
a pointer to an attacker-constructed jump table.)

I think it's not unreasonable to solve this in phases, though. Given that
static calls work with CFI as-is (albeit with an extra jump), and that
static calls do offer some hardening of existing indirect call targets (by
narrowing the execution path needed to actually bring the func1 value into
the kernel execution flow), I don't see either feature blocking the other.

Adding proper CFI function prototype checking to static calls seems
like a security improvement with or without CFI, and a minor performance
improvement under CFI.

To avoid all of this, though, it'd be better if static calls only
switched between one of a per-site const list of possible functions,
which would make it a much tighter hand-rolled CFI system itself. :)
(i.e. it would operate from a specific and short list of expected
functions rather than the "best effort" approach of matching function
prototypes as done by Clang CFI.)
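A hedged userspace sketch of that per-site allow-list idea (all names
here are invented; the real static_call machinery patches direct call
sites rather than storing a pointer):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch: each call site carries a short, const list of the only
 * functions it may ever switch between. Because the list is const,
 * an attacker write cannot extend the allowed set.
 */
typedef void (*func1_t)(int);

static void impl_a(int x) { (void)x; }
static void impl_b(int x) { (void)x; }

static func1_t const func1_allowed[] = { impl_a, impl_b };

/* returns 0 on success, -1 if fn is not in the allowed list */
static int checked_static_call_update(func1_t *slot, func1_t fn)
{
	size_t i;

	for (i = 0; i < sizeof(func1_allowed) / sizeof(func1_allowed[0]); i++) {
		if (func1_allowed[i] == fn) {
			*slot = fn;	/* real API would patch a direct call */
			return 0;
		}
	}
	return -1;
}
```

An update to anything outside the const list is refused, so a corrupted
pointer can't be laundered into a direct call.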
Peter Zijlstra Oct. 27, 2021, 9:21 p.m. UTC | #22
On Wed, Oct 27, 2021 at 10:11:28AM -0700, Kees Cook wrote:
> On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote:
> > > On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote:
> > > > On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote:
> > > 
> > > > > > Should not this jump-table thingy get converted to an actual function
> > > > > > address somewhere around arch_static_call_transform() ? This also seems
> > > > > > relevant for arm64 (which already has CLANG_CFI supported) given:
> > > > > >
> > > > > >   https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org
> > > > >
> > > > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere...
> > > > >
> > > > 
> > > > Sadly, that only works on symbol names, so we cannot use it to strip
> > > > CFI-ness from void *func arguments passed into the static call API,
> > > > unfortunately.
> > > 
> > > Right, and while mostly static_call_update() is used, which is a macro
> > > and could possibly be used to wrap this, we very much rely on
> > > __static_call_update() also working without that wrapper and then we're
> > > up a creek without no paddles.
> > 
> > Specifically, we support code like:
> > 
> > struct foo {
> > 	void (*func1)(args1);
> > 	void (*func2)(args2);
> > }
> > 
> > struct foo global_foo;
> 
> And global_foo is intentionally non-const?

Yep, since depending on the init function it can discover and stuff in
a wild variety of functions.

> > 
> > ...
> > 
> > DEFINE_STATIC_CALL_NULL(func1, *global_foo.func1);
> > 
> > ...
> > 
> > __init foo_init()
> > {
> > 	// whatever code that fills out foo
> > 
> > 	static_call_update(func1, global_foo.func1);
> > }
> > 
> > ...
> > 
> > hot_function()
> > {
> > 	...
> > 	static_cal(func1)(args1);
> > 	...
> > }
> > 
> > cold_function()
> > {
> > 	...
> > 	global_foo->func1(args1);
> > 	...
> > }
> 
> If global_foo is non-const, then the static call stuff is just an
> obfuscated indirect call.

It is not. The target is determined once, at boot time, depending on the
hardware, it then never changes. The static_call() results in a direct
call to that function.

> The attack CFI attempts to block is having
> a controlled write flaw turn into controlled execution. For example,
> in the above, an attacker would use a flaw that could aim a write to
> global_foo->func1, and then get the kernel to take an execution path
> that executes global_foo->func1 (i.e. through cold_function).

I know, and CFI works for cold_function.

> If static_call_update() accepts an arbitrary function argument, then it
> needs to perform the same validation that is currently being injected
> at indirect call sites to avoid having static calls weaken CFI.

static_call_update() is a macro and has compile time signature checks,
the actual function is __static_call_update() and someone can go add
extra validation in there if they so desire.

I did have this patch:

  https://lkml.kernel.org/r/20210904105529.GA5106@worktop.programming.kicks-ass.net

but I never did get around to finishing it. Although looking at it now,
I suppose static_call_seal() might be a better name.

And you're worried about __static_call_update() over text_poke_bp()
because?

> Getting a "jump table to actual function" primitive only avoids the added
> jump -- all the CFI checking remains bypassed.

Exactly, so the extra jump serves no purpose and needs to go. Doubly so
because people are looking at static_call() to undo some of the
performance damage introduced by CFI :-)

> If static call function
> address storage isn't const, for CFI to work as expected the update()
> routine will need to do the same checking that is done at indirect call
> sites when extracting the "real" function for writing into a direct call.

I've mentioned static_call like a hundred times in these CFI threads..
if you want to do CFI on them, go ahead. I'm just not sure the C type
system is up for that; you'll have to somehow frob the signature symbol
into __static_call_update(), something like __builtin_type_symbol().

> To avoid all of this, though, it'd be better if static calls only
> switched between one of a per-site const list of possible functions,
> which would make it a much tighter hand-rolled CFI system itself. :)
> (i.e. it would operate from a specific and short list of expected
> functions rather than the "best effort" approach of matching function
> prototypes as done by Clang CFI.)

That sounds like a ton of painful ugly.
David Laight Oct. 27, 2021, 9:31 p.m. UTC | #23
From: Mark Rutland
> Sent: 27 October 2021 14:18
> 
> On Wed, Oct 27, 2021 at 12:55:17PM +0000, David Laight wrote:
> > From: Mark Rutland
> > > Sent: 27 October 2021 13:05
> > ...
> > > Taking a step back, it'd be nicer if we didn't have the jump-table shim
> > > at all, and had some SW landing pad (e.g. a NOP with some magic bytes)
> > > in the callees that the caller could check for. Then function pointers
> > > would remain callable in call cases, and we could explcitly add landing
> > > pads to asm to protect those. I *think* that's what the grsecurity folk
> > > do, but I could be mistaken.
> >
> > It doesn't need to be a 'landing pad'.
> > The 'magic value' could be at 'label - 8'.
> 
> Sure; I'd intended to mean the general case of something at some fixed
> offset from the entrypoint, either before or after, potentially but not
> necessarily inline in the executed instruction stream.

What you really want is to be able to read the value using the I-cache
so as not to pollute the D-cache with code bytes and to avoid having
both an I-cache and D-cache miss at the same time for the same memory.

Even if the I-cache read took an extra clock (or two) I suspect it
would be an overall gain.
This is also true for code that uses pc-relative instructions to
read constants - common in arm-64.

Not sure any hardware lets you do that though :-(

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Kees Cook Oct. 27, 2021, 10:27 p.m. UTC | #24
On Wed, Oct 27, 2021 at 11:21:26PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 27, 2021 at 10:11:28AM -0700, Kees Cook wrote:
> > On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote:
> > > [...]
> > > cold_function()
> > > {
> > > 	...
> > > 	global_foo->func1(args1);
> > > 	...
> > > }
> > 
> > If global_foo is non-const, then the static call stuff is just an
> > obfuscated indirect call.
> 
> It is not. The target is determined once, at boot time, depending on the
> hardware, it then never changes. The static_call() results in a direct
> call to that function.

Right, I mean it is _effectively_ an indirect call in the sense that the
call destination is under the control of a writable memory location.
Yes, static_call_update() must be called to make the change, hence why I
didn't wave my arms around when static_call was originally added. It's a
net benefit for the kernel: actual indirect calls now become updatable
direct calls. This means reachability becomes much harder for attackers;
I'm totally a fan. :)

> > If static_call_update() accepts an arbitrary function argument, then it
> > needs to perform the same validation that is currently being injected
> > at indirect call sites to avoid having static calls weaken CFI.
> 
> static_call_update() is a macro and has compile time signature checks,
> the actual function is __static_call_update() and someone can go add
> extra validation in there if they so desire.
> 
> I did have this patch:
> 
>   https://lkml.kernel.org/r/20210904105529.GA5106@worktop.programming.kicks-ass.net
> 
> but I never did get around to finishing it. Although looking at it now,
> I suppose static_call_seal() might be a better name.

Right -- though wouldn't just adding __ro_after_init do the same?

DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init;

If you wanted the clean WARN_ON, __static_call_update() could check if
the struct static_call_key is in a non-writable region.

> And you're worried about __static_call_update() over text_poke_bp()
> because?

Both have a meaningful lack of exposure to common attacker-reachable
code paths (e.g., it's likely rare that there are ways attackers can
control the target/content of a text_poke_bp(): alternatives and ftrace,
both of which tend to use read-only content).

static_call_update() is limited per-site with a fixed set of hardcoded
targets (i.e. any place static_call() is used), but the content
is effectively arbitrary (coming from writable memory). AIUI, it's
intended to be used more widely than text_poke_bp(), and it seems like
call sites using static_call_update() will become less rare in future
kernels. (Currently it seems mostly focused on sched and pmu, which
don't traditionally have much userspace control).

So, they're kind of opposite, but for me the question is for examining
the changes to reachability and what kind of primitives their existence
provides an attacker. IMO, static_calls are a net gain over indirect
calls (from some usually writable function pointer) because it becomes
a direct call. Its risk doesn't match a "real" direct call, though,
since they do still have the (much more narrow) risk of having something
call static_call_update() from a writable function pointer as in your
example, but that's still a defensively better position than a regular
indirect call.

What I'm hoping to see from static_calls is maybe defaulting to being
ro_after_init, and explicitly marking those that can't be, which makes
auditing easier and puts things into a default-safe state (i.e. both
fixed targets and fixed content).

> > Getting a "jump table to actual function" primitive only avoids the added
> > jump -- all the CFI checking remains bypassed.
> 
> Exactly, so the extra jump serves no purpose and needs to go. Doubly so
> because people are looking at static_call() to undo some of the
> performance damage introduced by CFI :-)

Well, sure, it's inefficient, not _broken_. :) And can you point to the
"use static_calls because CFI is slow" discussion? I'd be curious to read
through that; the past performance testing had shown that CFI performance
overhead tends to be less than the gains of LTO. So compared to a "stock"
kernel, things should be about the same if not faster.

That said, I understand I'm talking to you, and you're paying very close
attention to the scheduler, etc. :) I'm sure there are specific
workloads that look terrible under CFI!

> > If static call function
> > address storage isn't const, for CFI to work as expected the update()
> > routine will need to do the same checking that is done at indirect call
> > sites when extracting the "real" function for writing into a direct call.
> 
> I've mentioned static_call like a hundred times in these CFI threads..
> if you want to do CFI on them, go ahead. I'm just not sure the C type
> system is up for that, you'll have to somehow frob the signature symbol
> into __static_call_update(), something like __builtin_type_symbol().
> 
> > To avoid all of this, though, it'd be better if static calls only
> > switched between one of a per-site const list of possible functions,
> > which would make it a much tighter hand-rolled CFI system itself. :)
> > (i.e. it would operate from a specific and short list of expected
> > functions rather than the "best effort" approach of matching function
> > prototypes as done by Clang CFI.)
> 
> That sounds like a ton of painful ugly.

Right, I know. That's why I'm saying that I see these features as
certainly related, but not at odds with each other. CFI doesn't protect
static_call sites, but static_call sites are already more well protected
than their earlier indirect calls.

The only note I had was that if static_call wants to dig into the jump
table, it likely needs to be static_call-specific: we don't want a general
way to do that without knowing we have some reasonable bounds on the
behavior of the code using it. I think it's fine to have static_calls
do this, though it'd be nice if they could be a little more defensive
(generally) at the same time.
Peter Zijlstra Oct. 28, 2021, 11:09 a.m. UTC | #25
On Wed, Oct 27, 2021 at 03:27:59PM -0700, Kees Cook wrote:

> Right -- though wouldn't just adding __ro_after_init do the same?
> 
> DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init;

That breaks modules (and your jump_label patch doing the same is
similarly broken).

When a module is loaded that uses the static_call(), it needs to
register its .static_call_sites range with the static_call_key which
requires modifying it.
Kees Cook Oct. 28, 2021, 5:12 p.m. UTC | #26
On Thu, Oct 28, 2021 at 01:09:39PM +0200, Peter Zijlstra wrote:
> On Wed, Oct 27, 2021 at 03:27:59PM -0700, Kees Cook wrote:
> 
> > Right -- though wouldn't just adding __ro_after_init do the same?
> > 
> > DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init;
> 
> That breaks modules (and your jump_label patch doing the same is
> similarly broken).

Well that's no fun. :) I'd like to understand this better so I can fix
it!

> 
> When a module is loaded that uses the static_call(), it needs to
> register its .static_call_sites range with the static_call_key which
> requires modifying it.

Reading static_call_add_module() leaves me with even more questions. ;)

It looks like module static calls need to write to kernel text? I don't
understand. Is this when a module is using a non-module key for a call
site? And in that case, this happens:

                key |= s_key & STATIC_CALL_SITE_FLAGS;

Where "key" is not in the module?

And the flags can be:

#define STATIC_CALL_SITE_TAIL 1UL       /* tail call */
#define STATIC_CALL_SITE_INIT 2UL       /* init section */

But aren't these per-site attributes? Why are they stored per-key?

                        if (!init && static_call_is_init(site))
                                continue;

			...

                        arch_static_call_transform(site_addr, NULL, func,
                                                   static_call_is_tail(site));
Peter Zijlstra Oct. 28, 2021, 8:29 p.m. UTC | #27
On Thu, Oct 28, 2021 at 10:12:32AM -0700, Kees Cook wrote:
> On Thu, Oct 28, 2021 at 01:09:39PM +0200, Peter Zijlstra wrote:
> > On Wed, Oct 27, 2021 at 03:27:59PM -0700, Kees Cook wrote:
> > 
> > > Right -- though wouldn't just adding __ro_after_init do the same?
> > > 
> > > DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init;
> > 
> > That breaks modules (and your jump_label patch doing the same is
> > similarly broken).
> 
> Well that's no fun. :) I'd like to understand this better so I can fix
> it!
> 
> > 
> > When a module is loaded that uses the static_call(), it needs to
> > register its .static_call_sites range with the static_call_key which
> > requires modifying it.
> 
> Reading static_call_add_module() leaves me with even more questions. ;)

Yes, that function is highly magical..

> It looks like module static calls need to write to kernel text? 

No, they need to modify the static_call_key though.

> I don't
> understand. Is this when a module is using a non-module key for a call
> site? And in that case, this happens:
> 
>                 key |= s_key & STATIC_CALL_SITE_FLAGS;
> 
> Where "key" is not in the module?
> 
> And the flags can be:
> 
> #define STATIC_CALL_SITE_TAIL 1UL       /* tail call */
> #define STATIC_CALL_SITE_INIT 2UL       /* init section */
> 
> But aren't these per-site attributes? Why are they stored per-key?

They are per site, but stored in the key pointer.

So static_call has (and jump_label is nearly identical):

struct static_call_site {
	s32 addr;
	s32 key;
};

struct static_call_mod {
	struct static_call_mod *next;
	struct module *mod;
	struct static_call_site *sites;
};

struct static_call_key {
	void *func;
	union {
		unsigned long type;
		struct static_call_mod *mods;
		struct static_call_site *sites;
	};
};

__SCT_##name() trampolines		(no analog with jump_label)

.static_call_sites section
.static_call_tramp_key section		(no analog with jump_label)

Where the key holds the current function pointer and a pointer to either
an array of static_call_site or a pointer to a static_call_mod.

Now, a key observation is that all these data structures are word
aligned, which means we have at least 2 lsb bits to play with. For
static_call_key::{mods,sites} the LSB indicates which, 0:mods, 1:sites.

Then the .static_call_sites section is an array of struct
static_call_site sorted by the static_call_key pointer.

The static_call_sites holds relative displacements, but represents:

	struct static_call_key *key;
	unsigned long call_address;

Now, since code (on x86) is variable length, there are no spare bits in
the code address, but since static_call_key is aligned, we have spare
bits. It is those bits we use to encode TAIL (Bit0) and INIT (Bit1).

If INIT, the address points to an __init section and we shouldn't try
and touch it after those have been freed or bad stuff happens.

If TAIL, it's a tail-call and we get to write a jump instruction instead
of a call instruction.

So, objtool builds .static_call_sites at build time, then at init (or
module load) time we sort the array by static_call_key pointer, such
that we get consecutive ranges per key. We iterate the array and every
time the key pointer changes, we -- already having the key pointer --
set key->sites to the first.

Now, kernel init of static_call happens *really* early and memory
allocation doesn't work yet, which is why we have that {mods,sites}
thing. Therefore, when the first module gets loaded, we need to allocate
a struct static_call_mod for the kernel (mod==NULL) and transfer the
sites pointer to it and change key to a mods pointer.

So one possible solution would be to have a late init (but before RO),
that, re-iterates the sites array and pre-allocates the kernel
static_call_mod structure.  That way, static_call_key gets changed to a
mods pointer and wouldn't ever need changing after that, only the
static_call_mod (which isn't RO) gets changed when modules get
added/deleted.

The above is basically identical to jump_labels. However static_call()
has one more trick:

  EXPORT_STATIC_CALL_TRAMP()

That exports the trampoline symbol, but not the static_call_key data
structure. The result is that modules can use the static_call(), but
cannot use static_call_update() because they cannot get at the key.

In this case objtool cannot correctly put the static_call_key address in
the static_call_site; what it does instead is store the trampoline
address (there's a 1:1 relation between keys and trampolines). And then we
use the .static_call_tramp_key section to find a mapping from trampoline
to key and rewrite the site to be 'right'. All this happens before
sorting it on key obv.

Hope that clarifies things, instead of making it worse :-)
Peter Zijlstra Oct. 29, 2021, 8:03 p.m. UTC | #28
On Wed, Oct 27, 2021 at 08:50:17AM -0700, Sami Tolvanen wrote:
> On Wed, Oct 27, 2021 at 7:18 AM Ard Biesheuvel <ardb@kernel.org> wrote:

> > > /*
> > >  * Turns a Clang CFI jump-table entry into an actual function pointer.
> > >  * These jump-table entries are simply jmp.d32 instruction with their
> > >  * relative offset pointing to the actual function, therefore decode the
> > >  * instruction to find the real function.
> > >  */
> > > static __always_inline void *nocfi_ptr(void *func)
> > > {
> > >         union text_poke_insn insn = *(union text_poke_insn *)func;
> > >
> > >         return func + sizeof(insn) + insn.disp;
> > > }
> > >
> > > But really, that wants to be a compiler intrinsic.
> >
> > Agreed. We could easily do something similar on arm64, but I'd prefer
> > to avoid that too.
> 
> I'll see what we can do. Note that the compiler built-in we previously
> discussed would have semantics similar to function_nocfi(). It would
> return the raw function address from a symbol name, but it wouldn't
> decode the address from an arbitrary pointer, so this would require
> something different.

So I had a bit of a peek at what clang generates:

    3fa4:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi        3fa7: R_X86_64_32S      __SCK__x86_pmu_handle_irq
    3fab:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi        3fae: R_X86_64_32S      __SCT__x86_pmu_handle_irq.cfi_jt
    3fb2:       e8 00 00 00 00          call   3fb7 <init_hw_perf_events+0x1dc> 3fb3: R_X86_64_PLT32    __static_call_update-0x4

So this then gives the trampoline jump table entry to
__static_call_update(), with the result that it will rewrite the
jump-table entry, not the trampoline!

Now it so happens that the trampoline looks *exactly* like the
jump-table entry (one jmp.d32 instruction), so in that regards it'll
again 'work'.

But this is all really, as in *really*, wrong. And I'm really sad I'm
the one to have to discover this, even though I've mentioned
static_call()s being tricky in previous reviews.
Sami Tolvanen Oct. 30, 2021, 7:07 p.m. UTC | #29
On Fri, Oct 29, 2021 at 1:04 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Wed, Oct 27, 2021 at 08:50:17AM -0700, Sami Tolvanen wrote:
> > On Wed, Oct 27, 2021 at 7:18 AM Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > > > /*
> > > >  * Turns a Clang CFI jump-table entry into an actual function pointer.
> > > >  * These jump-table entries are simply jmp.d32 instruction with their
> > > >  * relative offset pointing to the actual function, therefore decode the
> > > >  * instruction to find the real function.
> > > >  */
> > > > static __always_inline void *nocfi_ptr(void *func)
> > > > {
> > > >         union text_poke_insn insn = *(union text_poke_insn *)func;
> > > >
> > > >         return func + sizeof(insn) + insn.disp;
> > > > }
> > > >
> > > > But really, that wants to be a compiler intrinsic.
> > >
> > > Agreed. We could easily do something similar on arm64, but I'd prefer
> > > to avoid that too.
> >
> > I'll see what we can do. Note that the compiler built-in we previously
> > discussed would have semantics similar to function_nocfi(). It would
> > return the raw function address from a symbol name, but it wouldn't
> > decode the address from an arbitrary pointer, so this would require
> > something different.
>
> So I had a bit of a peek at what clang generates:
>
>     3fa4:       48 c7 c7 00 00 00 00    mov    $0x0,%rdi        3fa7: R_X86_64_32S      __SCK__x86_pmu_handle_irq
>     3fab:       48 c7 c6 00 00 00 00    mov    $0x0,%rsi        3fae: R_X86_64_32S      __SCT__x86_pmu_handle_irq.cfi_jt
>     3fb2:       e8 00 00 00 00          call   3fb7 <init_hw_perf_events+0x1dc> 3fb3: R_X86_64_PLT32    __static_call_update-0x4
>
> So this then gives the trampoline jump table entry to
> __static_call_update(), with the result that it will rewrite the
> jump-table entry, not the trampoline!
>
> Now it so happens that the trampoline looks *exactly* like the
> jump-table entry (one jmp.d32 instruction), so in that regards it'll
> again 'work'.

Ugh, good catch!

> But this is all really, as in *really*, wrong. And I'm really sad I'm
> the one to have to discover this, even though I've mentioned
> static_call()s being tricky in previous reviews.

My bad, I didn't realize Clang does this with a typeof(func)
declaration. I'll make sure we have a reasonable fix for this before
the next version.

Sami
Andy Lutomirski Nov. 1, 2021, 4:13 a.m. UTC | #30
On 10/27/21 15:27, Kees Cook wrote:
> On Wed, Oct 27, 2021 at 11:21:26PM +0200, Peter Zijlstra wrote:
>> On Wed, Oct 27, 2021 at 10:11:28AM -0700, Kees Cook wrote:
>>> On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote:
>>>> [...]
>>>> cold_function()
>>>> {
>>>> 	...
>>>> 	global_foo->func1(args1);
>>>> 	...
>>>> }
>>>
>>> If global_foo is non-const, then the static call stuff is just an
>>> obfuscated indirect call.
>>
>> It is not. The target is determined once, at boot time, depending on the
>> hardware, it then never changes. The static_call() results in a direct
>> call to that function.
> 
> Right, I mean it is _effectively_ an indirect call in the sense that the
> call destination is under the control of a writable memory location.
> Yes, static_call_update() must be called to make the change, hence why I
> didn't wave my arms around when static_call was originally added. It's a
> net benefit for the kernel: actual indirect calls now become updatable
> direct calls. This means reachability becomes much harder for attackers;
> I'm totally a fan. :)

I think this means that function_nocfi() is the wrong abstraction.  To 
solve this properly, we want (sorry for fake C++ concept-ish syntax, but 
this would be a builtin):

template<function_pointer T> uintptr_t cfi_check_get_raw_address(T ptr);

You pass in a function pointer and you get out the address of the actual 
function body.  And this *also* emits the same type checking code that 
an actual call would emit.  So:

void (*ptr)(void);

ptr();

and:

void (*ptr)(void);

asm volatile ("call *%0" :: "rmi" (cfi_check_get_raw_address(ptr)));

would emit comparably safe code, except that the latter would actually 
read the code bytes and is missing clobbers and such.  And 
yes, this means that, without special support, the jump table can't live 
in XO memory.

void foo(void);
int (*ptr)(void) = (void *)foo;
cfi_check_get_raw_address(ptr);

would crash due to a CFI violation.

For what it's worth, I feel fairly strongly that we should not merge 
half-baked CFI support.  Years ago, we merged half-baked stack protector 
support in which we shoehorned in a totally inappropriate x86_32 gcc 
implementation into the kernel, and the result was a mess.  It's 
*finally* cleaned up, and that only happened because we got gcc to add 
the actual required support, then we waited a while, and then we dropped 
stack protector support for old gcc versions.  Yuck.
Kees Cook Nov. 2, 2021, 5:26 p.m. UTC | #31
On Thu, Oct 28, 2021 at 10:29:05PM +0200, Peter Zijlstra wrote:
> Now, since code (on x86) is variable length, there are no spare bits in
> the code address, but since static_call_key is aligned, we have spare
> bits. It is those bits we use to encode TAIL (Bit0) and INIT (Bit1).
> 
> If INIT, the address points to an __init section and we shouldn't try
> and touch if after those have been freed or bad stuff happens.
> 
> If TAIL, it's a tail-call and we get to write a jump instruction instead
> of a call instruction.

I think this is the part that I was missing: the information is about
the _address_, but it's stored in the _key_'s low bits (regardless of
the key's actual/masked key pointer).

> [...]
> Hope that clarifies things, instead of making it worse :-)

It does help, yes, thanks! I will need to read it again and go follow
along in the code, but yes, that helps explain it.