Message ID: 20211013181658.1020262-1-samitolvanen@google.com (mailing list archive)
Series: x86: Add support for Clang CFI
On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote: > This series adds support for Clang's Control-Flow Integrity (CFI) > checking to x86_64. With CFI, the compiler injects a runtime > check before each indirect function call to ensure the target is > a valid function with the correct static type. This restricts > possible call targets and makes it more difficult for an attacker > to exploit bugs that allow the modification of stored function > pointers. For more details, see: > > https://clang.llvm.org/docs/ControlFlowIntegrity.html > > Note that v5 is based on tip/master. The first two patches contain > objtool support for CFI, the remaining patches change function > declarations to use opaque types, fix type mismatch issues that > confuse the compiler, and disable CFI where it can't be used. x86 folks: I'd prefer this series went via -tip, but I can carry it for -next as well. What would you like to do here? I think it's ready. Thanks! -Kees
From: Sami Tolvanen <samitolvanen@google.com> Date: Wed, 13 Oct 2021 11:16:43 -0700 > This series adds support for Clang's Control-Flow Integrity (CFI) > checking to x86_64. With CFI, the compiler injects a runtime > check before each indirect function call to ensure the target is > a valid function with the correct static type. This restricts > possible call targets and makes it more difficult for an attacker > to exploit bugs that allow the modification of stored function > pointers. For more details, see: > > https://clang.llvm.org/docs/ControlFlowIntegrity.html > > Note that v5 is based on tip/master. The first two patches contain > objtool support for CFI, the remaining patches change function > declarations to use opaque types, fix type mismatch issues that > confuse the compiler, and disable CFI where it can't be used. > > You can also pull this series from > > https://github.com/samitolvanen/linux.git x86-cfi-v5 Hi, I found [0] while I was testing Peter's retpoline series, and wanted to ask / double-check whether that is because I'm using Clang CFI for x86 on unsupported Clang 12. I suppose it is fixed in 13? [0] https://lore.kernel.org/all/20211019094038.80569-1-alobakin@pm.me Thanks, Al
On Tue, Oct 19, 2021 at 3:06 AM Alexander Lobakin <alobakin@pm.me> wrote: > > From: Sami Tolvanen <samitolvanen@google.com> > Date: Wed, 13 Oct 2021 11:16:43 -0700 > > > This series adds support for Clang's Control-Flow Integrity (CFI) > > checking to x86_64. [ snip ] > > > > You can also pull this series from > > > > https://github.com/samitolvanen/linux.git x86-cfi-v5 > > Hi, > > I found [0] while I was testing Peter's retpoline series, and wanted to > ask / double-check whether that is because I'm using Clang CFI for x86 > on unsupported Clang 12. I suppose it is fixed in 13? > > [0] https://lore.kernel.org/all/20211019094038.80569-1-alobakin@pm.me No, it works exactly the same in later compiler versions. I also replied to that thread, but this looks like another instance where using an opaque type instead of a function declaration fixes the issue, and probably makes sense as the thunks are not directly callable from C code. Sami
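(A hedged aside on the pattern Sami mentions: one common way to make an asm entry point opaque to the compiler is to declare it with a non-function type, so taking its address yields the raw symbol rather than a CFI jump-table entry. The exact declarations used in the series may differ; this is a sketch of the general pattern, showing a before/after pair of alternative declarations:)

    /* Before: a function declaration. With CFI enabled, &thunk would
     * yield a jump-table entry rather than the thunk itself. */
    extern void __x86_indirect_thunk_rax(void);

    /* After (sketch): an opaque array declaration for the same symbol.
     * &thunk is now the raw address, and C code can no longer call it
     * directly, which is fine since it is only ever reached from asm. */
    extern unsigned char __x86_indirect_thunk_rax[];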
From: Sami Tolvanen <samitolvanen@google.com> Date: Wed, 13 Oct 2021 11:16:43 -0700 > This series adds support for Clang's Control-Flow Integrity (CFI) > checking to x86_64. With CFI, the compiler injects a runtime > check before each indirect function call to ensure the target is > a valid function with the correct static type. This restricts > possible call targets and makes it more difficult for an attacker > to exploit bugs that allow the modification of stored function > pointers. For more details, see: > > https://clang.llvm.org/docs/ControlFlowIntegrity.html > > Note that v5 is based on tip/master. The first two patches contain > objtool support for CFI, the remaining patches change function > declarations to use opaque types, fix type mismatch issues that > confuse the compiler, and disable CFI where it can't be used. > > You can also pull this series from > > https://github.com/samitolvanen/linux.git x86-cfi-v5 [ snip ] I've been using it since the end of May on my x86_64, so for v5 (with the retpoline thunk prototypes changed to opaque types): Reviewed-by: Alexander Lobakin <alobakin@pm.me> Tested-by: Alexander Lobakin <alobakin@pm.me> Thanks! Al
On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote: > This series adds support for Clang's Control-Flow Integrity (CFI) > checking to x86_64. With CFI, the compiler injects a runtime > check before each indirect function call to ensure the target is > a valid function with the correct static type. This restricts > possible call targets and makes it more difficult for an attacker > to exploit bugs that allow the modification of stored function > pointers. For more details, see: > > https://clang.llvm.org/docs/ControlFlowIntegrity.html So, if I understand this right, the compiler emits, for every function, two things: 1) the actual function and 2) a jump-table entry. Then, every time the address of a function is taken, 2) is given instead of the expected 1), right? But how does this work with things like static_call(), which we give a function address (now a jump-table entry) and use that to write direct call instructions? Should not this jump-table thingy get converted to an actual function address somewhere around arch_static_call_transform()? This also seems relevant for arm64 (which already has CLANG_CFI supported) given: https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org Or am I still not understanding this CFI thing?
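(A minimal user-space sketch of the check Peter describes, with invented symbol names for the jump-table bounds; the real compiler-generated check compares the target pointer against the per-type jump table, as the disassembly later in this thread shows:)

    typedef void (*fn_t)(int);

    /* Invented symbols standing in for the bounds of the jump table
     * that Clang emits for the type "void (int)". */
    extern char __typeid_void_int_jt_start[];
    extern char __typeid_void_int_jt_end[];

    static void indirect_call_checked(fn_t fn, int arg)
    {
            /* CFI check: the target must be an entry in the jump table. */
            if ((char *)fn < __typeid_void_int_jt_start ||
                (char *)fn >= __typeid_void_int_jt_end)
                    __builtin_trap();

            fn(arg);   /* fn points at a jump-table entry, itself executable */
    }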
From: Peter Zijlstra > Sent: 26 October 2021 21:16 > > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote: > > This series adds support for Clang's Control-Flow Integrity (CFI) > > checking to x86_64. [ snip ] > > So, if I understand this right, the compiler emits, for every function, > two things: 1) the actual function and 2) a jump-table entry. > > Then, every time the address of a function is taken, 2) is given instead > of the expected 1), right? > > But how does this work with things like static_call(), which we give a > function address (now a jump-table entry) and use that to write direct > call instructions? > > Should not this jump-table thingy get converted to an actual function > address somewhere around arch_static_call_transform()? This also seems > relevant for arm64 (which already has CLANG_CFI supported) given: > > https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org > > Or am I still not understanding this CFI thing? From what I remember the compiler adds code prior to every indirect jump to check that the function address is in the list of valid functions (with a suitable prototype - or some similar check). So it adds a run-time search to every indirect call. What would be more sensible would be a hidden extra parameter that is a hash of the prototype that is saved just before the entry point. Then the caller and called function could both check. That is still a moderate cost for an indirect call. Anything that can write asm can get around any check - it just gets a bit harder. But overwritten function pointers would be detected - which I assume is the main threat. David
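(David's "hash of the prototype" idea is worth a concrete illustration; a self-contained user-space sketch follows, with a made-up hash value and the hash stored in a descriptor next to the pointer rather than in front of the entry point as David proposes:)

    #include <stdint.h>
    #include <stdio.h>

    /* Made-up hash for the prototype void(int). */
    #define TYPE_HASH_VOID_INT 0x5a2b1c3du

    /* Descriptor pairing an entry point with its prototype hash; in the
     * real scheme the hash would live just before the function's entry
     * point instead. */
    struct fdesc {
            uint32_t type_hash;
            void (*fn)(int);
    };

    static void target(int x)
    {
            printf("called with %d\n", x);
    }

    static const struct fdesc target_desc = { TYPE_HASH_VOID_INT, target };

    static void call_checked(const struct fdesc *d, int arg)
    {
            if (d->type_hash != TYPE_HASH_VOID_INT)
                    __builtin_trap();   /* prototype mismatch */
            d->fn(arg);
    }

    int main(void)
    {
            call_checked(&target_desc, -1);
            return 0;
    }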
On Wed, Oct 27, 2021 at 10:02:56AM +0000, David Laight wrote: > From: Peter Zijlstra > > Sent: 26 October 2021 21:16 > > > > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote: > > > This series adds support for Clang's Control-Flow Integrity (CFI) > > > checking to x86_64. [ snip ] > > > > So, if I understand this right, the compiler emits, for every function, > > two things: 1) the actual function and 2) a jump-table entry. > > > > Then, every time the address of a function is taken, 2) is given instead > > of the expected 1), right? > > > > But how does this work with things like static_call(), which we give a > > function address (now a jump-table entry) and use that to write direct > > call instructions? > > > > Should not this jump-table thingy get converted to an actual function > > address somewhere around arch_static_call_transform()? This also seems > > relevant for arm64 (which already has CLANG_CFI supported) given: > > > > https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org > > > > Or am I still not understanding this CFI thing? > > From what I remember the compiler adds code prior to every indirect jump > to check that the function address is in the list of valid functions > (with a suitable prototype - or some similar check). It definitely mucks about with the address too; see here: https://lkml.kernel.org/r/YW6a67fGzM2AyHot@hirez.programming.kicks-ass.net I'm thinking static_call() wants the real actual function address before it writes instructions.
On Tue, Oct 26, 2021 at 10:16:22PM +0200, Peter Zijlstra wrote: > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote: > > This series adds support for Clang's Control-Flow Integrity (CFI) > > checking to x86_64. [ snip ] > > So, if I understand this right, the compiler emits, for every function, > two things: 1) the actual function and 2) a jump-table entry. > > Then, every time the address of a function is taken, 2) is given instead > of the expected 1), right? Yes, and we had to bodge around this with function_nocfi() to get the actual function address. Really there should be a compiler intrinsic or attribute for this, given the compiler has all the relevant information available. On arm64 we had to use inline asm to generate the address... Taking a step back, it'd be nicer if we didn't have the jump-table shim at all, and had some SW landing pad (e.g. a NOP with some magic bytes) in the callees that the caller could check for. Then function pointers would remain callable in all cases, and we could explicitly add landing pads to asm to protect those. I *think* that's what the grsecurity folk do, but I could be mistaken. > But how does this work with things like static_call(), which we give a > function address (now a jump-table entry) and use that to write direct > call instructions? > > Should not this jump-table thingy get converted to an actual function > address somewhere around arch_static_call_transform()? This also seems > relevant for arm64 (which already has CLANG_CFI supported) given: > > https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org Ugh, yeah, we'll need to do the function_nocfi() dance somewhere... Thanks, Mark.
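(For reference, the inline asm Mark refers to looks roughly like arm64's function_nocfi(); a sketch, assuming the kernel's __stringify() helper and that the symbol is visible to the assembler:)

    /* Load the raw address of a symbol with adrp/add, so the compiler
     * cannot substitute the CFI jump-table entry. Only works on literal
     * symbol names, not on function-pointer values. */
    #define function_nocfi(x) ({                                \
            void *addr;                                         \
            asm("adrp %0, " __stringify(x) "\n\t"               \
                "add  %0, %0, :lo12:" __stringify(x)            \
                : "=r" (addr));                                 \
            addr;                                               \
    })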
On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote: > > On Tue, Oct 26, 2021 at 10:16:22PM +0200, Peter Zijlstra wrote: > > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote: > > > This series adds support for Clang's Control-Flow Integrity (CFI) > > > checking to x86_64. [ snip ] > > > > So, if I understand this right, the compiler emits, for every function, > > two things: 1) the actual function and 2) a jump-table entry. > > > > Then, every time the address of a function is taken, 2) is given instead > > of the expected 1), right? > > Yes, and we had to bodge around this with function_nocfi() to get the > actual function address. > > Really there should be a compiler intrinsic or attribute for this, given > the compiler has all the relevant information available. On arm64 we > had to use inline asm to generate the address... > > Taking a step back, it'd be nicer if we didn't have the jump-table shim > at all, and had some SW landing pad (e.g. a NOP with some magic bytes) > in the callees that the caller could check for. Then function pointers > would remain callable in all cases, and we could explicitly add landing > pads to asm to protect those. I *think* that's what the grsecurity folk > do, but I could be mistaken. > > > But how does this work with things like static_call(), which we give a > > function address (now a jump-table entry) and use that to write direct > > call instructions? > > > > Should not this jump-table thingy get converted to an actual function > > address somewhere around arch_static_call_transform()? This also seems > > relevant for arm64 (which already has CLANG_CFI supported) given: > > > > https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere... > Sadly, that only works on symbol names, so we cannot use it to strip CFI-ness from void *func arguments passed into the static call API, unfortunately. Also, function_nocfi() seems broken in the sense that it relies on the symbol existing in the global namespace, which may not be true for function symbols with static linkage, as they can be optimized away entirely. I think the same might apply to function symbols with external linkage under LTO.
On Wed, Oct 27, 2021 at 01:05:15PM +0100, Mark Rutland wrote: > On Tue, Oct 26, 2021 at 10:16:22PM +0200, Peter Zijlstra wrote: > > On Wed, Oct 13, 2021 at 11:16:43AM -0700, Sami Tolvanen wrote: > > > This series adds support for Clang's Control-Flow Integrity (CFI) > > > checking to x86_64. [ snip ] > > > > So, if I understand this right, the compiler emits, for every function, > > two things: 1) the actual function and 2) a jump-table entry. > > > > Then, every time the address of a function is taken, 2) is given instead > > of the expected 1), right? > > Yes, and we had to bodge around this with function_nocfi() to get the > actual function address. The patch set under consideration seems to have forgotten to provide one for x86 :/ > Really there should be a compiler intrinsic or attribute for this, given > the compiler has all the relevant information available. On arm64 we > had to use inline asm to generate the address... Agreed, this *really* shouldn't be an arch asm hack trying to undo something the compiler did.
On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote: > On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote: > > > Should not this jump-table thingy get converted to an actual function > > > address somewhere around arch_static_call_transform()? This also seems > > > relevant for arm64 (which already has CLANG_CFI supported) given: > > > > > > https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org > > > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere... > > Sadly, that only works on symbol names, so we cannot use it to strip > CFI-ness from void *func arguments passed into the static call API, > unfortunately. Right, and while mostly static_call_update() is used, which is a macro and could possibly be used to wrap this, we very much rely on __static_call_update() also working without that wrapper, and then we're up a creek without paddles.
From: Mark Rutland > Sent: 27 October 2021 13:05 ... > Taking a step back, it'd be nicer if we didn't have the jump-table shim > at all, and had some SW landing pad (e.g. a NOP with some magic bytes) > in the callees that the caller could check for. Then function pointers > would remain callable in all cases, and we could explicitly add landing > pads to asm to protect those. I *think* that's what the grsecurity folk > do, but I could be mistaken. It doesn't need to be a 'landing pad'. The 'magic value' could be at 'label - 8'. Provided you can generate the required value it could be added to asm functions. (Or you could patch it at startup by stealing the value from a C function.) Depending on the threat model, you may even want the called function to do some sanity checks on the caller. I suspect that anything you do is easy to subvert by anything that can actually write asm. So if the real threat is overwritten function tables then something relatively simple is adequate. David
On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote: > On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote: > > On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote: > > > > > Should not this jump-table thingy get converted to an actual function > > > > address somewhere around arch_static_call_transform()? This also seems > > > > relevant for arm64 (which already has CLANG_CFI supported) given: > > > > > > > > https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org > > > > > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere... > > > > Sadly, that only works on symbol names, so we cannot use it to strip > > CFI-ness from void *func arguments passed into the static call API, > > unfortunately. > > Right, and while mostly static_call_update() is used, which is a macro > and could possibly be used to wrap this, we very much rely on > __static_call_update() also working without that wrapper, and then we're > up a creek without paddles. Specifically, we support code like:

struct foo {
        void (*func1)(args1);
        void (*func2)(args2);
};

struct foo global_foo;

...

DEFINE_STATIC_CALL_NULL(func1, *global_foo.func1);

...

__init foo_init()
{
        // whatever code that fills out foo

        static_call_update(func1, global_foo.func1);
}

...

hot_function()
{
        ...
        static_call(func1)(args1);
        ...
}

cold_function()
{
        ...
        global_foo->func1(args1);
        ...
}

And there is no way I can see this working sanely with CFI as presented. Even though the above uses static_call_update(), we can no longer use function_nocfi() on the @func argument, because it's not a symbol, it's already a function pointer. Nor can we fill global_foo.func1 with function_nocfi() because then cold_function() goes sideways.
On Wed, Oct 27, 2021 at 12:55:17PM +0000, David Laight wrote: > From: Mark Rutland > > Sent: 27 October 2021 13:05 > ... > > Taking a step back, it'd be nicer if we didn't have the jump-table shim > > at all, and had some SW landing pad (e.g. a NOP with some magic bytes) > > in the callees that the caller could check for. Then function pointers > > would remain callable in call cases, and we could explcitly add landing > > pads to asm to protect those. I *think* that's what the grsecurity folk > > do, but I could be mistaken. > > It doesn't need to be a 'landing pad'. > The 'magic value' could be at 'label - 8'. Sure; I'd intended to mean the general case of something at some fixed offset from the entrypoint, either before or after, potentially but not necessarily inline in the executed instruction stream. Mark.
On Wed, 27 Oct 2021 at 15:05, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote: > > On Wed, Oct 27, 2021 at 02:22:27PM +0200, Ard Biesheuvel wrote: > > > On Wed, 27 Oct 2021 at 14:05, Mark Rutland <mark.rutland@arm.com> wrote: > > > > > > > Should not this jump-table thingy get converted to an actual function > > > > > address somewhere around arch_static_call_transform()? This also seems > > > > > relevant for arm64 (which already has CLANG_CFI supported) given: > > > > > > > > > > https://lkml.kernel.org/r/20211025122102.46089-3-frederic@kernel.org > > > > > > > > Ugh, yeah, we'll need to do the function_nocfi() dance somewhere... > > > > > > > > > > Sadly, that only works on symbol names, so we cannot use it to strip > > > CFI-ness from void *func arguments passed into the static call API, > > > unfortunately. > > > > Right, and while mostly static_call_update() is used, which is a macro > > and could possibly be used to wrap this, we very much rely on > > __static_call_update() also working without that wrapper, and then we're > > up a creek without paddles. > > Specifically, we support code like: > > struct foo { > void (*func1)(args1); > void (*func2)(args2); > } > > struct foo global_foo; > > ... > > DEFINE_STATIC_CALL_NULL(func1, *global_foo.func1); > > ... > > __init foo_init() > { > // whatever code that fills out foo > > static_call_update(func1, global_foo.func1); > } > > ... > > hot_function() > { > ... > static_call(func1)(args1); > ... > } > > cold_function() > { > ... > global_foo->func1(args1); > ... > } > > And there is no way I can see this working sanely with CFI as presented. > > Even though the above uses static_call_update(), we can no longer use > function_nocfi() on the @func argument, because it's not a symbol, it's > already a function pointer. > > Nor can we fill global_foo.func1 with function_nocfi() because then > cold_function() goes sideways. > As far as I can tell from playing around with Clang, the stubs can actually be executed directly, they just jump to the actual function. The compiler simply generates a jump table for each prototype that appears in the code as the target of an indirect jump, and checks whether the target appears in the list. E.g., the code below

void foo(void) {}
void bar(int) {}
void baz(int) {}
void (* volatile fn1)(void) = foo;
void (* volatile fn2)(int) = bar;

int main(int argc, char *argv[])
{
        fn1();
        fn2 = baz;
        fn2(-1);
}

produces

0000000000400594 <foo.cfi>:
  400594: d65f03c0      ret

0000000000400598 <bar.cfi>:
  400598: d65f03c0      ret

000000000040059c <baz.cfi>:
  40059c: d65f03c0      ret

00000000004005a0 <main>:
  4005a0: a9bf7bfd      stp x29, x30, [sp, #-16]!

  // First indirect call
  4005a4: b0000088      adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
  4005a8: f9401508      ldr x8, [x8, #40]
  4005ac: 90000009      adrp x9, 400000 <__abi_tag-0x278>
  4005b0: 91182129      add x9, x9, #0x608
  4005b4: 910003fd      mov x29, sp
  4005b8: eb09011f      cmp x8, x9
  4005bc: 54000241      b.ne 400604 <main+0x64>  // b.any
  4005c0: d63f0100      blr x8

  // Assignment of fn2
  4005c4: 90000009      adrp x9, 400000 <__abi_tag-0x278>
  4005c8: b0000088      adrp x8, 411000 <__libc_start_main@GLIBC_2.17>
  4005cc: 91184129      add x9, x9, #0x610
  4005d0: f9001909      str x9, [x8, #48]

  // Second indirect call
  4005d4: f9401908      ldr x8, [x8, #48]
  4005d8: 90000009      adrp x9, 400000 <__abi_tag-0x278>
  4005dc: 91183129      add x9, x9, #0x60c
  4005e0: cb090109      sub x9, x8, x9
  4005e4: 93c90929      ror x9, x9, #2
  4005e8: f100053f      cmp x9, #0x1
  4005ec: 540000c8      b.hi 400604 <main+0x64>  // b.pmore
  4005f0: 12800000      mov w0, #0xffffffff  // #-1
  4005f4: d63f0100      blr x8

  4005f8: 2a1f03e0      mov w0, wzr
  4005fc: a8c17bfd      ldp x29, x30, [sp], #16
  400600: d65f03c0      ret
  400604: d4200020      brk #0x1

0000000000400608 <__typeid__ZTSFvvE_global_addr>:
  400608: 17ffffe3      b 400594 <foo.cfi>

000000000040060c <__typeid__ZTSFviE_global_addr>:
  40060c: 17ffffe3      b 400598 <bar.cfi>
  400610: 17ffffe3      b 40059c <baz.cfi>

So it looks like taking the address is fine, although not optimal due to the additional jump. We could fudge around that by checking the opcode at the target of the call, or token-pasting ".cfi" after the symbol name in the static_call_update() macro, but it doesn't look like anything is terminally broken tbh.
On Wed, Oct 27, 2021 at 03:30:11PM +0200, Ard Biesheuvel wrote: > As far as I can tell from playing around with Clang, the stubs can > actually be executed directly, I had just finished reading the clang docs, which suggest as much, and was about to try what the compiler actually generates. > they just jump to the actual function. > The compiler simply generates a jump table for each prototype that > appears in the code as the target of an indirect jump, and checks > whether the target appears in the list. > > E.g., the code below > > void foo(void) {} > void bar(int) {} > void baz(int) {} > void (* volatile fn1)(void) = foo; > void (* volatile fn2)(int) = bar; > > int main(int argc, char *argv[]) > { > fn1(); > fn2 = baz; > fn2(-1); > } > > produces > > 0000000000400594 <foo.cfi>: > 400594: d65f03c0 ret > > 0000000000400598 <bar.cfi>: > 400598: d65f03c0 ret > > 000000000040059c <baz.cfi>: > 40059c: d65f03c0 ret Right, so these are the actual functions ^. > 00000000004005a0 <main>: > 4005a0: a9bf7bfd stp x29, x30, [sp, #-16]! > > // First indirect call > 4005a4: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17> > 4005a8: f9401508 ldr x8, [x8, #40] > 4005ac: 90000009 adrp x9, 400000 <__abi_tag-0x278> > 4005b0: 91182129 add x9, x9, #0x608 > 4005b4: 910003fd mov x29, sp > 4005b8: eb09011f cmp x8, x9 > 4005bc: 54000241 b.ne 400604 <main+0x64> // b.any > 4005c0: d63f0100 blr x8 That's impenetrable to me, sorry. > // Assignment of fn2 > 4005c4: 90000009 adrp x9, 400000 <__abi_tag-0x278> > 4005c8: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17> > 4005cc: 91184129 add x9, x9, #0x610 > 4005d0: f9001909 str x9, [x8, #48] I'm struggling here, x9 points to the branch at 400610, but then what? x8 is in .data somewhere? > // Second indirect call > 4005d4: f9401908 ldr x8, [x8, #48] > 4005d8: 90000009 adrp x9, 400000 <__abi_tag-0x278> > 4005dc: 91183129 add x9, x9, #0x60c > 4005e0: cb090109 sub x9, x8, x9 > 4005e4: 93c90929 ror x9, x9, #2 > 4005e8: f100053f cmp x9, #0x1 > 4005ec: 540000c8 b.hi 400604 <main+0x64> // b.pmore > 4005f0: 12800000 mov w0, #0xffffffff // #-1 > 4005f4: d63f0100 blr x8 > > 4005f8: 2a1f03e0 mov w0, wzr > 4005fc: a8c17bfd ldp x29, x30, [sp], #16 > 400600: d65f03c0 ret > 400604: d4200020 brk #0x1 > 0000000000400608 <__typeid__ZTSFvvE_global_addr>: > 400608: 17ffffe3 b 400594 <foo.cfi> > > 000000000040060c <__typeid__ZTSFviE_global_addr>: > 40060c: 17ffffe3 b 400598 <bar.cfi> > 400610: 17ffffe3 b 40059c <baz.cfi> And these are the stubs per type. > So it looks like taking the address is fine, although not optimal due > to the additional jump. Right. > We could fudge around that by checking the > opcode at the target of the call, or token-pasting ".cfi" after the > symbol name in the static_call_update() macro, but it doesn't look > like anything is terminally broken tbh. Agreed, since the jump table entries are actually executable it 'works'. I really don't like that extra jump though, so I think I really do want that nocfi_ptr() thing. And going by: https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls and the above, that might be possible (on x86) with something like:

/*
 * Turns a Clang CFI jump-table entry into an actual function pointer.
 * These jump-table entries are simply jmp.d32 instructions with their
 * relative offset pointing to the actual function, therefore decode the
 * instruction to find the real function.
 */
static __always_inline void *nocfi_ptr(void *func)
{
        union text_poke_insn insn = *(union text_poke_insn *)func;

        return func + sizeof(insn) + insn.disp;
}

But really, that wants to be a compiler intrinsic.
On Wed, 27 Oct 2021 at 16:03, Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Oct 27, 2021 at 03:30:11PM +0200, Ard Biesheuvel wrote: > > > As far as I can tell from playing around with Clang, the stubs can > > actually be executed directly, > > I had just finished reading the clang docs, which suggest as much, and was > about to try what the compiler actually generates. > > > they just jump to the actual function. > > The compiler simply generates a jump table for each prototype that > > appears in the code as the target of an indirect jump, and checks > > whether the target appears in the list. > > > > E.g., the code below > > > > void foo(void) {} > > void bar(int) {} > > void baz(int) {} > > void (* volatile fn1)(void) = foo; > > void (* volatile fn2)(int) = bar; > > > > int main(int argc, char *argv[]) > > { > > fn1(); > > fn2 = baz; > > fn2(-1); > > } > > > > produces > > > > 0000000000400594 <foo.cfi>: > > 400594: d65f03c0 ret > > > > 0000000000400598 <bar.cfi>: > > 400598: d65f03c0 ret > > > > 000000000040059c <baz.cfi>: > > 40059c: d65f03c0 ret > > Right, so these are the actual functions ^. > > > 00000000004005a0 <main>: > > 4005a0: a9bf7bfd stp x29, x30, [sp, #-16]! > > > > // First indirect call > > 4005a4: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17> > > 4005a8: f9401508 ldr x8, [x8, #40] > > 4005ac: 90000009 adrp x9, 400000 <__abi_tag-0x278> > > 4005b0: 91182129 add x9, x9, #0x608 > > 4005b4: 910003fd mov x29, sp > > 4005b8: eb09011f cmp x8, x9 > > 4005bc: 54000241 b.ne 400604 <main+0x64> // b.any > > 4005c0: d63f0100 blr x8 > > That's impenetrable to me, sorry. > This loads the value of fn1 in x8, and takes the address of the jump table in x9. Since it is only one entry long, it does a simple compare to check whether x8 appears in the jump table, and branches to the BRK at the end if they are different. > > // Assignment of fn2 > > 4005c4: 90000009 adrp x9, 400000 <__abi_tag-0x278> > > 4005c8: b0000088 adrp x8, 411000 <__libc_start_main@GLIBC_2.17> > > 4005cc: 91184129 add x9, x9, #0x610 > > 4005d0: f9001909 str x9, [x8, #48] > > I'm struggling here, x9 points to the branch at 400610, but then what? > > x8 is in .data somewhere? > This takes the address of the jump table entry of 'baz' in x9, and stores it in fn2 whose address is taken in x8. > > > // Second indirect call > > 4005d4: f9401908 ldr x8, [x8, #48] > > 4005d8: 90000009 adrp x9, 400000 <__abi_tag-0x278> > > 4005dc: 91183129 add x9, x9, #0x60c > > 4005e0: cb090109 sub x9, x8, x9 > > 4005e4: 93c90929 ror x9, x9, #2 > > 4005e8: f100053f cmp x9, #0x1 > > 4005ec: 540000c8 b.hi 400604 <main+0x64> // b.pmore > > 4005f0: 12800000 mov w0, #0xffffffff // #-1 > > 4005f4: d63f0100 blr x8 > > > > > > 4005f8: 2a1f03e0 mov w0, wzr > > 4005fc: a8c17bfd ldp x29, x30, [sp], #16 > > 400600: d65f03c0 ret > > 400604: d4200020 brk #0x1 > > 0000000000400608 <__typeid__ZTSFvvE_global_addr>: > > 400608: 17ffffe3 b 400594 <foo.cfi> > > > > 000000000040060c <__typeid__ZTSFviE_global_addr>: > > 40060c: 17ffffe3 b 400598 <bar.cfi> > > 400610: 17ffffe3 b 40059c <baz.cfi> > > And these are the stubs per type. > > > So it looks like taking the address is fine, although not optimal due > > to the additional jump. > > Right. > ... although it does seem that function_nocfi() doesn't actually work as expected, given that we want the address of <func>.cfi and not the address of the stub. > > We could fudge around that by checking the > > opcode at the target of the call, or token-pasting ".cfi" after the > > symbol name in the static_call_update() macro, but it doesn't look > > like anything is terminally broken tbh. > > Agreed, since the jump table entries are actually executable it 'works'. > > I really don't like that extra jump though, so I think I really do want > that nocfi_ptr() thing. And going by: > > https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls > > and the above, that might be possible (on x86) with something like: > > /* > * Turns a Clang CFI jump-table entry into an actual function pointer. > * These jump-table entries are simply jmp.d32 instructions with their > * relative offset pointing to the actual function, therefore decode the > * instruction to find the real function. > */ > static __always_inline void *nocfi_ptr(void *func) > { > union text_poke_insn insn = *(union text_poke_insn *)func; > > return func + sizeof(insn) + insn.disp; > } > > But really, that wants to be a compiler intrinsic. Agreed. We could easily do something similar on arm64, but I'd prefer to avoid that too.
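(The arm64 analogue Ard alludes to could decode the unconditional branch at the jump-table entry; a hedged sketch, not kernel code, assuming the entry really is a single B instruction:)

    #include <stdint.h>

    /*
     * Sketch: recover the real function address from an arm64 CFI
     * jump-table entry, which is a single unconditional branch (B).
     * Returns the pointer unchanged if it doesn't look like one.
     */
    static inline void *nocfi_ptr_arm64(void *func)
    {
            uint32_t insn = *(uint32_t *)func;
            int64_t off;

            if ((insn & 0xfc000000u) != 0x14000000u)    /* not a B insn */
                    return func;

            /* imm26 is a signed offset in words; sign-extend and scale */
            off = ((int64_t)(insn & 0x03ffffffu) << 38) >> 36;
            return (char *)func + off;
    }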
On Wed, Oct 27, 2021 at 04:18:17PM +0200, Ard Biesheuvel wrote: > On Wed, 27 Oct 2021 at 16:03, Peter Zijlstra <peterz@infradead.org> wrote: > > /* > > * Turns a Clang CFI jump-table entry into an actual function pointer. > > * These jump-table entries are simply jmp.d32 instructions with their > > * relative offset pointing to the actual function, therefore decode the > > * instruction to find the real function. > > */ > > static __always_inline void *nocfi_ptr(void *func) > > { > > union text_poke_insn insn = *(union text_poke_insn *)func; also, probably, for the paranoid amongst us:

        if (WARN_ON_ONCE(insn.opcode != JMP32_INSN_OPCODE))
                return func;

> > return func + sizeof(insn) + insn.disp; > > } > > > > But really, that wants to be a compiler intrinsic. > > Agreed. We could easily do something similar on arm64, but I'd prefer > to avoid that too. Right, because on x86 CET-IBT will force that entry to have a different form (and size); similar on arm64 with BTI. I was thinking the compiler really should implicitly do this conversion when a function pointer is cast to an integer type. But barring that, we really need an intrinsic to perform this. Also, perhaps the compiler should admit it's doing dodgy crap and introduce the notion of address spaces and use the type system to separate these two forms.
On Wed, Oct 27, 2021 at 7:18 AM Ard Biesheuvel <ardb@kernel.org> wrote: [ snip ] > ... although it does seem that function_nocfi() doesn't actually work > as expected, given that we want the address of <func>.cfi and not the > address of the stub. This is because the example wasn't compiled with -fno-sanitize-cfi-canonical-jump-tables, which we use in the kernel. With non-canonical jump tables, <func> continues to point to the function and <func>.cfi_jt points to the jump table, and therefore, function_nocfi() returns the raw function address. > > > We could fudge around that by checking the > > > opcode at the target of the call, or token-pasting ".cfi" after the > > > symbol name in the static_call_update() macro, but it doesn't look > > > like anything is terminally broken tbh. > > > > Agreed, since the jump table entries are actually executable it 'works'. > > > > I really don't like that extra jump though, so I think I really do want > > that nocfi_ptr() thing. And going by: > > > > https://clang.llvm.org/docs/ControlFlowIntegrityDesign.html#forward-edge-cfi-for-indirect-function-calls > > > > and the above, that might be possible (on x86) with something like: [ snip ] > > But really, that wants to be a compiler intrinsic. > > Agreed. We could easily do something similar on arm64, but I'd prefer > to avoid that too. I'll see what we can do. Note that the compiler built-in we previously discussed would have semantics similar to function_nocfi(). It would return the raw function address from a symbol name, but it wouldn't decode the address from an arbitrary pointer, so this would require something different. Sami
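(To spell out the two symbol layouts Sami contrasts, per the Clang documentation and his explanation above:)

    /*
     * Canonical jump tables (Clang's default):
     *   foo.cfi : the actual function body
     *   foo     : the jump-table entry, a branch to foo.cfi
     *
     * Non-canonical jump tables (-fno-sanitize-cfi-canonical-jump-tables,
     * what the kernel uses):
     *   foo        : the actual function body
     *   foo.cfi_jt : the jump-table entry, a branch to foo
     *
     * So with the kernel's flags, taking a symbol's address by name in
     * asm yields the raw function; only C function pointers are rewritten
     * to point at the jump table.
     */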
On Wed, 27 Oct 2021 at 17:50, Sami Tolvanen <samitolvanen@google.com> wrote: [ snip ] > This is because the example wasn't compiled with > -fno-sanitize-cfi-canonical-jump-tables, which we use in the kernel. > With non-canonical jump tables, <func> continues to point to the > function and <func>.cfi_jt points to the jump table, and therefore, > function_nocfi() returns the raw function address. Ah, excellent. So that means we don't need function_nocfi() at all, given that:

- statically allocated references (i.e., DEFINE_STATIC_CALL()) will refer to the function directly;
- runtime assignments can decode the target of the *func pointer and strip off the initial branch.

It would still be nice to have an intrinsic for that, or some variable attribute that signifies that assigning the address of a function to it will produce the actual function rather than the jump table entry. [ snip ] > I'll see what we can do. Note that the compiler built-in we previously > discussed would have semantics similar to function_nocfi(). It would > return the raw function address from a symbol name, but it wouldn't > decode the address from an arbitrary pointer, so this would require > something different. > > Sami
On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote: > On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote: [ snip ] > Specifically, we support code like: > > struct foo { > void (*func1)(args1); > void (*func2)(args2); > } > > struct foo global_foo; And global_foo is intentionally non-const? > > ... > > DEFINE_STATIC_CALL_NULL(func1, *global_foo.func1); > > ... > > __init foo_init() > { > // whatever code that fills out foo > > static_call_update(func1, global_foo.func1); > } > > ... > > hot_function() > { > ... > static_call(func1)(args1); > ... > } > > cold_function() > { > ... > global_foo->func1(args1); > ... > } If global_foo is non-const, then the static call stuff is just an obfuscated indirect call. The attack CFI attempts to block is having a controlled write flaw turn into controlled execution. For example, in the above, an attacker would use a flaw that could aim a write to global_foo->func1, and then get the kernel to take an execution path that executes global_foo->func1 (i.e. through cold_function). If static_call_update() accepts an arbitrary function argument, then it needs to perform the same validation that is currently being injected at indirect call sites to avoid having static calls weaken CFI. If things are left alone, static calls will just insert a direct call to the jump table, and things "work" (with an extra jump), but nothing is actually protected (it just requires that the attacker locate a narrower set of kernel call flows that includes static_call_update). Any attacker write to global_foo->func1, followed by a static_call_update(), will create a direct call (with no CFI checking) to the attacker's destination. (It's likely harder to find the needed execution path to induce a static_call_update(), but that shouldn't stop us from trying to protect it.) Getting a "jump table to actual function" primitive only avoids the added jump -- all the CFI checking remains bypassed. If static call function address storage isn't const, for CFI to work as expected the update() routine will need to do the same checking that is done at indirect call sites when extracting the "real" function for writing into a direct call. (i.e. it will need to be function-prototype aware, be able to find the real matching jump table and size, and verify the given function is actually in the table. Just dereferencing the jump table to find the address isn't a protection: the attacker just has to write to func1 with a pointer to an attacker-constructed jump table.) I think it's not unreasonable to solve this in phases, though. Given that static calls work with CFI as-is (albeit with an extra jump), and that static calls do offer some hardening of existing indirect call targets (by narrowing the execution path needed to actually bring the func1 value into the kernel execution flow), I don't see either feature blocking the other. Adding proper CFI function prototype checking to static calls seems like a security improvement with or without CFI, and a minor performance improvement under CFI. To avoid all of this, though, it'd be better if static calls only switched between one of a per-site const list of possible functions, which would make it a much tighter hand-rolled CFI system itself. :) (i.e. it would operate from a specific and short list of expected functions rather than the "best effort" approach of matching function prototypes as done by Clang CFI.)
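(A sketch of the per-site allowlist Kees describes; the names and API are hypothetical, not existing kernel interfaces:)

    #include <stddef.h>

    /* Hypothetical implementations this one site may switch between. */
    static void impl_a(int arg) { (void)arg; /* ... */ }
    static void impl_b(int arg) { (void)arg; /* ... */ }

    static void (* const func1_allowed[])(int) = { impl_a, impl_b };

    /* Validate the target against the const list before patching. */
    static int func1_update_checked(void (*fn)(int))
    {
            size_t i;

            for (i = 0; i < sizeof(func1_allowed) / sizeof(func1_allowed[0]); i++) {
                    if (fn == func1_allowed[i]) {
                            /* static_call_update(func1, fn); */
                            return 0;
                    }
            }
            return -1;      /* not on the allowlist: reject the update */
    }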
On Wed, Oct 27, 2021 at 10:11:28AM -0700, Kees Cook wrote: > On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote: > > On Wed, Oct 27, 2021 at 02:48:52PM +0200, Peter Zijlstra wrote: [ snip ] > > > Right, and while mostly static_call_update() is used, which is a macro > > > and could possibly be used to wrap this, we very much rely on > > > __static_call_update() also working without that wrapper, and then we're > > > up a creek without paddles. > > > > Specifically, we support code like: > > > > struct foo { > > void (*func1)(args1); > > void (*func2)(args2); > > } > > > > struct foo global_foo; > > And global_foo is intentionally non-const? Yep, since depending on the init function it can discover and stuff in a wild variety of functions. [ snip ] > > cold_function() > > { > > ... > > global_foo->func1(args1); > > ... > > } > > If global_foo is non-const, then the static call stuff is just an > obfuscated indirect call. It is not. The target is determined once, at boot time, depending on the hardware; it then never changes. The static_call() results in a direct call to that function. > The attack CFI attempts to block is having > a controlled write flaw turn into controlled execution. For example, > in the above, an attacker would use a flaw that could aim a write to > global_foo->func1, and then get the kernel to take an execution path > that executes global_foo->func1 (i.e. through cold_function). I know, and CFI works for cold_function. > If static_call_update() accepts an arbitrary function argument, then it > needs to perform the same validation that is currently being injected > at indirect call sites to avoid having static calls weaken CFI. static_call_update() is a macro and has compile-time signature checks; the actual function is __static_call_update() and someone can go add extra validation in there if they so desire. I did have this patch: https://lkml.kernel.org/r/20210904105529.GA5106@worktop.programming.kicks-ass.net but I never did get around to finishing it. Although looking at it now, I suppose static_call_seal() might be a better name. And you're worried about __static_call_update() over text_poke_bp() because? > Getting a "jump table to actual function" primitive only avoids the added > jump -- all the CFI checking remains bypassed. Exactly, so the extra jump serves no purpose and needs to go. Doubly so because people are looking at static_call() to undo some of the performance damage introduced by CFI :-) > If static call function > address storage isn't const, for CFI to work as expected the update() > routine will need to do the same checking that is done at indirect call > sites when extracting the "real" function for writing into a direct call. I've mentioned static_call like a hundred times in these CFI threads... if you want to do CFI on them, go ahead. I'm just not sure the C type system is up for that; you'll have to somehow frob the signature symbol into __static_call_update(), something like __builtin_type_symbol(). > To avoid all of this, though, it'd be better if static calls only > switched between one of a per-site const list of possible functions, > which would make it a much tighter hand-rolled CFI system itself. :) > (i.e. it would operate from a specific and short list of expected > functions rather than the "best effort" approach of matching function > prototypes as done by Clang CFI.) That sounds like a ton of painful ugly.
From: Mark Rutland > Sent: 27 October 2021 14:18 > > On Wed, Oct 27, 2021 at 12:55:17PM +0000, David Laight wrote: > > From: Mark Rutland > > > Sent: 27 October 2021 13:05 > > ... > > > Taking a step back, it'd be nicer if we didn't have the jump-table shim > > > at all, and had some SW landing pad (e.g. a NOP with some magic bytes) > > > in the callees that the caller could check for. Then function pointers > > > would remain callable in all cases, and we could explicitly add landing > > > pads to asm to protect those. I *think* that's what the grsecurity folk > > > do, but I could be mistaken. > > > > It doesn't need to be a 'landing pad'. > > The 'magic value' could be at 'label - 8'. > > Sure; I'd intended to mean the general case of something at some fixed > offset from the entrypoint, either before or after, potentially but not > necessarily inline in the executed instruction stream. What you really want is to be able to read the value using the I-cache so as not to pollute the D-cache with code bytes and to avoid having both an I-cache and D-cache miss at the same time for the same memory. Even if the I-cache read took an extra clock (or two) I suspect it would be an overall gain. This is also true for code that uses pc-relative instructions to read constants - common in arm64. Not sure any hardware lets you do that though :-( David
On Wed, Oct 27, 2021 at 11:21:26PM +0200, Peter Zijlstra wrote: > On Wed, Oct 27, 2021 at 10:11:28AM -0700, Kees Cook wrote: > > On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote: > > > [...] > > > cold_function() > > > { > > > ... > > > global_foo.func1(args1); > > > ... > > > } > > > > If global_foo is non-const, then the static call stuff is just an > > obfuscated indirect call. > > It is not. The target is determined once, at boot time, depending on the > hardware, it then never changes. The static_call() results in a direct > call to that function. Right, I mean it is _effectively_ an indirect call in the sense that the call destination is under the control of a writable memory location. Yes, static_call_update() must be called to make the change, hence why I didn't wave my arms around when static_call was originally added. It's a net benefit for the kernel: actual indirect calls now become updatable direct calls. This means reachability becomes much harder for attackers; I'm totally a fan. :) > > If static_call_update() accepts an arbitrary function argument, then it > > needs to perform the same validation that is currently being injected > > at indirect call sites to avoid having static calls weaken CFI. > > static_call_update() is a macro and has compile time signature checks, > the actual function is __static_call_update() and someone can go add > extra validation in there if they so desire. > > I did have this patch: > > https://lkml.kernel.org/r/20210904105529.GA5106@worktop.programming.kicks-ass.net > > but I never did get around to finishing it. Although looking at it now, > I suppose static_call_seal() might be a better name. Right -- though wouldn't just adding __ro_after_init do the same? DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init; If you wanted the clean WARN_ON, __static_call_update() could check if the struct static_call_key is in a non-writable region. > And you're worried about __static_call_update() over text_poke_bp() > because? Both have a meaningful lack of exposure to common attacker-reachable code paths (e.g., it's likely rare that there are ways attackers can control the target/content of a text_poke_bp(): alternatives and ftrace, both of which tend to use read-only content). static_call_update() is limited per-site with a fixed set of hardcoded targets (i.e. any place static_call() is used), but the content is effectively arbitrary (coming from writable memory). AIUI, it's intended to be used more widely than text_poke_bp(), and it seems like call sites using static_call_update() will become less rare in future kernels. (Currently it seems mostly focused on sched and pmu, which don't traditionally have much userspace control). So, they're kind of opposite, but for me the question is about examining the changes to reachability and what kind of primitives their existence provides an attacker. IMO, static_calls are a net gain over indirect calls (from some usually writable function pointer) because they become direct calls. Their risk doesn't match a "real" direct call, though, since they do still have the (much narrower) risk of having something call static_call_update() from a writable function pointer as in your example, but that's still a defensively better position than a regular indirect call. What I'm hoping to see from static_calls is maybe defaulting to being __ro_after_init, and explicitly marking those that can't be, which makes auditing easier and puts things into a default-safe state (i.e. 
both fixed targets and fixed content). > > Getting a "jump table to actual function" primitive only avoids the added > > jump -- all the CFI checking remains bypassed. > > Exactly, so the extra jump serves no purpose and needs to go. Doubly so > because people are looking at static_call() to undo some of the > performance damage introduced by CFI :-) Well, sure, it's inefficient, not _broken_. :) And can you point to the "use static_calls because CFI is slow" discussion? I'd be curious to read through that; the past performance testing had shown that CFI performance overhead tends to be less than the gains of LTO. So compared to a "stock" kernel, things should be about the same if not faster. That said, I understand I'm talking to you, and you're paying very close attention to the scheduler, etc. :) I'm sure there are specific workloads that look terrible under CFI! > > If static call function > > address storage isn't const, for CFI to work as expected the update() > > routine will need to do the same checking that is done at indirect call > > sites when extracting the "real" function for writing into a direct call. > > I've mentioned static_call like a hundred times in these CFI threads.. > if you want to do CFI on them, go ahead. I'm just not sure the C type > system is up for that, you'll have to somehow frob the signature symbol > into __static_call_update(), something like __builtin_type_symbol(). > > > To avoid all of this, though, it'd be better if static calls only > > switched between one of a per-site const list of possible functions, > > which would make it a much tighter hand-rolled CFI system itself. :) > > (i.e. it would operate from a specific and short list of expected > > functions rather than the "best effort" approach of matching function > > prototypes as done by Clang CFI.) > > That sounds like a ton of painful ugly. Right, I know. That's why I'm saying that I see these features as certainly related, but not at odds with each other. CFI doesn't protect static_call sites, but static_call sites are already better protected than their earlier indirect calls. The only note I had was that if static_call wants to dig into the jump table, it likely needs to be static_call-specific: we don't want a general way to do that without knowing we have some reasonable bounds on the behavior of the code using it. I think it's fine to have static_calls do this, though it'd be nice if they could be a little more defensive (generally) at the same time.
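A sketch of the "per-site const list" idea Kees floats -- purely illustrative, with hypothetical target functions, assuming kernel context for ARRAY_SIZE -- showing how an update routine could be restricted to a fixed candidate set:

	/* Illustrative: the only values this site may ever be updated to. */
	static void (* const func1_targets[])(void) = {
		func1_intel, func1_amd, func1_generic,	/* hypothetical */
	};

	static bool func1_target_valid(void (*func)(void))
	{
		size_t i;

		for (i = 0; i < ARRAY_SIZE(func1_targets); i++)
			if (func == func1_targets[i])
				return true;
		return false;
	}

An update path built on this would WARN and refuse anything not in the list, trading flexibility for a far smaller set of reachable targets than prototype-based CFI allows.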
On Wed, Oct 27, 2021 at 03:27:59PM -0700, Kees Cook wrote: > Right -- though wouldn't just adding __ro_after_init do the same? > > DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init; That breaks modules (and your jump_label patch doing the same is similarly broken). When a module is loaded that uses the static_call(), it needs to register its .static_call_sites range with the static_call_key, which requires modifying it.
On Thu, Oct 28, 2021 at 01:09:39PM +0200, Peter Zijlstra wrote: > On Wed, Oct 27, 2021 at 03:27:59PM -0700, Kees Cook wrote: > > > Right -- though wouldn't just adding __ro_after_init do the same? > > > > DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init; > > That breaks modules (and your jump_label patch doing the same is > similarly broken). Well that's no fun. :) I'd like to understand this better so I can fix it! > > When a module is loaded that uses the static_call(), it needs to > register its .static_call_sites range with the static_call_key, which > requires modifying it. Reading static_call_add_module() leaves me with even more questions. ;) It looks like module static calls need to write to kernel text? I don't understand. Is this when a module is using a non-module key for a call site? And in that case, this happens: key |= s_key & STATIC_CALL_SITE_FLAGS; Where "key" is not in the module? And the flags can be: #define STATIC_CALL_SITE_TAIL 1UL /* tail call */ #define STATIC_CALL_SITE_INIT 2UL /* init section */ But aren't these per-site attributes? Why are they stored per-key? if (!init && static_call_is_init(site)) continue; ... arch_static_call_transform(site_addr, NULL, func, static_call_is_tail(site));
On Thu, Oct 28, 2021 at 10:12:32AM -0700, Kees Cook wrote: > On Thu, Oct 28, 2021 at 01:09:39PM +0200, Peter Zijlstra wrote: > > On Wed, Oct 27, 2021 at 03:27:59PM -0700, Kees Cook wrote: > > > > > Right -- though wouldn't just adding __ro_after_init do the same? > > > > > > DEFINE_STATIC_CALL(static_call_name, func_a) __ro_after_init; > > > > That breaks modules (and your jump_label patch doing the same is > > similarly broken). > > Well that's no fun. :) I'd like to understand this better so I can fix > it! > > > > > > When a module is loaded that uses the static_call(), it needs to > > register its .static_call_sites range with the static_call_key, which > > requires modifying it. > > Reading static_call_add_module() leaves me with even more questions. ;) Yes, that function is highly magical.. > It looks like module static calls need to write to kernel text? No, they need to modify the static_call_key though. > I don't > understand. Is this when a module is using a non-module key for a call > site? And in that case, this happens: > > key |= s_key & STATIC_CALL_SITE_FLAGS; > > Where "key" is not in the module? > > And the flags can be: > > #define STATIC_CALL_SITE_TAIL 1UL /* tail call */ > #define STATIC_CALL_SITE_INIT 2UL /* init section */ > > But aren't these per-site attributes? Why are they stored per-key? They are per site, but stored in the key pointer. So static_call has (and jump_label is nearly identical): struct static_call_site { s32 addr; s32 key; }; struct static_call_mod { struct static_call_mod *next; struct module *mod; struct static_call_site *sites; }; struct static_call_key { void *func; union { unsigned long type; struct static_call_mod *mods; struct static_call_site *sites; }; }; __SCT__##name() trampolines (no analog with jump_label) .static_call_sites section .static_call_tramp_key section (no analog with jump_label) Where the key holds the current function pointer and a pointer to either an array of static_call_site or a pointer to a static_call_mod. Now, a key observation is that all these data structures are word aligned, which means we have at least 2 lsb bits to play with. For static_call_key::{mods,sites} the LSB indicates which, 0:mods, 1:sites. Then the .static_call_sites section is an array of struct static_call_site sorted by the static_call_key pointer. The static_call_site holds relative displacements, but represents: struct static_call_key *key; unsigned long call_address; Now, since code (on x86) is variable length, there are no spare bits in the code address, but since static_call_key is aligned, we have spare bits. It is those bits we use to encode TAIL (Bit0) and INIT (Bit1). If INIT, the address points to an __init section and we shouldn't try and touch it after those have been freed or bad stuff happens. If TAIL, it's a tail-call and we get to write a jump instruction instead of a call instruction. So, objtool builds .static_call_sites at build time, then at init (or module load) time we sort the array by static_call_key pointer, such that we get consecutive ranges per key. We iterate the array and every time the key pointer changes, we -- already having the key pointer -- set key->sites to the first site of that range. Now, kernel init of static_call happens *really* early and memory allocation doesn't work yet, which is why we have that {mods,sites} thing. Therefore, when the first module gets loaded, we need to allocate a struct static_call_mod for the kernel (mod==NULL) and transfer the sites pointer to it and change key to a mods pointer. 
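In code, the bit-encoding just described corresponds roughly to the following helpers (a sketch paraphrased from kernel/static_call.c-style code; exact names and details vary by tree):

	/* Both fields are relative displacements from their own location. */
	static void *static_call_addr(struct static_call_site *site)
	{
		return (void *)((long)site->addr + (long)&site->addr);
	}

	static struct static_call_key *static_call_key(const struct static_call_site *site)
	{
		return (void *)(((long)site->key + (long)&site->key) &
				~STATIC_CALL_SITE_FLAGS);
	}

	/* The per-site attributes live in the spare low bits of ::key. */
	static bool static_call_is_init(struct static_call_site *site)
	{
		return ((long)site->key + (long)&site->key) & STATIC_CALL_SITE_INIT;
	}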
So one possible solution would be to have a late init (but before RO) that re-iterates the sites array and pre-allocates the kernel static_call_mod structure. That way, static_call_key gets changed to a mods pointer and wouldn't ever need changing after that, only the static_call_mod (which isn't RO) gets changed when modules get added/deleted. The above is basically identical to jump_labels. However static_call() has one more trick: EXPORT_STATIC_CALL_TRAMP() That exports the trampoline symbol, but not the static_call_key data structure. The result is that modules can use the static_call(), but cannot use static_call_update() because they cannot get at the key. In this case objtool cannot correctly put the static_call_key address in the static_call_site; what it does instead is store the trampoline address (there's a 1:1 relation between keys and trampolines). And then we use the .static_call_tramp_key section to find a mapping from trampoline to key and rewrite the site to be 'right'. All this happens before sorting it on key, obv. Hope that clarifies things, instead of making it worse :-)
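For completeness, the export split works roughly like this (a simplified sketch of the static_call_types.h macros; ARCH_ADD_TRAMP_KEY is what feeds the .static_call_tramp_key section mentioned above):

	/* Modules get both symbols: they can call *and* update. */
	#define EXPORT_STATIC_CALL(name)				\
		EXPORT_SYMBOL(STATIC_CALL_KEY(name));			\
		EXPORT_SYMBOL(STATIC_CALL_TRAMP(name))

	/* Modules only get the trampoline: call, but no update. */
	#define EXPORT_STATIC_CALL_TRAMP(name)				\
		EXPORT_SYMBOL(STATIC_CALL_TRAMP(name));			\
		ARCH_ADD_TRAMP_KEY(name)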
On Wed, Oct 27, 2021 at 08:50:17AM -0700, Sami Tolvanen wrote: > On Wed, Oct 27, 2021 at 7:18 AM Ard Biesheuvel <ardb@kernel.org> wrote: > > > /* > > > * Turns a Clang CFI jump-table entry into an actual function pointer. > > > * These jump-table entries are simply jmp.d32 instructions with their > > > * relative offset pointing to the actual function, therefore decode the > > > * instruction to find the real function. > > > */ > > > static __always_inline void *nocfi_ptr(void *func) > > > { > > > union text_poke_insn insn = *(union text_poke_insn *)func; > > > > > > return func + sizeof(insn) + insn.disp; > > > } > > > > > > But really, that wants to be a compiler intrinsic. > > > > Agreed. We could easily do something similar on arm64, but I'd prefer > > to avoid that too. > > I'll see what we can do. Note that the compiler built-in we previously > discussed would have semantics similar to function_nocfi(). It would > return the raw function address from a symbol name, but it wouldn't > decode the address from an arbitrary pointer, so this would require > something different. So I had a bit of a peek at what clang generates: 3fa4: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 3fa7: R_X86_64_32S __SCK__x86_pmu_handle_irq 3fab: 48 c7 c6 00 00 00 00 mov $0x0,%rsi 3fae: R_X86_64_32S __SCT__x86_pmu_handle_irq.cfi_jt 3fb2: e8 00 00 00 00 call 3fb7 <init_hw_perf_events+0x1dc> 3fb3: R_X86_64_PLT32 __static_call_update-0x4 So this then gives the trampoline jump table entry to __static_call_update(), with the result that it will rewrite the jump-table entry, not the trampoline! Now it so happens that the trampoline looks *exactly* like the jump-table entry (one jmp.d32 instruction), so in that regard it'll again 'work'. But this is all really, as in *really*, wrong. And I'm really sad I'm the one to have to discover this, even though I've mentioned static_call()s being tricky in previous reviews.
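For readers following the nocfi_ptr() sketch quoted above: union text_poke_insn is roughly the following x86 helper (paraphrased from arch/x86/include/asm/text-patching.h), which is what lets the 5-byte jmp.d32 be viewed as opcode plus 32-bit displacement:

	union text_poke_insn {
		u8 text[POKE_MAX_OPCODE_SIZE];	/* 5 bytes on x86 */
		struct {
			u8 opcode;
			s32 disp;
		} __attribute__((packed));
	};

So for a jump-table entry of the form e9 xx xx xx xx, the real function lives at func + 5 + disp -- exactly the decode nocfi_ptr() performs.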
On Fri, Oct 29, 2021 at 1:04 PM Peter Zijlstra <peterz@infradead.org> wrote: > > On Wed, Oct 27, 2021 at 08:50:17AM -0700, Sami Tolvanen wrote: > > On Wed, Oct 27, 2021 at 7:18 AM Ard Biesheuvel <ardb@kernel.org> wrote: > > > > > /* > > > > * Turns a Clang CFI jump-table entry into an actual function pointer. > > > > * These jump-table entries are simply jmp.d32 instructions with their > > > > * relative offset pointing to the actual function, therefore decode the > > > > * instruction to find the real function. > > > > */ > > > > static __always_inline void *nocfi_ptr(void *func) > > > > { > > > > union text_poke_insn insn = *(union text_poke_insn *)func; > > > > > > > > return func + sizeof(insn) + insn.disp; > > > > } > > > > > > > > But really, that wants to be a compiler intrinsic. > > > > > > Agreed. We could easily do something similar on arm64, but I'd prefer > > > to avoid that too. > > > > I'll see what we can do. Note that the compiler built-in we previously > > discussed would have semantics similar to function_nocfi(). It would > > return the raw function address from a symbol name, but it wouldn't > > decode the address from an arbitrary pointer, so this would require > > something different. > > So I had a bit of a peek at what clang generates: > > 3fa4: 48 c7 c7 00 00 00 00 mov $0x0,%rdi 3fa7: R_X86_64_32S __SCK__x86_pmu_handle_irq > 3fab: 48 c7 c6 00 00 00 00 mov $0x0,%rsi 3fae: R_X86_64_32S __SCT__x86_pmu_handle_irq.cfi_jt > 3fb2: e8 00 00 00 00 call 3fb7 <init_hw_perf_events+0x1dc> 3fb3: R_X86_64_PLT32 __static_call_update-0x4 > > So this then gives the trampoline jump table entry to > __static_call_update(), with the result that it will rewrite the > jump-table entry, not the trampoline! > > Now it so happens that the trampoline looks *exactly* like the > jump-table entry (one jmp.d32 instruction), so in that regard it'll > again 'work'. Ugh, good catch! > But this is all really, as in *really*, wrong. And I'm really sad I'm > the one to have to discover this, even though I've mentioned > static_call()s being tricky in previous reviews. My bad, I didn't realize Clang does this with a typeof(func) declaration. I'll make sure we have a reasonable fix for this before the next version. Sami
On 10/27/21 15:27, Kees Cook wrote: > On Wed, Oct 27, 2021 at 11:21:26PM +0200, Peter Zijlstra wrote: >> On Wed, Oct 27, 2021 at 10:11:28AM -0700, Kees Cook wrote: >>> On Wed, Oct 27, 2021 at 03:04:55PM +0200, Peter Zijlstra wrote: >>>> [...] >>>> cold_function() >>>> { >>>> ... >>>> global_foo.func1(args1); >>>> ... >>>> } >>> >>> If global_foo is non-const, then the static call stuff is just an >>> obfuscated indirect call. >> >> It is not. The target is determined once, at boot time, depending on the >> hardware, it then never changes. The static_call() results in a direct >> call to that function. > > Right, I mean it is _effectively_ an indirect call in the sense that the > call destination is under the control of a writable memory location. > Yes, static_call_update() must be called to make the change, hence why I > didn't wave my arms around when static_call was originally added. It's a > net benefit for the kernel: actual indirect calls now become updatable > direct calls. This means reachability becomes much harder for attackers; > I'm totally a fan. :) I think this means that function_nocfi() is the wrong abstraction. To solve this properly, we want (sorry for fake C++ concept-ish syntax, but this would be a builtin): template<function_pointer T> uintptr_t cfi_check_get_raw_address(T ptr); You pass in a function pointer and you get out the address of the actual function body. And this *also* emits the same type checking code that an actual call would emit. So: void (*ptr)(void); ptr(); and: void (*ptr)(void); asm volatile ("call *%0" :: "rmi" (cfi_check_get_raw_address(ptr))); would emit comparably safe code, except that the latter would actually read the code bytes and is missing clobbers and such. And yes, this means that, without special support, the jump table can't live in XO memory. void foo(void); int (*ptr)(void) = (void *)foo; cfi_check_get_raw_address(ptr); would crash due to a CFI violation. For what it's worth, I feel fairly strongly that we should not merge half-baked CFI support. Years ago, we merged half-baked stack protector support in which we shoehorned a totally inappropriate x86_32 gcc implementation into the kernel, and the result was a mess. It's *finally* cleaned up, and that only happened because we got gcc to add the actual required support, then we waited a while, and then we dropped stack protector support for old gcc versions. Yuck.
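As a concrete illustration of the semantics Andy is proposing -- everything here is hypothetical, including __cfi_type_matches() and cfi_check_fail(), which stand in for compiler-generated logic -- such a builtin might lower to the equivalent of:

	/* Hypothetical expansion of cfi_check_get_raw_address(ptr) on x86. */
	static inline uintptr_t cfi_check_get_raw_address(void (*ptr)(void))
	{
		/* 1) the same type check an indirect call through ptr would emit */
		if (!__cfi_type_matches(ptr))		/* hypothetical */
			cfi_check_fail((void *)ptr);	/* hypothetical */

		/* 2) decode the jump-table jmp.d32 to find the function body;
		 *    this read is why the jump table couldn't be execute-only. */
		return (uintptr_t)ptr + 5 + *(const s32 *)((const u8 *)ptr + 1);
	}

The key property is that step 1 is never skipped: extracting the raw address stays exactly as safe as making the call itself.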
On Thu, Oct 28, 2021 at 10:29:05PM +0200, Peter Zijlstra wrote: > Now, since code (on x86) is variable length, there are no spare bits in > the code address, but since static_call_key is aligned, we have spare > bits. It is those bits we use to encode TAIL (Bit0) and INIT (Bit1). > > If INIT, the address points to an __init section and we shouldn't try > and touch it after those have been freed or bad stuff happens. > > If TAIL, it's a tail-call and we get to write a jump instruction instead > of a call instruction. I think this is the part that I was missing: the information is about the _address_, but it's stored in the low bits of the _key_ pointer (which get masked off to recover the actual key). > [...] > Hope that clarifies things, instead of making it worse :-) It does help, yes, thanks! I will need to read it again and go follow along in the code, but yes, that helps explain it.