
[RFC,01/11] x86: kernel FineIBT

Message ID: 20220420004241.2093-2-joao@overdrivepizza.com (mailing list archive)
State: Changes Requested
Series: Kernel FineIBT Support

Commit Message

Joao Moreira April 20, 2022, 12:42 a.m. UTC
From: Joao Moreira <joao@overdrivepizza.com>

Make kernel code compatible with being compiled with FineIBT.
- Set FineIBT defines.
- Mark functions reached from assembly as coarse-grained.
- Add FineIBT handler function, which is invoked on policy violations.

Signed-off-by: Joao Moreira <joao@overdrivepizza.com>
---
 arch/x86/entry/entry_64.S            |   1 +
 arch/x86/include/asm/ibt.h           |  16 ++++
 arch/x86/include/asm/setup.h         |  12 +--
 arch/x86/include/asm/special_insns.h |   4 +-
 arch/x86/kernel/cpu/common.c         |   2 +-
 arch/x86/kernel/fineibt.c            | 123 +++++++++++++++++++++++++++
 arch/x86/kernel/head64.c             |  12 +--
 arch/x86/kernel/smpboot.c            |   2 +-
 8 files changed, 156 insertions(+), 16 deletions(-)
 create mode 100644 arch/x86/kernel/fineibt.c

Comments

Josh Poimboeuf April 29, 2022, 1:37 a.m. UTC | #1
On Tue, Apr 19, 2022 at 05:42:31PM -0700, joao@overdrivepizza.com wrote:
> +void __noendbr __fineibt_handler(void){
> +	unsigned i;
> +	unsigned long flags;
> +	bool skip;
> +	void * ret;
> +	void * caller;
> +
> +	DO_ALL_PUSHS;

So this function isn't C ABI compliant, right? e.g. the compiler just
calls the handler without regard for preserving registers?

If this function is going to be implemented in C, it should probably
have an asm thunk wrapper which can properly save/restore the registers
before calling into the C version.
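
Something in this direction, maybe -- a completely untested sketch with
made-up symbol names, which just saves the call-clobbered registers
before entering the C part:

	/*
	 * Hypothetical thunk: preserve the C-ABI call-clobbered registers
	 * around the real (plain C) handler.  Flags etc. not handled.
	 */
	asm(
	"	.pushsection .text\n"
	"	.type __fineibt_handler_thunk, @function\n"
	"__fineibt_handler_thunk:\n"
	"	push %rax;  push %rcx;  push %rdx\n"
	"	push %rsi;  push %rdi\n"
	"	push %r8;   push %r9;   push %r10;  push %r11\n"
	"	call __fineibt_handler_c\n"
	"	pop  %r11;  pop  %r10;  pop  %r9;   pop  %r8\n"
	"	pop  %rdi;  pop  %rsi\n"
	"	pop  %rdx;  pop  %rcx;  pop  %rax\n"
	"	ret\n"
	"	.size __fineibt_handler_thunk, . - __fineibt_handler_thunk\n"
	"	.popsection\n"
	);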

Even better, if the compiler did an invalid op (UD2?), which I think you
mentioned elsewhere, instead of calling the handler directly, and there
were a way for the trap code to properly detect it as a FineIBT
violation, we could get rid of the pushes/pops, plus the uaccess objtool
warning from patch 7, plus I'm guessing a bunch of noinstr validation
warnings.

> +
> +	spin_lock_irqsave(&fineibt_lock, flags);
> +	skip = false;
> +
> +	asm("\t movq 0x90(%%rsp),%0" : "=r"(ret));
> +	asm("\t movq 0x98(%%rsp),%0" : "=r"(caller));

This is making some questionable assumptions about the stack layout.

I assume this function is still in the prototype stage ;-)

> +	if(!skip) {
> +		printk("FineIBT violation: %px:%px:%u\n", ret, caller,
> +				vlts_next);
> +	}
> +	DO_ALL_POPS;
> +}

Right now this handler just does a printk if it hasn't already for this
caller/callee combo, and then resumes control.  Which is fine for
debugging, but it really needs to behave similarly to an IBT violation,
by panicking unless "ibt=warn" on the cmdline.

Not sure what would happen for "ibt=off"?  Maybe apply_ibt_endbr() could
NOP out all the FineIBT stuff.
Joao Moreira May 2, 2022, 5:17 p.m. UTC | #2
On 2022-04-28 18:37, Josh Poimboeuf wrote:
> On Tue, Apr 19, 2022 at 05:42:31PM -0700, joao@overdrivepizza.com 
> wrote:
>> +void __noendbr __fineibt_handler(void){
>> +	unsigned i;
>> +	unsigned long flags;
>> +	bool skip;
>> +	void * ret;
>> +	void * caller;
>> +
>> +	DO_ALL_PUSHS;
> 
> So this function isn't C ABI compliant, right? e.g. the compiler just
> calls the handler without regard for preserving registers?
> 
> If this function is going to be implemented in C, it should probably
> have an asm thunk wrapper which can properly save/restore the registers
> before calling into the C version.
> 
> Even better, if the compiler did an invalid op (UD2?), which I think 
> you
> mentioned elsewhere, instead of calling the handler directly, and there
> were a way for the trap code to properly detect it as a FineIBT
> violation, we could get rid of the pushes/pops, plus the uaccess 
> objtool
> warning from patch 7, plus I'm guessing a bunch of noinstr validation
> warnings.

Cool, I'll try to come up with something!

> 
>> +
>> +	spin_lock_irqsave(&fineibt_lock, flags);
>> +	skip = false;
>> +
>> +	asm("\t movq 0x90(%%rsp),%0" : "=r"(ret));
>> +	asm("\t movq 0x98(%%rsp),%0" : "=r"(caller));
> 
> This is making some questionable assumptions about the stack layout.
> 
> I assume this function is still in the prototype stage ;-)

Yeah, this is just a messy instrumentation to get reports about 
mismatching prototypes (as the ones reported further down the series).

The issue with having the call is that it bloats the binary, so the ud2 
is 3-bytes-per-function better. Yet, we may consider a FINEIBT_DEBUG 
config, which can then enable a handler. This would be useful together 
with a fuzzer or a stress tool to cover possible control-flow paths 
within the kernel and map mismatching prototypes more properly I guess.

> 
>> +	if(!skip) {
>> +		printk("FineIBT violation: %px:%px:%u\n", ret, caller,
>> +				vlts_next);
>> +	}
>> +	DO_ALL_POPS;
>> +}
> 
> Right now this handler just does a printk if it hasn't already for this
> caller/callee combo, and then resumes control.  Which is fine for
> debugging, but it really needs to behave similarly to an IBT violation,
> by panicking unless "ibt=warn" on the cmdline.
> 
> Not sure what would happen for "ibt=off"?  Maybe apply_ibt_endbr() 
> could
> NOP out all the FineIBT stuff.

Either that, or...

I'm thinking about a way to have FineIBT interchangeable with KCFI. 
Currently KCFI adds a 4 byte hash + 2 byte nops before function entry, 
to allow for proper prototype checking. After that, there should be an 
ENDBR of 4 bytes. This gives us 10 bytes in total. Then, my yet to be 
properly thought idea would be patch these 10 bytes with:

endbr
call fineibt_handler_<$HASH>
nop

and then, on the caller side, patch the "cmp <$HASH>, -0x6(%r11); je; 
ud2; call" sequence with a "sub 0x6, r11; mov $HASH, %r10; call %r11, 
add 0x6 %r11". This would then allow the kernel to verify if the CPU is 
IBT capable on boot time and only then setting the proper scheme.

The downside of having something like this is that the sub r11/add r11
sequence is kinda meh. We can avoid that by having two padding nops
after the original ENDBR, which will be skipped when the function is
reached directly by the linker optimization I'm working on, and which we
can convert into a JMP -offset that makes control flow reach the padding
area before the prologue, from where we can call the fineibt_handler
function. The resulting instrumentation would be something like:

1:
call fineibt_handler_<$HASH>
jmp 2f
<foo>
endbr
jmp 1b
2:

Also, it would prevent a paranoid user from having both schemes
simultaneously (there are reasons why people could want that).

Any thoughts?
Josh Poimboeuf May 3, 2022, 10:02 p.m. UTC | #3
On Mon, May 02, 2022 at 10:17:42AM -0700, Joao Moreira wrote:
> > > +	asm("\t movq 0x90(%%rsp),%0" : "=r"(ret));
> > > +	asm("\t movq 0x98(%%rsp),%0" : "=r"(caller));
> > 
> > This is making some questionable assumptions about the stack layout.
> > 
> > I assume this function is still in the prototype stage ;-)
> 
> Yeah, this is just a messy instrumentation to get reports about mismatching
> prototypes (as the ones reported further down the series).
> 
> The issue with having the call is that it bloats the binary, so the ud2 is
> 3-bytes-per-function better. Yet, we may consider a FINEIBT_DEBUG config,
> which can then enable a handler. This would be useful together with a fuzzer
> or a stress tool to cover possible control-flow paths within the kernel and
> map mismatching prototypes more properly I guess.

It should be possible to have a non-fatal #UD2 handler.

See for example how WARN() is implemented with __WARN_FLAGS in
arch/x86/include/asm/bug.h .

So hopefully we can just get rid of the need for the "call handler"
thing altogether.
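
Roughly along these lines on the #UD path -- a sketch only; the helper
and how it gets wired into the trap code are made up, and a real version
would need a way to tell a FineIBT/kCFI ud2 apart from the WARN()/BUG()
ones:

	#include <linux/ptrace.h>
	#include <linux/uaccess.h>

	/* hypothetical helper for the invalid-opcode handler */
	static bool is_fineibt_violation(struct pt_regs *regs)
	{
		u8 insn[2];

		/*
		 * The caller-side check ends in ud2 (0f 0b).  Checking the
		 * opcode alone is obviously not enough; the surrounding
		 * bytes (or a dedicated table) would have to identify it
		 * as the CFI check sequence.
		 */
		if (copy_from_kernel_nofault(insn, (void *)regs->ip, sizeof(insn)))
			return false;

		return insn[0] == 0x0f && insn[1] == 0x0b;
	}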

> > Not sure what would happen for "ibt=off"?  Maybe apply_ibt_endbr() could
> > NOP out all the FineIBT stuff.
> 
> Either that, or...
> 
> I'm thinking about a way to have FineIBT interchangeable with KCFI.
> Currently KCFI adds a 4 byte hash + 2 byte nops before function entry, to
> allow for proper prototype checking. After that, there should be an ENDBR of
> 4 bytes. This gives us 10 bytes in total. Then, my yet to be properly
> thought idea would be patch these 10 bytes with:
> 
> endbr
> call fineibt_handler_<$HASH>
> nop
> 
> and then, on the caller side, patch the "cmp <$HASH>, -0x6(%r11); je; ud2;
> call" sequence with a "sub 0x6, r11; mov $HASH, %r10; call %r11, add 0x6
> %r11". This would then allow the kernel to verify if the CPU is IBT capable
> on boot time and only then setting the proper scheme.
> 
> The downsides of having something like this would be that this sub r11/add
> r11 sequence is kinda meh. We can avoid that by having two padding nops
> after the original ENDBR, which will be skipped when the function is reached
> directly by the linker optimization I'm working on, and that we can convert
> into a JMP -offset that makes control flow reach the padding area before the
> prologue and from where we can call the fineibt_handler function. The
> resulting instrumentation would be something like:
> 
> 1:
> call fineibt_handler_<$HASH>
> jmp 2f
> <foo>
> endbr
> jmp 1b
> 2:
> 
> Also, it would prevent a paranoid user to have both schemes simultaneously
> (there are reasons why people could want that).
> 
> Any thoughts?

I'm not really qualified to comment on this too directly since I haven't
looked very much at the variations on FineIBT/CFI/KCFI, and what the
protections and drawbacks are for each approach, and when it might even
make sense to combine them for a "paranoid user".

Since we have multiple similar and possibly competing technologies being
discussed, one thing I do want to warn against is that we as kernel
developers tend to err on the side of giving people too many choices and
combinations which *never* get used.

All those unused options can confuse the user and significantly add to
complexity and maintenance overhead for us.  Especially for invasive
features like these.

(Just to be clear, I'm not saying that's happening here, but it's
something we need to be careful about.)

Here, documentation is going to be crucial, for both reviewers and
users.  Something that describes when/why I should use X or Y or X+Y.

If we truly want to add more options/combos for different use cases then
we'll also need clear and concise documentation about which
options/combos would be used under what circumstances.
Joao Moreira May 4, 2022, 2:19 a.m. UTC | #4
> 
> It should be possible to have a non-fatal #UD2 handler.
> 
> See for example how WARN() is implemented with __WARN_FLAGS in
> arch/x86/include/asm/bug.h .
> 
> So hopefully we can just get rid of the need for the "call handler"
> thing altogether.
> 
Nice, I'll look into it. Tks.

>> > Not sure what would happen for "ibt=off"?  Maybe apply_ibt_endbr() could
>> > NOP out all the FineIBT stuff.
>> 
>> Either that, or...
>> 
>> I'm thinking about a way to have FineIBT interchangeable with KCFI.
>> Currently KCFI adds a 4 byte hash + 2 byte nops before function entry, 
>> to
>> allow for proper prototype checking. After that, there should be an 
>> ENDBR of
>> 4 bytes. This gives us 10 bytes in total. Then, my yet to be properly
>> thought idea would be patch these 10 bytes with:
>> 
>> endbr
>> call fineibt_handler_<$HASH>
>> nop
>> 
>> and then, on the caller side, patch the "cmp <$HASH>, -0x6(%r11); je; 
>> ud2;
>> call" sequence with a "sub 0x6, r11; mov $HASH, %r10; call %r11, add 
>> 0x6
>> %r11". This would then allow the kernel to verify if the CPU is IBT 
>> capable
>> on boot time and only then setting the proper scheme.
>> 
>> The downsides of having something like this would be that this sub 
>> r11/add
>> r11 sequence is kinda meh. We can avoid that by having two padding 
>> nops
>> after the original ENDBR, which will be skipped when the function is 
>> reached
>> directly by the linker optimization I'm working on, and that we can 
>> convert
>> into a JMP -offset that makes control flow reach the padding area 
>> before the
>> prologue and from where we can call the fineibt_handler function. The
>> resulting instrumentation would be something like:
>> 
>> 1:
>> call fineibt_handler_<$HASH>
>> jmp 2f
>> <foo>
>> endbr
>> jmp 1b
>> 2:
>> 
>> Also, it would prevent a paranoid user to have both schemes 
>> simultaneously
>> (there are reasons why people could want that).
>> 
>> Any thoughts?
> 
> I'm not really qualified to comment on this too directly since I 
> haven't
> looked very much at the variations on FineIBT/CFI/KCFI, and what the
> protections and drawbacks are for each approach, and when it might even
> make sense to combine them for a "paranoid user".
> 
> Since we have multiple similar and possibly competing technologies 
> being
> discussed, one thing I do want to warn against is that we as kernel
> developers tend to err on the side of giving people too many choices 
> and
> combinations which *never* get used.
> 
> All those unused options can confuse the user and significantly add to
> complexity and maintenance overhead for us.  Especially for invasive
> features like these.
> 
> (Just to be clear, I'm not saying that's happening here, but it's
> something we need to be careful about.)
> 
> Here, documentation is going to be crucial, for both reviewers and
> users.  Something that describes when/why I should use X or Y or X+Y.
> 
> If we truly want to add more options/combos for different use cases 
> then
> we'll also need clear and concise documentation about which
> options/combos would be used under what circumstances.

Yeah, I totally understand/support this concern and I feel the same way. 
While, in this case, I can't see super clear reasons for X+Y, there are 
aspects why someone could prefer X or Y -- so I think that using 
alternatives to flip the instrumentation is a valid consideration. For
the record, to be fair on the credits: using alternatives to replace
KCFI/FineIBT was also Peter's idea, not mine. It looked hard to do at
first sight because of the differences in the caller/callee-side checks,
but since Peter mentioned it, I started trying to solve the puzzle of
finding the most suitable instrumentation that would still be changeable. I 
haven't discussed this with anyone yet, but at this point I think it 
might be doable, although not in the most performant shape. Anyway, I'll 
post something here once I have a more solid idea.

And yes, I agree that documentation will be key and I totally see your 
point/understand how confusing I was in my previous e-mail. I'll keep 
that in mind for the next revision. Thanks for pointing it out :)
Peter Zijlstra May 4, 2022, 10:20 a.m. UTC | #5
On Tue, May 03, 2022 at 03:02:44PM -0700, Josh Poimboeuf wrote:

> I'm not really qualified to comment on this too directly since I haven't
> looked very much at the variations on FineIBT/CFI/KCFI, and what the
> protections and drawbacks are for each approach, and when it might even
> make sense to combine them for a "paranoid user".
> 
> Since we have multiple similar and possibly competing technologies being
> discussed, one thing I do want to warn against is that we as kernel
> developers tend to err on the side of giving people too many choices and
> combinations which *never* get used.

So I don't think there's going to be a user choice here. If there's
hardware support, FineIBT makes more sense. That also means that kCFI no
longer needs to worry about IBT.

If we do something like:


        kCFI                                            FineIBT

__cfi_\sym:                                     __cfi_\sym:
        endbr                           # 4             endbr                   # 4
        sub $hash, %r10                 # 7             sub $hash, %r10         # 7
        je \sym                         # 2             je \sym                 # 2
        ud2                             # 2             ud2                     # 2
\sym:                                           \sym:


caller:                                         caller:
        cmpl $hash, -8(%r11)            # 8             movl $hash, %r10d       # 6
        je 1f                           # 2             sub 15, %r11            # 4
        ud2                             # 2             call *%r11              # 3
1:      call __x86_indirect_thunk_r11   # 5             .nop 4                  # 4 (could even fix up r11 again)


Then, all that's required is a slight tweak to apply_retpolines() to
rewrite a little more text.

Note that this also does away with having to fix up the linker, since
all direct call will already point at \sym. It's just the IBT indirect
calls that need to frob the pointer in order to hit the ENDBR.

On top of that, we no longer have to special case the objtool
instruction decoder, since the prelude consists of proper instructions now.
Peter Collingbourne May 4, 2022, 5:04 p.m. UTC | #6
On Wed, May 04, 2022 at 12:20:19PM +0200, Peter Zijlstra wrote:
> On Tue, May 03, 2022 at 03:02:44PM -0700, Josh Poimboeuf wrote:
> 
> > I'm not really qualified to comment on this too directly since I haven't
> > looked very much at the variations on FineIBT/CFI/KCFI, and what the
> > protections and drawbacks are for each approach, and when it might even
> > make sense to combine them for a "paranoid user".
> > 
> > Since we have multiple similar and possibly competing technologies being
> > discussed, one thing I do want to warn against is that we as kernel
> > developers tend to err on the side of giving people too many choices and
> > combinations which *never* get used.
> 
> So I don't think there's going to be a user choice here. If there's
> hardware support, FineIBT makes more sense. That also means that kCFI no
> longer needs to worry about IBT.
> 
> If we do something like:
> 
> 
>         kCFI                                            FineIBT
> 
> __cfi_\sym:                                     __cfi_\sym:
>         endbr                           # 4             endbr                   # 4
>         sub $hash, %r10                 # 7             sub $hash, %r10         # 7
>         je \sym                         # 2             je \sym                 # 2
>         ud2                             # 2             ud2                     # 2
> \sym:                                           \sym:
> 
> 
> caller:                                         caller:
>         cmpl $hash, -8(%r11)            # 8             movl $hash, %r10d       # 6
>         je 1f                           # 2             sub 15, %r11            # 4
>         ud2                             # 2             call *%r11              # 3
> 1:      call __x86_indirect_thunk_r11   # 5             .nop 4                  # 4 (could even fix up r11 again)
> 
> 
> Then, all that's required is a slight tweak to apply_retpolines() to
> rewrite a little more text.
> 
> Note that this also does away with having to fix up the linker, since
> all direct call will already point at \sym. It's just the IBT indirect
> calls that need to frob the pointer in order to hit the ENDBR.
> 
> On top of that, we no longer have to special case the objtool
> instruction decoder, the prelude are proper instructions now.

For kCFI this brings back the gadget problem that I mentioned here:
https://lore.kernel.org/all/Yh7fLRYl8KgMcOe5@google.com/

because the hash at the call site is 8 bytes before the call
instruction.

Peter
Peter Zijlstra May 4, 2022, 6:16 p.m. UTC | #7
On Wed, May 04, 2022 at 10:04:02AM -0700, Peter Collingbourne wrote:
> On Wed, May 04, 2022 at 12:20:19PM +0200, Peter Zijlstra wrote:
> > On Tue, May 03, 2022 at 03:02:44PM -0700, Josh Poimboeuf wrote:
> > 
> > > I'm not really qualified to comment on this too directly since I haven't
> > > looked very much at the variations on FineIBT/CFI/KCFI, and what the
> > > protections and drawbacks are for each approach, and when it might even
> > > make sense to combine them for a "paranoid user".
> > > 
> > > Since we have multiple similar and possibly competing technologies being
> > > discussed, one thing I do want to warn against is that we as kernel
> > > developers tend to err on the side of giving people too many choices and
> > > combinations which *never* get used.
> > 
> > So I don't think there's going to be a user choice here. If there's
> > hardware support, FineIBT makes more sense. That also means that kCFI no
> > longer needs to worry about IBT.
> > 
> > If we do something like:
> > 
> > 
> >         kCFI                                            FineIBT
> > 
> > __cfi_\sym:                                     __cfi_\sym:
> >         endbr                           # 4             endbr                   # 4
> >         sub $hash, %r10                 # 7             sub $hash, %r10         # 7
> >         je \sym                         # 2             je \sym                 # 2
> >         ud2                             # 2             ud2                     # 2
> > \sym:                                           \sym:
> > 
> > 
> > caller:                                         caller:
> >         cmpl $hash, -8(%r11)            # 8             movl $hash, %r10d       # 6
> >         je 1f                           # 2             sub 15, %r11            # 4
> >         ud2                             # 2             call *%r11              # 3
> > 1:      call __x86_indirect_thunk_r11   # 5             .nop 4                  # 4 (could even fix up r11 again)
> > 
> > 
> > Then, all that's required is a slight tweak to apply_retpolines() to
> > rewrite a little more text.
> > 
> > Note that this also does away with having to fix up the linker, since
> > all direct call will already point at \sym. It's just the IBT indirect
> > calls that need to frob the pointer in order to hit the ENDBR.
> > 
> > On top of that, we no longer have to special case the objtool
> > instruction decoder, the prelude are proper instructions now.
> 
> For kCFI this brings back the gadget problem that I mentioned here:
> https://lore.kernel.org/all/Yh7fLRYl8KgMcOe5@google.com/
> 
> because the hash at the call site is 8 bytes before the call
> instruction.

Damn, I forgot about that. Too subtle :-/

So Joao had another crazy idea, lemme diagram that to see if it works.

(sorry I inverted the order by accident)


	FineIBT						kCFI

__fineibt_\hash:
	xor	\hash, %r10	# 7
	jz	1f		# 2
	ud2			# 2
1:	ret			# 1
	int3			# 1


__cfi_\sym:					__cfi_\sym:
							int3; int3				# 2
	endbr			# 4			mov	\hash, %eax			# 5
	call	__fineibt_\hash	# 5			int3; int3				# 2
\sym:						\sym:
	...						...


caller:						caller:
	movl	\hash, %r10d	# 6			cmpl	\hash, -6(%r11)			# 8
	sub	$9, %r11	# 4			je	1f				# 2
	call	*%r11		# 3			ud2					# 2
	.nop 4			# 4 (or fixup r11)	call	__x86_indirect_thunk_r11	# 5


This way we also need to patch the __cfi_\sym contents, but we get a
little extra room to place the constant for kCFI in a suitable location.

It seems to preserve the properties of the last one in that direct calls
will already be correct and we don't need linker fixups, and objtool can
simply parse the preamble as regular instructions without needing
further help.
Sami Tolvanen May 5, 2022, 12:28 a.m. UTC | #8
On Wed, May 4, 2022 at 11:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> __cfi_\sym:                                     __cfi_\sym:
>                                                         int3; int3                              # 2
>         endbr                   # 4                     mov     \hash, %eax                     # 5
>         call    __fineibt_\hash # 5                     int3; int3                              # 2
> \sym:                                           \sym:

OK, that looks reasonable to me.

> It seems to preserve the properties of the last one in that direct calls
> will already be correct and we don't need linker fixups, and objtool can
> simply parse the preamble as regular instructions without needing
> further help.

Wouldn't objtool still print out unreachable instruction warnings here?

Sami
Peter Zijlstra May 5, 2022, 7:36 a.m. UTC | #9
On Wed, May 04, 2022 at 05:28:57PM -0700, Sami Tolvanen wrote:
> On Wed, May 4, 2022 at 11:17 AM Peter Zijlstra <peterz@infradead.org> wrote:
> > __cfi_\sym:                                     __cfi_\sym:
> >                                                         int3; int3                              # 2
> >         endbr                   # 4                     mov     \hash, %eax                     # 5
> >         call    __fineibt_\hash # 5                     int3; int3                              # 2
> > \sym:                                           \sym:
> 
> OK, that looks reasonable to me.
> 
> > It seems to preserve the properties of the last one in that direct calls
> > will already be correct and we don't need linker fixups, and objtool can
> > simply parse the preamble as regular instructions without needing
> > further help.
> 
> Wouldn't objtool still print out unreachable instruction warnings here?

Depends a bit on what kind of symbol they end up being, if they're
STT_FUNC we'll probably get the complaint that it falls through into the
next symbol, while if they're STT_NOTYPE then yes, we'll get the
unreachable thing.

So either way we need to special case the __cfi_\sym things anyway.

But that should be relatively straightforward. I think I would lean
towards making them STT_FUNC (they really are for FineIBT anyway) and
then suppressing the fallthrough warning for all functions that start
with "__cfi_". This way we get an ORC entry for them and the unwinder
will be happy.
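
In objtool that could be something as dumb as (hand-wavy, not actual
objtool code):

	#include <stdbool.h>
	#include <string.h>

	/*
	 * A __cfi_<sym> preamble is expected to fall through into <sym>,
	 * so the fall-through warning can be skipped for such symbols.
	 */
	static bool is_cfi_preamble(const char *sym_name)
	{
		return sym_name && !strncmp(sym_name, "__cfi_", 6);
	}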
Kees Cook May 8, 2022, 8:29 a.m. UTC | #10
On Wed, May 04, 2022 at 08:16:57PM +0200, Peter Zijlstra wrote:
> 	FineIBT						kCFI
> 
> __fineibt_\hash:
> 	xor	\hash, %r10	# 7
> 	jz	1f		# 2
> 	ud2			# 2
> 1:	ret			# 1
> 	int3			# 1
> 
> 
> __cfi_\sym:					__cfi_\sym:
> 							int3; int3				# 2
> 	endbr			# 4			mov	\hash, %eax			# 5
> 	call	__fineibt_\hash	# 5			int3; int3				# 2
> \sym:						\sym:
> 	...						...
> 
> 
> caller:						caller:
> 	movl	\hash, %r10d	# 6			cmpl	\hash, -6(%r11)			# 8
> 	sub	$9, %r11	# 4			je	1f				# 2
> 	call	*%r11		# 3			ud2					# 2
> 	.nop 4			# 4 (or fixup r11)	call	__x86_indirect_thunk_r11	# 5

This looks good!

And just to double-check my understanding here... \sym is expected to
start with endbr with IBT + kCFI?


Random extra thoughts... feel free to ignore. :) Given that both CFI
schemes depend on an attacker not being able to construct an executable
memory region that either starts with endbr (for FineIBT) or starts with
hash & 2 bytes (for kCFI), we should likely take another look at where
the kernel uses PAGE_KERNEL_EXEC.

It seems non-specialized use is entirely done via module_alloc(). Obviously
modules need to stay as-is. So we're left with other module_alloc()
callers: BPF JIT, ftrace, and kprobes.

Perhaps enabling CFI should tie bpf_jit_harden (which performs constant
blinding) to the value of bpf_jit_enable? (i.e. either use BPF VM which
reads from non-exec memory, or use BPF JIT with constant blinding.)
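
Something as blunt as this, perhaps (purely illustrative -- the knobs are
the existing sysctls, but the function and where it would be called from
are made up):

	#include <linux/filter.h>
	#include <linux/init.h>
	#include <linux/kernel.h>

	static void __init cfi_harden_bpf_jit(void)
	{
		/*
		 * With CFI enabled, don't allow the JIT without constant
		 * blinding; bpf_jit_harden == 2 blinds for all users.
		 */
		if (IS_ENABLED(CONFIG_X86_KERNEL_FINEIBT) && bpf_jit_enable)
			bpf_jit_harden = 2;
	}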

I *think* all the kprobes and ftrace stuff ends up using constructed
direct calls, though, yes? So if we did bounds checking, we could
"exclude" them as well as the BPF JIT. Though I'm not sure how
controllable the content written to the kprobes and ftrace regions are,
though?

For exclusion, we could separate actual modules from the other
module_alloc() users by maybe allocating in opposite directions from the
randomized offset and check indirect calls against the kernel text bounds
and the new modules-only bounds. Sounds expensive, though. Maybe PKS,
but I can't imagine 2 MSR writes per indirect call would be fast. Hmm...
Peter Zijlstra May 9, 2022, 11:22 a.m. UTC | #11
On Sun, May 08, 2022 at 01:29:13AM -0700, Kees Cook wrote:
> On Wed, May 04, 2022 at 08:16:57PM +0200, Peter Zijlstra wrote:
> > 	FineIBT						kCFI
> > 
> > __fineibt_\hash:
> > 	xor	\hash, %r10	# 7
> > 	jz	1f		# 2
> > 	ud2			# 2
> > 1:	ret			# 1
> > 	int3			# 1
> > 
> > 
> > __cfi_\sym:					__cfi_\sym:
> > 							int3; int3				# 2
> > 	endbr			# 4			mov	\hash, %eax			# 5
> > 	call	__fineibt_\hash	# 5			int3; int3				# 2
> > \sym:						\sym:
> > 	...						...
> > 
> > 
> > caller:						caller:
> > 	movl	\hash, %r10d	# 6			cmpl	\hash, -6(%r11)			# 8
> > 	sub	$9, %r11	# 4			je	1f				# 2
> > 	call	*%r11		# 3			ud2					# 2
> > 	.nop 4			# 4 (or fixup r11)	call	__x86_indirect_thunk_r11	# 5
> 
> This looks good!
> 
> And just to double-check my understanding here... \sym is expected to
> start with endbr with IBT + kCFI?

Ah, the thinking was that 'if IBT then FineIBT', so the combination of
kCFI and IBT is of no concern. And since FineIBT will have the ENDBR in
the __cfi_\sym thing, \sym will not need it.

But thinking about this now I suppose __nocfi call symbols will still
need the ENDBR on. Objtool IBT validation would need to either find
ENDBR or a matching __cfi_\sym I suppose.

So I was talking to Joao on IRC the other day, and I realized that if
kCFI generates code as per the above, then we can do FineIBT purely
in-kernel. That is; have objtool generate a section of __cfi_\sym
locations. Then use the .retpoline_sites and .cfi_sites to rewrite kCFI
into the FineIBT form in multi pass:

 - read all the __cfi_\sym sites and collect all unique hash values

 - allocate (module) memory and write __fineibt_\hash functions for each
   unique hash value found

 - rewrite callers; nop out kCFI

 - rewrite all __cfi_\sym

 - rewrite all callers

 - enable IBT

And the same on module load I suppose.

But I've only thought about this, not actually implemented it, so who
knows what surprises are lurking there :-)

> Random extra thoughts... feel free to ignore. :) Given that both CFI
> schemes depend on an attacker not being able to construct an executable
> memory region that either starts with endbr (for FineIBT) or starts with
> hash & 2 bytes (for kCFI), we should likely take another look at where
> the kernel uses PAGE_KERNEL_EXEC.
> 
> It seems non-specialized use is entirely done via module_alloc(). Obviously
> modules need to stay as-is. So we're left with other module_alloc()
> callers: BPF JIT, ftrace, and kprobes.
> 
> Perhaps enabling CFI should tie bpf_jit_harden (which performs constant
> blinding) to the value of bpf_jit_enable? (i.e. either use BPF VM which
> reads from non-exec memory, or use BPF JIT with constant blinding.)
> 
> I *think* all the kprobes and ftrace stuff ends up using constructed
> direct calls, though, yes? So if we did bounds checking, we could
> "exclude" them as well as the BPF JIT. Though I'm not sure how
> controllable the content written to the kprobes and ftrace regions are,
> though?

Both ftrace and kprobe only write fairly simple trampolines based off of
a template. Neither has indirect calls.

> For exclusion, we could separate actual modules from the other
> module_alloc() users by maybe allocating in opposite directions from the
> randomized offset and check indirect calls against the kernel text bounds
> and the new modules-only bounds. Sounds expensive, though. Maybe PKS,
> but I can't imagine 2 MSR writes per indirect call would be fast. Hmm...

I'm not sure what problem you're trying to solve..

Patch

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4faac48ebec5..901b702fb16d 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -295,6 +295,7 @@  SYM_CODE_START(ret_from_fork)
 	/* kernel thread */
 	UNWIND_HINT_EMPTY
 	movq	%r12, %rdi
+	FINEIBT_HASH(0x89f22991)
 	CALL_NOSPEC rbx
 	/*
 	 * A kernel thread is allowed to return here after successfully
diff --git a/arch/x86/include/asm/ibt.h b/arch/x86/include/asm/ibt.h
index 689880eca9ba..580aa8b83bb2 100644
--- a/arch/x86/include/asm/ibt.h
+++ b/arch/x86/include/asm/ibt.h
@@ -21,6 +21,16 @@ 
 
 #define HAS_KERNEL_IBT	1
 
+#if defined(CONFIG_X86_KERNEL_FINEIBT) && !defined(BUILD_VDSO) && !defined(BUILD_VDSO32_64)
+#define HAS_KERNEL_FINEIBT  1
+#define FINEIBT_HASH(hash) mov $hash, %r11d
+#define __coarseendbr __attribute__((coarsecf_check))
+#else
+#define HAS_KERNEL_FINEIBT  0
+#define FINEIBT_HASH(hash)
+#define __coarseendbr
+#endif
+
 #ifndef __ASSEMBLY__
 
 #ifdef CONFIG_X86_64
@@ -29,6 +39,7 @@ 
 #define ASM_ENDBR	"endbr32\n\t"
 #endif
 
+#undef __noendbr
 #define __noendbr	__attribute__((nocf_check))
 
 static inline __attribute_const__ u32 gen_endbr(void)
@@ -80,12 +91,17 @@  extern __noendbr void ibt_restore(u64 save);
 #else /* !IBT */
 
 #define HAS_KERNEL_IBT	0
+#define HAS_KERNEL_FINEIBT 0
+#define FINEIBT_HASH(hash)
 
 #ifndef __ASSEMBLY__
 
 #define ASM_ENDBR
 
+#undef __noendbr
 #define __noendbr
+#undef __coarseendbr
+#define __coarseendbr
 
 static inline bool is_endbr(u32 val) { return false; }
 
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 896e48d45828..d2a2f6456403 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -49,10 +49,10 @@  extern unsigned long saved_video_mode;
 
 extern void reserve_standard_io_resources(void);
 extern void i386_reserve_resources(void);
-extern unsigned long __startup_64(unsigned long physaddr, struct boot_params *bp);
-extern unsigned long __startup_secondary_64(void);
-extern void startup_64_setup_env(unsigned long physbase);
-extern void early_setup_idt(void);
+extern unsigned long __coarseendbr __startup_64(unsigned long physaddr, struct boot_params *bp);
+extern unsigned long __coarseendbr __startup_secondary_64(void);
+extern void __coarseendbr startup_64_setup_env(unsigned long physbase);
+extern void __coarseendbr early_setup_idt(void);
 extern void __init do_early_exception(struct pt_regs *regs, int trapnr);
 
 #ifdef CONFIG_X86_INTEL_MID
@@ -137,8 +137,8 @@  extern void probe_roms(void);
 asmlinkage void __init i386_start_kernel(void);
 
 #else
-asmlinkage void __init x86_64_start_kernel(char *real_mode);
-asmlinkage void __init x86_64_start_reservations(char *real_mode_data);
+asmlinkage void __init __coarseendbr x86_64_start_kernel(char *real_mode);
+asmlinkage void __init __coarseendbr x86_64_start_reservations(char *real_mode_data);
 
 #endif /* __i386__ */
 #endif /* _SETUP */
diff --git a/arch/x86/include/asm/special_insns.h b/arch/x86/include/asm/special_insns.h
index 68c257a3de0d..7f32ef8d23f0 100644
--- a/arch/x86/include/asm/special_insns.h
+++ b/arch/x86/include/asm/special_insns.h
@@ -49,7 +49,7 @@  static inline unsigned long __native_read_cr3(void)
 	return val;
 }
 
-static inline void native_write_cr3(unsigned long val)
+static inline void __coarseendbr native_write_cr3(unsigned long val)
 {
 	asm volatile("mov %0,%%cr3": : "r" (val) : "memory");
 }
@@ -74,7 +74,7 @@  static inline unsigned long native_read_cr4(void)
 	return val;
 }
 
-void native_write_cr4(unsigned long val);
+void __coarseendbr native_write_cr4(unsigned long val);
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 static inline u32 rdpkru(void)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ed4417500700..70e94194ee2b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -463,7 +463,7 @@  void native_write_cr0(unsigned long val)
 }
 EXPORT_SYMBOL(native_write_cr0);
 
-void __no_profile native_write_cr4(unsigned long val)
+void __no_profile __coarseendbr native_write_cr4(unsigned long val)
 {
 	unsigned long bits_changed = 0;
 
diff --git a/arch/x86/kernel/fineibt.c b/arch/x86/kernel/fineibt.c
new file mode 100644
index 000000000000..685e4308d86e
--- /dev/null
+++ b/arch/x86/kernel/fineibt.c
@@ -0,0 +1,123 @@ 
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This file contains the FineIBT handler function.
+ */
+
+#include <linux/export.h>
+#include <linux/printk.h>
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <asm/ibt.h>
+
+void __noendbr __fineibt_handler(void);
+
+void __fineibt_debug(void) {
+	asm volatile ("nop\n");
+	printk("fineibt debug\n");
+};
+EXPORT_SYMBOL(__fineibt_debug);
+
+#define FINEIBT_VADDR_LEN 4096
+#define DO_ALL_PUSHS    \
+	asm("nop\n\t"         \
+	    "push %rsi\n\t"   \
+	    "push %rdi\n\t"   \
+	    "push %rdx\n\t"   \
+	    "push %rcx\n\t"   \
+	    "push %rbx\n\t"   \
+	    "push %rax\n\t"   \
+	    "push %r8\n\t"    \
+	    "push %r9\n\t"    \
+	    "push %r10\n\t"   \
+	    "push %r11\n\t"   \
+	    "push %r12\n\t"   \
+	    "push %r13\n\t"   \
+	    "push %r14\n\t"   \
+	    "push %r15\n\t")
+
+#define DO_ALL_POPS    \
+	asm("nop\n\t"        \
+	    "pop %r15\n\t"   \
+	    "pop %r14\n\t"   \
+	    "pop %r13\n\t"   \
+	    "pop %r12\n\t"   \
+	    "pop %r11\n\t"   \
+	    "pop %r10\n\t"   \
+	    "pop %r9\n\t"    \
+	    "pop %r8\n\t"    \
+	    "pop %rax\n\t"   \
+	    "pop %rbx\n\t"   \
+	    "pop %rcx\n\t"   \
+	    "pop %rdx\n\t"   \
+	    "pop %rdi\n\t"   \
+	    "pop %rsi\n\t")
+
+struct fineibt_violation{
+	void * vaddr;
+	void * caddr;
+	bool printed;
+};
+
+typedef struct fineibt_violation fineibt_violation;
+
+static fineibt_violation vlts[FINEIBT_VADDR_LEN];
+static unsigned long vlts_next = 0;
+static bool vlts_initialize = true;
+static DEFINE_SPINLOCK(fineibt_lock);
+
+void __noendbr __fineibt_handler(void){
+	unsigned i;
+	unsigned long flags;
+	bool skip;
+	void * ret;
+	void * caller;
+
+	DO_ALL_PUSHS;
+
+	spin_lock_irqsave(&fineibt_lock, flags);
+	skip = false;
+
+	asm("\t movq 0x90(%%rsp),%0" : "=r"(ret));
+	asm("\t movq 0x98(%%rsp),%0" : "=r"(caller));
+
+	if(vlts_initialize){
+		for(i = 0; i < FINEIBT_VADDR_LEN; i++) {
+			vlts[i].vaddr = 0;
+			vlts[i].caddr = 0;
+			vlts[i].printed = 0;
+		}
+		vlts_initialize = false;
+	}
+
+	if(vlts_next >= FINEIBT_VADDR_LEN) {
+		if(vlts_next == FINEIBT_VADDR_LEN) {
+			printk("FineIBT reached max buffer\n");
+			vlts_next++;
+		}
+		skip = true;
+	}
+
+	for(i = 0; i < vlts_next; i++){
+		if(vlts[i].vaddr == ret && vlts[i].caddr == caller) {
+			skip = true;
+			break;
+		}
+	}
+
+	if(!skip) {
+		vlts[vlts_next].vaddr = ret;
+		vlts[vlts_next].caddr = caller;
+		vlts[vlts_next].printed = 0;
+		vlts_next = vlts_next + 1;
+	}
+
+	spin_unlock_irqrestore(&fineibt_lock, flags);
+
+	if(!skip) {
+		printk("FineIBT violation: %px:%px:%u\n", ret, caller,
+				vlts_next);
+	}
+	DO_ALL_POPS;
+}
+
+EXPORT_SYMBOL(__fineibt_handler);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 4f5ecbbaae77..f773d771e07d 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -162,7 +162,7 @@  static unsigned long __head sme_postprocess_startup(struct boot_params *bp, pmdv
  * boot-time crashes. To work around this problem, every global pointer must
  * be adjusted using fixup_pointer().
  */
-unsigned long __head __startup_64(unsigned long physaddr,
+unsigned long __head __coarseendbr __startup_64(unsigned long physaddr,
 				  struct boot_params *bp)
 {
 	unsigned long load_delta, *p;
@@ -308,7 +308,7 @@  unsigned long __head __startup_64(unsigned long physaddr,
 	return sme_postprocess_startup(bp, pmd);
 }
 
-unsigned long __startup_secondary_64(void)
+unsigned long __coarseendbr __startup_secondary_64(void)
 {
 	/*
 	 * Return the SME encryption mask (if SME is active) to be used as a
@@ -464,8 +464,8 @@  static void __init copy_bootdata(char *real_mode_data)
 	sme_unmap_bootdata(real_mode_data);
 }
 
-asmlinkage __visible void __init x86_64_start_kernel(char * real_mode_data)
-{
+asmlinkage __visible void __init __coarseendbr
+x86_64_start_kernel(char * real_mode_data) {
 	/*
 	 * Build-time sanity checks on the kernel image and module
 	 * area mappings. (these are purely build-time and produce no code)
@@ -597,7 +597,7 @@  static void startup_64_load_idt(unsigned long physbase)
 }
 
 /* This is used when running on kernel addresses */
-void early_setup_idt(void)
+void __coarseendbr early_setup_idt(void)
 {
 	/* VMM Communication Exception */
 	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
@@ -610,7 +610,7 @@  void early_setup_idt(void)
 /*
  * Setup boot CPU state needed before kernel switches to virtual addresses.
  */
-void __head startup_64_setup_env(unsigned long physbase)
+void __head __coarseendbr startup_64_setup_env(unsigned long physbase)
 {
 	/* Load GDT */
 	startup_gdt_descr.address = (unsigned long)fixup_pointer(startup_gdt, physbase);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 2ef14772dc04..5fa17d5716bb 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -214,7 +214,7 @@  static int enable_start_cpu0;
 /*
  * Activate a secondary processor.
  */
-static void notrace start_secondary(void *unused)
+static void notrace __coarseendbr start_secondary(void *unused)
 {
 	/*
 	 * Don't put *anything* except direct CPU state initialization