
[RFC,09/10] x86/enter: Create macros to restrict/unrestrict Indirect Branch Speculation

Message ID 20180123075358.nztpyxympwfkyi2a@gmail.com (mailing list archive)
State New, archived

Commit Message

Ingo Molnar Jan. 23, 2018, 7:53 a.m. UTC
* Ingo Molnar <mingo@kernel.org> wrote:

> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > But wait, why did I say "mostly"? Well, not everyone has a retpoline
> > compiler yet... but OK, screw them; they need to update.
> > 
> > Then there's Skylake, and that generation of CPU cores. For complicated
> > reasons they actually end up being vulnerable not just on indirect
> > branches, but also on a 'ret' in some circumstances (such as 16+ CALLs
> > in a deep chain).
> > 
> > The IBRS solution, ugly though it is, did address that. Retpoline
> > doesn't. There are patches being floated to detect and prevent deep
> > stacks, and deal with some of the other special cases that bite on SKL,
> > but those are icky too. And in fact IBRS performance isn't anywhere
> > near as bad on this generation of CPUs as it is on earlier CPUs
> > *anyway*, which makes it not quite so insane to *contemplate* using it
> > as Intel proposed.
> 
> There's another possible method to avoid deep stacks on Skylake, without compiler 
> support:
> 
>   - Use the existing mcount based function tracing live patching machinery
>     (CONFIG_FUNCTION_TRACER=y) to install a _very_ fast and simple stack depth 
>     tracking tracer which would issue a retpoline when stack depth crosses 
>     boundaries of ~16 entries.

The patch below demonstrates the principle: it forcibly enables dynamic ftrace 
patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:

  ffffffff81a01a40 <__fentry__>:
  ffffffff81a01a40:       c3                      retq   

This would have to be extended with (very simple) call stack depth tracking (just 
3 more instructions would do in the fast path I believe) and a suitable SkyLake 
workaround (and also has to play nice with the ftrace callbacks).
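For illustration, a rough sketch of what such a depth-tracking fast path might 
look like (hypothetical: the counter name, its per-CPU placement and the refill 
handling are illustrative only, not part of the patch below):

__fentry__:
	incl	PER_CPU_VAR(__rsb_call_depth)		/* one more call in flight */
	cmpl	$16, PER_CPU_VAR(__rsb_call_depth)	/* RSB capacity reached? */
	jae	.Lrsb_refill				/* rare slow path */
	retq
.Lrsb_refill:
	movl	$0, PER_CPU_VAR(__rsb_call_depth)
	/* RSB-stuffing / retpoline sequence would go here */
	retq

The fast path is the three instructions in front of the RET; the refill path 
would only be taken when the call depth crosses the assumed 16-entry boundary.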

On non-SkyLake the overhead would be 0 cycles.

On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
obviously all this code and data would be very cache hot. Given that the average 
number of function calls per system call is around a dozen, this would be _much_ 
faster than any microcode/MSR based approach.

Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 
Is there a description of the exact speculative execution vulnerability that has 
to be addressed to begin with?

If this approach is workable I'd much prefer it to any MSR writes in the syscall 
entry path not just because it's fast enough in practice to not be turned off by 
everyone, but also because everyone would agree that per function call overhead 
needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
flexible, simpler, faster and more complete than microcode/firmware or compiler 
based solutions.

Assuming the vulnerability can be addressed via this route that is, which is a big 
assumption!

Thanks,

	Ingo

 arch/x86/Kconfig            | 3 +++
 arch/x86/kernel/ftrace_64.S | 1 +
 2 files changed, 4 insertions(+)

Comments

Ingo Molnar Jan. 23, 2018, 9:27 a.m. UTC | #1
* Ingo Molnar <mingo@kernel.org> wrote:

> Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 
> Is there a description of the exact speculative execution vulnerability that has 
> to be addressed to begin with?

Ok, so for now I'm assuming that this is the 16-entry return-stack-buffer 
underflow condition where SkyLake falls back to the branch predictor (while other 
CPUs wrap the buffer).

> If this approach is workable I'd much prefer it to any MSR writes in the syscall 
> entry path not just because it's fast enough in practice to not be turned off by 
> everyone, but also because everyone would agree that per function call overhead 
> needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
> flexible, simpler, faster and more complete than microcode/firmware or compiler 
> based solutions.
> 
> Assuming the vulnerability can be addressed via this route that is, which is a big 
> assumption!

So I talked this over with PeterZ, and I think it's all doable:

 - the CALL __fentry__ callbacks maintain the depth tracking (on the kernel 
   stack, fast to access), and issue an "RSB-stuffing sequence" when depth reaches
   16 entries.

 - "the RSB-stuffing sequence" is a return trampoline that pushes a CALL on the 
   stack which is executed on the RET.

 - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The 
   tracking could probably be made IRQ and maybe even NMI safe, but the worst-case 
   nesting scenarios make my head ache.)

I.e. IBRS can be mostly replaced with a kernel based solution that is better than 
IBRS and which does not negatively impact any other non-SkyLake CPUs or general 
code quality.

I.e. a full upstream Spectre solution.

Thanks,

	Ingo
David Woodhouse Jan. 23, 2018, 9:30 a.m. UTC | #2
On Tue, 2018-01-23 at 08:53 +0100, Ingo Molnar wrote:
> 
> The patch below demonstrates the principle: it forcibly enables dynamic ftrace 
> patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:
> 
>   ffffffff81a01a40 <__fentry__>:
>   ffffffff81a01a40:       c3                      retq   
> 
> This would have to be extended with (very simple) call stack depth tracking (just 
> 3 more instructions would do in the fast path I believe) and a suitable SkyLake 
> workaround (and also has to play nice with the ftrace callbacks).
> 
> On non-SkyLake the overhead would be 0 cycles.

The overhead of forcing CONFIG_DYNAMIC_FTRACE=y is precisely zero
cycles? That seems a little optimistic. ;)

I'll grant you if it goes straight to a 'ret' it isn't *that* high
though.

> On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> obviously all this code and data would be very cache hot. Given that the average 
> number of function calls per system call is around a dozen, this would be _much_ 
> faster than any microcode/MSR based approach.

That's kind of neat, except you don't want it at the top of the
function; you want it at the bottom.

If you could hijack the *return* site, then you could check for
underflow and stuff the RSB right there. But in __fentry__ there's not
a lot you can do other than complain that something bad is going to
happen in the future. You know that a string of 16+ rets is going to
happen, but you've got no gadget in *there* to deal with it when it
does.

HJ did have patches to turn 'ret' into a form of retpoline, which I
don't think ever even got performance-tested. They'd have forced a
mispredict on *every* ret. A cheaper option might be to turn ret into a
'jmp skylake_ret_hack', which on pre-SKL will be a bare ret, while SKL+
can do the counting (in conjunction with a 'per_cpu(call_depth)++' in
__fentry__) and stuff the RSB before actually returning, when
appropriate.

By the time you've made it work properly, I suspect we're approaching
the barf-factor of IBRS, for a less complete solution.

> Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 

Andi's been experimenting at 
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/deep-chain-3

> Is there a description of the exact speculative execution vulnerability that has 
> to be addressed to begin with?

"It takes predictions from the generic branch target buffer when the
RSB underflows".

IBRS filters what can come from the BTB, and resolves the problem that
way. Retpoline avoids the indirect branches that on *earlier* CPUs were
the only things that would use the offending predictions. But on SKL,
now 'ret' is one of the problematic instructions too. Fun! :)
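For reference, this is the retpoline sequence we're talking about (the %rax
variant of the indirect branch thunk), which is why plain indirect branches are
covered while a bare 'ret' on SKL is not:

__x86_indirect_thunk_rax:
	call	.Ldo_ret
.Lspec_trap:
	pause			/* speculation lands here, harmlessly */
	lfence
	jmp	.Lspec_trap
.Ldo_ret:
	mov	%rax, (%rsp)	/* overwrite the return address with the
				 * real branch target */
	ret			/* RSB predicts .Lspec_trap; the architectural
				 * return goes to the real target */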

> If this approach is workable I'd much prefer it to any MSR writes in the syscall 
> entry path not just because it's fast enough in practice to not be turned off by 
> everyone, but also because everyone would agree that per function call overhead 
> needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
> flexible, simpler, faster and more complete than microcode/firmware or compiler 
> based solutions.
> 
> Assuming the vulnerability can be addressed via this route that is, which is a big 
> assumption!

I think it's close. There are some other cases which empty the RSB,
like sleeping and loading microcode, which can happily be special-
cased. Andi's rounded up many of the remaining details already at 
https://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc.git/log/?h=spec/skl-rsb-3

And there's SMI, which is a pain but I think Linus is right we can
possibly just stick our fingers in our ears and pretend we didn't hear
about that one as it's likely to be hard to trigger (famous last
words).

On the whole though, I think you can see why we're keeping IBRS around
for now, sent out purely as an RFC and rebased on top of the stuff
we're *actually* sending to Linus for inclusion.

When we have a clear idea of what we're doing for Skylake, it'll be
useful to have a proper comparison of the security, the performance and
the "ick" factor of whatever we come up with, vs. IBRS.

Right now the plan is just "screw Skylake"; we'll just forget it's a
special snowflake and treat it like everything else, except for a bit
of extra RSB-stuffing on context switch (since we had to add that for
!SMEP anyway). And that's not *entirely* unreasonable but as I said I'd
*really* like to have a decent analysis of the implications of that,
not just some hand-wavy "nah, it'll be fine".
David Woodhouse Jan. 23, 2018, 9:37 a.m. UTC | #3
On Tue, 2018-01-23 at 10:27 +0100, Ingo Molnar wrote:
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > 
> > Is there a testcase for the SkyLake 16-deep-call-stack problem that I could run? 
> > Is there a description of the exact speculative execution vulnerability that has 
> > to be addressed to begin with?
>
> Ok, so for now I'm assuming that this is the 16-entry return-stack-buffer 
> underflow condition where SkyLake falls back to the branch predictor (while other 
> CPUs wrap the buffer).

Yep.

> > 
> > If this approach is workable I'd much prefer it to any MSR writes in the syscall 
> > entry path not just because it's fast enough in practice to not be turned off by 
> > everyone, but also because everyone would agree that per function call overhead 
> > needs to go away on new CPUs. Both deployment and backporting are also _much_ more 
> > flexible, simpler, faster and more complete than microcode/firmware or compiler 
> > based solutions.
> > 
> > Assuming the vulnerability can be addressed via this route that is, which is a big 
> > assumption!
>
> So I talked this over with PeterZ, and I think it's all doable:
> 
>  - the CALL __fentry__ callbacks maintain the depth tracking (on the kernel 
>    stack, fast to access), and issue an "RSB-stuffing sequence" when depth reaches
>    16 entries.
> 
>  - "the RSB-stuffing sequence" is a return trampoline that pushes a CALL on the 
>    stack which is executed on the RET.

That's neat. We'll want to make sure the unwinder can cope but hey,
Peter *loves* hacking objtool, right? :)

>  - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The 
>    tracking could probably be made IRQ and maybe even NMI safe, but the worst-case 
>    nesting scenarios make my head ache.)
> 
> I.e. IBRS can be mostly replaced with a kernel based solution that is better than 
> IBRS and which does not negatively impact any other non-SkyLake CPUs or general 
> code quality.
> 
> I.e. a full upstream Spectre solution.

Sounds good. I look forward to seeing it.

In the meantime I'll resend the basic bits of the feature detection and
especially turning off KPTI when RDCL_NO is set.

We do also want to do IBPB even with retpoline, so I'll send those
patches for KVM and context switch. There is some bikeshedding to be
done there about the precise conditions under which we do it.

Finally, KVM should be *exposing* IBRS to guests even if we don't use
it ourselves. We'll do that too.
Ingo Molnar Jan. 23, 2018, 10:15 a.m. UTC | #4
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2018-01-23 at 08:53 +0100, Ingo Molnar wrote:
> > 
> > The patch below demonstrates the principle: it forcibly enables dynamic ftrace 
> > patching (CONFIG_DYNAMIC_FTRACE=y et al) and turns mcount/__fentry__ into a RET:
> > 
> >   ffffffff81a01a40 <__fentry__>:
> >   ffffffff81a01a40:       c3                      retq   
> > 
> > This would have to be extended with (very simple) call stack depth tracking (just 
> > 3 more instructions would do in the fast path I believe) and a suitable SkyLake 
> > workaround (and also has to play nice with the ftrace callbacks).
> > 
> > On non-SkyLake the overhead would be 0 cycles.
> 
> The overhead of forcing CONFIG_DYNAMIC_FTRACE=y is precisely zero
> cycles? That seems a little optimistic. ;)

The overhead of the quick hack patch I sent to show what exact code I mean is 
obviously not zero.

The overhead of using my proposed solution, to utilize the function call callback 
that CONFIG_DYNAMIC_FTRACE=y provides, is exactly zero on non-SkyLake systems 
where the callback is patched out, on typical Linux distros.

The callback is widely enabled on distro kernels:

  Fedora:                    CONFIG_DYNAMIC_FTRACE=y
  Ubuntu:                    CONFIG_DYNAMIC_FTRACE=y
  OpenSuse (default flavor): CONFIG_DYNAMIC_FTRACE=y

BTW., the reason this is enabled on all distro kernels is because the overhead is 
a single patched-in NOP instruction at the function entry, when tracing is 
disabled. So it's not even a CALL+RET - it's a patched-in NOP.
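For reference, the patched call site looks roughly like this (illustrative; the 
exact NOP bytes are picked at runtime to suit the CPU):

  /* what GCC emits at the start of every instrumented function with -pg -mfentry: */
  callq	__fentry__		/* e8 xx xx xx xx  (5 bytes) */

  /* what dynamic ftrace patches that site into while tracing is disabled: */
  nopl	0x0(%rax,%rax,1)	/* 0f 1f 44 00 00  (5-byte NOP) */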

Thanks,

	Ingo
Ingo Molnar Jan. 23, 2018, 10:23 a.m. UTC | #5
* David Woodhouse <dwmw2@infradead.org> wrote:

> > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> > obviously all this code and data would be very cache hot. Given that the average 
> > number of function calls per system call is around a dozen, this would be _much_ 
> > faster than any microcode/MSR based approach.
> 
> That's kind of neat, except you don't want it at the top of the
> function; you want it at the bottom.
> 
> If you could hijack the *return* site, then you could check for
> underflow and stuff the RSB right there. But in __fentry__ there's not
> a lot you can do other than complain that something bad is going to
> happen in the future. You know that a string of 16+ rets is going to
> happen, but you've got no gadget in *there* to deal with it when it
> does.

No, it can be done with the existing CALL instrumentation callback that 
CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from 
the CALL trampoline - see my previous email.
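A very rough sketch of what I mean, here writing the trampoline address over the
original return address rather than pushing an extra slot (illustrative only:
the names are made up, and the single per-CPU save slot would have to become a
small per-thread stack to survive nesting):

__fentry__:
	/* on entry: 0(%rsp) = return into the traced function,
	 *           8(%rsp) = the traced function's own return address */
	incl	PER_CPU_VAR(__rsb_call_depth)
	cmpl	$16, PER_CPU_VAR(__rsb_call_depth)
	jb	1f
	pushq	%rax
	movq	16(%rsp), %rax			/* save the real return address */
	movq	%rax, PER_CPU_VAR(__rsb_saved_ret)
	movq	$__rsb_stuff_return, 16(%rsp)	/* hijack the function's RET */
	popq	%rax
1:	retq

__rsb_stuff_return:
	/* reached via the traced function's RET */
	movl	$0, PER_CPU_VAR(__rsb_call_depth)
	/* ... stuff the RSB here ... */
	pushq	PER_CPU_VAR(__rsb_saved_ret)
	retq					/* back to the real caller */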

> HJ did have patches to turn 'ret' into a form of retpoline, which I
> don't think ever even got performance-tested.

Return instrumentation is possible as well, but there are two major drawbacks:

 - GCC support for it is not as widely available and return instrumentation is 
   less tested in Linux kernel contexts

 - a major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already 
   enabled in distros here and today, so the runtime overhead to non-SkyLake CPUs 
   would be literally zero, while still allowing us to fix the RSB vulnerability on 
   SkyLake.

Thanks,

	Ingo
David Woodhouse Jan. 23, 2018, 10:27 a.m. UTC | #6
On Tue, 2018-01-23 at 11:15 +0100, Ingo Molnar wrote:
> 
> BTW., the reason this is enabled on all distro kernels is because the overhead is 
> a single patched-in NOP instruction at the function entry, when tracing is 
> disabled. So it's not even a CALL+RET - it's a patched-in NOP.

Hm? We still have GCC emitting 'call __fentry__' don't we? Would be
nice to get to the point where we can patch *that* out into a NOP... or
are you saying we already can?

But this is a digression. I was being pedantic about the "0 cycles" but
sure, this would be perfectly tolerable.
Ingo Molnar Jan. 23, 2018, 10:44 a.m. UTC | #7
* David Woodhouse <dwmw2@infradead.org> wrote:

> On Tue, 2018-01-23 at 11:15 +0100, Ingo Molnar wrote:
> > 
> > BTW., the reason this is enabled on all distro kernels is because the overhead 
> > is a single patched-in NOP instruction at the function entry, when tracing 
> > is disabled. So it's not even a CALL+RET - it's a patched-in NOP.
> 
> Hm? We still have GCC emitting 'call __fentry__' don't we? Would be nice to get 
> to the point where we can patch *that* out into a NOP... or are you saying we 
> already can?

Yes, we already can and do patch the 'call __fentry__'/mcount call sites into a 
NOP today - all 50,000+ call sites on a typical distro kernel.

We have done so for a long time - this is all a well-established, working mechanism.

> But this is a digression. I was being pedantic about the "0 cycles" but sure, 
> this would be perfectly tolerable.

It's not a digression, for two reasons:

- I wanted to make it clear that for distro kernels it _is_ a zero cycles overhead
  mechanism for non-SkyLake CPUs, literally.

- I noticed that Meltdown and the CR3 writes for PTI appear to have established a
  kind of ... insensitivity and numbness to kernel micro-costs, which peaked with
  the per-syscall MSR write nonsense patch of the SkyLake workaround.
  That attitude is totally unacceptable to me as x86 maintainer and yes, still
  every cycle counts.

Thanks,

	Ingo
Dave Hansen Jan. 23, 2018, 3:01 p.m. UTC | #8
On 01/23/2018 01:27 AM, Ingo Molnar wrote:
> 
>  - All asynchronous contexts (IRQs, NMIs, etc.) stuff the RSB before IRET. (The 
>    tracking could probably be made IRQ and maybe even NMI safe, but the worst-case 
>    nesting scenarios make my head ache.)

This all sounds totally workable to me.  We talked about using ftrace
itself to track call depth, but it would be unusable in production, of
course.  This seems workable, though.  You're also totally right about
the zero overhead on most kernels with it turned off when we don't need
RSB underflow protection (basically pre-Skylake).

I also agree that the safe thing to do is to just stuff before iret.  I
bet we can get a ftrace-driven RSB tracker working precisely enough even
with NMIs, but it's way simpler to just stuff and be done with it for now.
Thomas Gleixner Feb. 4, 2018, 6:43 p.m. UTC | #9
On Tue, 23 Jan 2018, Ingo Molnar wrote:
> * David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > > On SkyLake this would add an overhead of maybe 2-3 cycles per function call and 
> > > obviously all this code and data would be very cache hot. Given that the average 
> > > number of function calls per system call is around a dozen, this would be _much_ 
> > > faster than any microcode/MSR based approach.
> > 
> > That's kind of neat, except you don't want it at the top of the
> > function; you want it at the bottom.
> > 
> > If you could hijack the *return* site, then you could check for
> > underflow and stuff the RSB right there. But in __fentry__ there's not
> > a lot you can do other than complain that something bad is going to
> > happen in the future. You know that a string of 16+ rets is going to
> > happen, but you've got no gadget in *there* to deal with it when it
> > does.
> 
> No, it can be done with the existing CALL instrumentation callback that 
> CONFIG_DYNAMIC_FTRACE=y provides, by pushing a RET trampoline on the stack from 
> the CALL trampoline - see my previous email.
> 
> > HJ did have patches to turn 'ret' into a form of retpoline, which I
> > don't think ever even got performance-tested.
> 
> Return instrumentation is possible as well, but there are two major drawbacks:
> 
>  - GCC support for it is not as widely available and return instrumentation is 
>    less tested in Linux kernel contexts
> 
>  - a major point of my suggestion is that CONFIG_DYNAMIC_FTRACE=y is already 
>    enabled in distros here and today, so the runtime overhead to non-SkyLake CPUs 
>    would be literally zero, while still allowing us to fix the RSB vulnerability on 
>    SkyLake.

I played around with that a bit during the week and it turns out to be less
simple than you thought.

1) Injecting a trampoline return only works for functions which have all
   arguments in registers. For functions with arguments on the stack, like all
   vararg functions, this breaks because the function won't find its arguments
   anymore.

   I have not yet found a way to reliably figure out which functions have
   arguments on the stack. Simply ignoring them might be an option.

   The workaround is to replace the original return address on the stack with the
   trampoline and store the original return address in a per-thread stack, which I
   implemented. But this sucks badly performance-wise.

2) Doing the whole dance on function entry has a real downside because you
   refill the RSB on every 15th return no matter whether it's required or
   not. That really gives a very prominent performance hit.

An alternative idea is to do the following (not yet implemented):

__fentry__:
	incl	PER_CPU_VAR(call_depth)
	retq

and use -mfunction-return=thunk-extern which is available on retpoline
enabled compilers. That's a reasonable requirement because w/o retpoline
the whole SKL magic is pointless anyway.

-mfunction-return=thunk-extern issues

	jmp	__x86_return_thunk

instead of ret. In the thunk we can do the whole shebang of mitigation.
That jump can be identified at build time and it can be patched into a ret
for unaffected CPUs. Ideally we do the patching at build time and only
patch the jump in when SKL is detected or paranoia requests it.
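For reference, that would make the generated code look roughly like this
(illustrative byte patterns):

	/* with -mfunction-return=thunk-extern, every 'ret' becomes: */
	jmp	__x86_return_thunk	/* e9 xx xx xx xx  (5 bytes) */

	/* which could be patched back, on unaffected CPUs, into: */
	retq				/* c3              (1 byte)  */
	nopl	0x0(%rax)		/* 0f 1f 40 00     (4 bytes) */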

We could actually look into that for tracing as well. The only reason we don't
do that is that we want to select the ideal NOP for the CPU the kernel runs on,
which obviously cannot be known at build time.

__x86_return_thunk would look like this:

__x86_return_thunk:
	testl	$0xf, PER_CPU_VAR(call_depth)
	jnz	1f	
	stuff_rsb
   1:
	decl	PER_CPU_VAR(call_depth)
   	ret

The call_depth variable would be reset on context switch.
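The stuff_rsb above is meant as a placeholder for the usual RSB filling
sequence, along the lines of (illustrative):

	.macro STUFF_RSB
	.rept	16
	call	2f		/* each call pushes one RSB entry */
1:	pause			/* speculation trap, only ever */
	lfence			/* reached speculatively */
	jmp	1b
2:
	.endr
	addq	$(16*8), %rsp	/* drop the 16 return addresses again */
	.endm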

Though that has another problem: tail calls. Tail calls will invoke the
__fentry__ call of the tail called function, which makes the call_depth
counter unbalanced. Tail calls can be prevented by using
-fno-optimize-sibling-calls, but that probably sucks as well.

Yet another possibility is to avoid the function entry and accounting magic
and use the generic gcc return thunk:

__x86_return_thunk:
	call L2
L1:
	pause
	lfence
	jmp L1
L2:
	lea 8(%rsp), %rsp|lea 4(%esp), %esp
	ret

which basically refills the RSB on every return. That can be inline or
extern, but in both cases we should be able to patch it out.

I have no idea how that affects performance, but it might be worthwhile to
experiment with that.

If nobody beats me to it, I'll play around with that some more after
vacation.

Thanks,

	tglx
David Woodhouse Feb. 4, 2018, 8:22 p.m. UTC | #10
On Sun, 2018-02-04 at 19:43 +0100, Thomas Gleixner wrote:
> 
> __x86_return_thunk would look like this:
> 
> __x86_return_thunk:
>         testl   $0xf, PER_CPU_VAR(call_depth)
>         jnz     1f      
>         stuff_rsb
>    1:
>         decl    PER_CPU_VAR(call_depth)
>         ret
> 
> The call_depth variable would be reset on context switch.

Note that the 'jnz' can be predicted taken there, allowing the CPU to
speculate all the way to the 'ret'... and beyond.
David Woodhouse Feb. 6, 2018, 9:14 a.m. UTC | #11
On Sun, 2018-02-04 at 19:43 +0100, Thomas Gleixner wrote:
> Yet another possibility is to avoid the function entry and accounting magic
> and use the generic gcc return thunk:
> 
> __x86_return_thunk:
>         call L2
> L1:
>         pause
>         lfence
>         jmp L1
> L2:
>         lea 8(%rsp), %rsp|lea 4(%esp), %esp
>         ret
> 
> which basically refills the RSB on every return. That can be inline or
> extern, but in both cases we should be able to patch it out.
> 
> I have no idea how that affects performance, but it might be worthwhile to
> experiment with that.

That was what I had in mind when I asked HJ to add -mfunction-return.

I suspect the performance hit would be significant because it would
cause a prediction miss on *every* return.

But as I said, let's implement what we can without IBRS for Skylake,
then we can compare the two options for performance, security coverage
and general fugliness.

Patch

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 423e4b64e683..df471538a79c 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -133,6 +133,8 @@  config X86
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_DYNAMIC_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS
+	select DYNAMIC_FTRACE
+	select DYNAMIC_FTRACE_WITH_REGS
 	select HAVE_EBPF_JIT			if X86_64
 	select HAVE_EFFICIENT_UNALIGNED_ACCESS
 	select HAVE_EXIT_THREAD
@@ -140,6 +142,7 @@  config X86
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FUNCTION_TRACER
+	select FUNCTION_TRACER
 	select HAVE_GCC_PLUGINS
 	select HAVE_HW_BREAKPOINT
 	select HAVE_IDE
diff --git a/arch/x86/kernel/ftrace_64.S b/arch/x86/kernel/ftrace_64.S
index 7cb8ba08beb9..1e219e0f2887 100644
--- a/arch/x86/kernel/ftrace_64.S
+++ b/arch/x86/kernel/ftrace_64.S
@@ -19,6 +19,7 @@  EXPORT_SYMBOL(__fentry__)
 # define function_hook	mcount
 EXPORT_SYMBOL(mcount)
 #endif
+	ret
 
 /* All cases save the original rbp (8 bytes) */
 #ifdef CONFIG_FRAME_POINTER