[3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

Message ID 20190427100639.15074-4-nstange@suse.de (mailing list archive)
State New
Series x86/ftrace: make ftrace_int3_handler() not to skip fops invocation

Commit Message

Nicolai Stange April 27, 2019, 10:06 a.m. UTC
With dynamic ftrace, ftrace patches call sites on x86 in three steps:
1. put a breakpoint at the location to be patched,
2. update the call site and
3. finally remove the breakpoint again.

Note that the breakpoint handler, ftrace_int3_handler(), simply makes
execution skip over the instruction to be patched.

This patching happens in the following circumstances:
a.) the global ftrace_trace_function changes and the call sites at
    ftrace_call and ftrace_regs_call get patched,
b.) an ftrace_ops' ->func changes and the call site in its trampoline gets
    patched,
c.) graph tracing gets enabled/disabled and the jump site at
    ftrace_graph_call gets patched,
d.) a mcount site gets converted from nop -> call, call -> nop, or
    call -> call.

The latter case, i.e. an mcount call getting redirected, is possible e.g. in
a transition from trampolined to not trampolined upon a user enabling
function tracing on a live patched function.

The ftrace_int3_handler() simply skipping over the updated insn is quite
problematic in the context of live patching, because it means that a live
patched function gets temporarily reverted to its unpatched original and
this breaks the live patching consistency model. But even without live
patching, it is desirable to avoid missing traces when making changes to
the tracing setup.

Make ftrace_int3_handler() not skip over the fops invocation, but instead
modify the interrupted control flow to issue a call as needed.

Case c.) from the list above can be ignored, because a jmp instruction gets
changed to a nop or vice versa.

The remaining cases a.), b.) and d.) all involve call instructions. For a.)
and b.), the call always goes to some ftrace_func_t and the generic
ftrace_ops_list_func() implementation will be able to demultiplex and do the
right thing. For case d.), the call target is either of ftrace_caller(),
ftrace_regs_caller() or some ftrace_ops' trampoline. Because providing the
register state won't cause any harm for !FTRACE_OPS_FL_SAVE_REGS
ftrace_ops, ftrace_regs_caller() would be a suitable target capable of
handling any case.

ftrace_int3_handler()'s context is obviously different from that of the
interrupted call instruction. In order to be able to emulate the call
within the original context, make ftrace_int3_handler() set its iret
frame's ->ip to some helper stub. Upon return from the trap, this stub will
then mimic the call by pushing the return address onto the stack and
issuing a jmp to the target address. As described above, the jmp target
will be either ftrace_ops_list_func() or ftrace_regs_caller(). Provide
one such stub implementation for each of the two cases.

Finally, the desired return address, which is derived from the trapping
->ip, must get passed from ftrace_int3_handler() to these stubs. Make
ftrace_int3_handler() push it onto the ftrace_int3_stack introduced by an
earlier patch and let the stubs consume it. Be careful to use proper
compiler barriers such that nested int3 handling from e.g. irqs won't
clobber entries owned by outer instances.
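
For illustration only, here is a rough sketch of what one such stub could
look like, written in the inline-asm style used elsewhere in this thread.
The per-CPU slot ftrace_int3_ip and the stub name are invented for this
sketch; the actual patch pops the value off the nesting-safe
ftrace_int3_stack rather than a single per-CPU slot:

asm(
	".text\n"
	".global ftrace_int3_stub_regs_caller\n"
	"ftrace_int3_stub_regs_caller:\n\t"
		/* mimic the patched call: push the emulated return address... */
		"push %gs:ftrace_int3_ip\n\t"
		/* ...and continue at the would-be call target */
		"jmp ftrace_regs_caller\n"
	".previous\n");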

Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Nicolai Stange <nstange@suse.de>
---
 arch/x86/kernel/Makefile            |  1 +
 arch/x86/kernel/ftrace.c            | 79 +++++++++++++++++++++++++++++++------
 arch/x86/kernel/ftrace_int3_stubs.S | 61 ++++++++++++++++++++++++++++
 3 files changed, 130 insertions(+), 11 deletions(-)
 create mode 100644 arch/x86/kernel/ftrace_int3_stubs.S

Comments

Peter Zijlstra April 27, 2019, 10:26 a.m. UTC | #1
On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
> ftrace_int3_handler()'s context is different from the interrupted call
> instruction's one, obviously. In order to be able to emulate the call
> within the original context, make ftrace_int3_handler() set its iret
> frame's ->ip to some helper stub. Upon return from the trap, this stub will
> then mimic the call by pushing the the return address onto the stack and
> issuing a jmp to the target address. As describe above, the jmp target
> will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide
> one such stub implementation for each of the two cases.

Yuck; I'd much rather we get that static_call() stuff sorted such that
text_poke() and poke_int3_handler() can do CALL emulation.

Given all the back and forth, I think the solution where we shift
pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I
think we collectively hated it least.
Steven Rostedt April 28, 2019, 5:38 p.m. UTC | #2
[ Added Linus ]

On Sat, 27 Apr 2019 12:26:57 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Apr 27, 2019 at 12:06:38PM +0200, Nicolai Stange wrote:
> > ftrace_int3_handler()'s context is different from the interrupted call
> > instruction's one, obviously. In order to be able to emulate the call
> > within the original context, make ftrace_int3_handler() set its iret
> > frame's ->ip to some helper stub. Upon return from the trap, this stub will
> > then mimic the call by pushing the the return address onto the stack and
> > issuing a jmp to the target address. As describe above, the jmp target
> > will be either of ftrace_ops_list_func() or ftrace_regs_caller(). Provide
> > one such stub implementation for each of the two cases.  
> 
> Yuck; I'd much rather we get that static_call() stuff sorted such that
> text_poke() and poke_int3_handler() can do CALL emulation.
> 
> Given all the back and forth, I think the solution where we shift
> pt_regs a bit to allow the emulated PUSH is a viable solution; eg. I
> think we collectively hated it least.

Note, this use case is slightly different from the static calls but has
the same issue.  That issue is that we need a way to simulate a "call"
function from int3, and there's no good way to handle that case right
now.

The static call needs to modify the call site but make the call still
work from int3 context.

This use case is similar, but the issue is with a "bug" in the function
tracing implementation, that has become an issue with live kernel
patching which is built on top of it.

The issue is this:

For optimization reasons, if there's only a single user of a function
it gets its own trampoline that sets up the call to its callback and
calls that callback directly:


<function being traced>
  call custom trampoline


<custom trampoline>
   save regs to call C code
   call custom_callback
   restore regs
   ret

If more than one callback is attached to that function, then we need to
call the iterator that loops through all registered callbacks of the
function tracer and checks their hashes to see if they want to trace
this function or not.

<function being traced>
  call iterator_trampoline

<iterator_trampoline>
  save regs to call C code
  call iterator
  restore regs
  ret

What happens in that transition? We add an "int3" at the function being
traced, and change:

  call custom_trampoline

to

  call iterator_trampoline

But if the function being traced is executed during this transition,
the int3 just makes it act like a nop and execution resumes past the call.
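
For reference, the skip being described is roughly what ftrace_int3_handler()
did at the time (paraphrased, not an exact quote of arch/x86/kernel/ftrace.c):

int ftrace_int3_handler(struct pt_regs *regs)
{
	unsigned long ip;

	if (WARN_ON_ONCE(!regs))
		return 0;

	ip = regs->ip - 1;	/* address of the int3 byte */
	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
		return 0;

	/* Skip over the rest of the patched call, i.e. behave like a nop */
	regs->ip += MCOUNT_INSN_SIZE - 1;

	return 1;
}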

For tracing this is a bug because we just missed a function that should
be traced. For live kernel patching, this could be more devastating
because the original "buggy" patched function is called, and this could
cause subtle problems.

If we can add a slot to the int3 handler that we can use to emulate a
call (perhaps add another slot to the pt_regs structure and call it a
link register), it would make it easier to solve both this and the
static call problems.
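
One way the handler side of that idea might look, assuming a hypothetical
new pt_regs member called "link" that the exit path would push just before
returning to the emulated call target (nothing like this exists today):

static int ftrace_int3_emulate_call(struct pt_regs *regs, unsigned long func)
{
	/* regs->ip points just past the int3; the call site is one byte back */
	regs->link = regs->ip - 1 + MCOUNT_INSN_SIZE;	/* emulated return address */
	regs->ip = func;				/* emulated call target */
	return 1;
}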

-- Steve
Linus Torvalds April 29, 2019, 6:06 p.m. UTC | #3
On Sun, Apr 28, 2019 at 10:38 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> For optimization reasons, if there's only a single user of a function
> it gets its own trampoline that sets up the call to its callback and
> calls that callback directly:

So this is the same issue as the static calls, and it has exactly the
same solution.

Which I already outlined once, and nobody wrote the code for.

So here's a COMPLETELY UNTESTED patch that only works (_if_ it works) for

 (a) 64-bit

 (b) SMP

but that's just because I've hardcoded the percpu segment handling.

It does *not* emulate the "call" in the BP handler itself; instead it
replaces the %ip (the same way all the other BP handlers replace the
%ip) with a code sequence that just does

        push %gs:bp_call_return
        jmp *%gs:bp_call_target

after having filled in those per-cpu things.

The reason they are percpu is that after the %ip has been changed, the
target CPU goes its merry way, and doesn't wait for the text-poke
semaphore serialization. But since we have interrupts disabled on that
CPU, we know that *another* text poke won't be coming around and
changing the values.

THIS IS ENTIRELY UNTESTED! I've built it, and it at least seems to
build, although with warnings

  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqoff()+0x9: indirect jump found in RETPOLINE build
  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqon()+0x8: indirect jump found in RETPOLINE build
  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqoff()+0x9: sibling call from callable instruction with modified stack frame
  arch/x86/kernel/alternative.o: warning: objtool: emulate_call_irqon()+0x8: sibling call from callable instruction with modified stack frame

that will need the appropriate "ignore this case" annotations that I didn't do.

Do I expect it to work? No. I'm sure there's some silly mistake here,
but the point of the patch is to show it as an example, so that it can
actually be tested.

With this, it should be possible (under the text rewriting lock) to do

        replace_call(callsite, newcallopcode, callsize, calltarget);

to do the static rewriting of the call at "callsite" to have the new
call target.

And again. Untested. But doesn't need any special code in the entry
path, and the concept is simple even if there are probably stupid bugs
just because it's entirely untested.

Oh, and did I mention that I didn't test this?

                Linus
arch/x86/kernel/alternative.c | 54 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 51 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9a79c7808f9c..92b59958cff3 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -739,7 +739,11 @@ static void do_sync_core(void *info)
 }
 
 static bool bp_patching_in_progress;
-static void *bp_int3_handler, *bp_int3_addr;
+static void *bp_int3_handler_irqoff, *bp_int3_handler_irqon, *bp_int3_addr;
+static void *bp_int3_call_target, *bp_int3_call_return;
+
+static DEFINE_PER_CPU(void *, bp_call_return);
+static DEFINE_PER_CPU(void *, bp_call_target);
 
 int poke_int3_handler(struct pt_regs *regs)
 {
@@ -762,7 +766,22 @@ int poke_int3_handler(struct pt_regs *regs)
 		return 0;
 
 	/* set up the specified breakpoint handler */
-	regs->ip = (unsigned long) bp_int3_handler;
+	regs->ip = (unsigned long) bp_int3_handler_irqon;
+
+	/*
+	 * If we want an irqoff irq3 handler, and interrupts were
+	 * on, we turn them off and use the special irqoff handler
+	 * instead.
+	 */
+	if (bp_int3_handler_irqoff) {
+		this_cpu_write(bp_call_target, bp_int3_call_target);
+		this_cpu_write(bp_call_return, bp_int3_call_return);
+
+		if (regs->flags & X86_EFLAGS_IF) {
+			regs->flags &= ~X86_EFLAGS_IF;
+			regs->ip = (unsigned long) bp_int3_handler_irqoff;
+		}
+	}
 
 	return 1;
 }
@@ -792,7 +811,7 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 {
 	unsigned char int3 = 0xcc;
 
-	bp_int3_handler = handler;
+	bp_int3_handler_irqon = handler;
 	bp_int3_addr = (u8 *)addr + sizeof(int3);
 	bp_patching_in_progress = true;
 
@@ -830,7 +849,36 @@ void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler)
 	 * the writing of the new instruction.
 	 */
 	bp_patching_in_progress = false;
+	bp_int3_handler_irqoff = NULL;
 
 	return addr;
 }
 
+extern asmlinkage void emulate_call_irqon(void);
+extern asmlinkage void emulate_call_irqoff(void);
+
+asm(
+	".text\n"
+	".global emulate_call_irqoff\n"
+	".type emulate_call_irqoff, @function\n"
+	"emulate_call_irqoff:\n\t"
+		"push %gs:bp_call_return\n\t"
+		"sti\n\t"
+		"jmp *%gs:bp_call_target\n"
+	".size emulate_call_irqoff, .-emulate_call_irqoff\n"
+
+	".global emulate_call_irqon\n"
+	".type emulate_call_irqon, @function\n"
+	"emulate_call_irqon:\n\t"
+		"push %gs:bp_call_return\n\t"
+		"jmp *%gs:bp_call_target\n"
+	".size emulate_call_irqon, .-emulate_call_irqon\n"
+	".previous\n");
+
+void replace_call(void *addr, const void *opcode, size_t len, void *target)
+{
+	bp_int3_call_target = target;
+	bp_int3_call_return = addr + len;
+	bp_int3_handler_irqoff = emulate_call_irqoff;
+	text_poke_bp(addr, opcode, len, emulate_call_irqon);
+}
Linus Torvalds April 29, 2019, 6:22 p.m. UTC | #4
On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> It does *not* emulate the "call" in the BP handler itself, instead if
> replace the %ip (the same way all the other BP handlers replace the
> %ip) with a code sequence that just does
>
>         push %gs:bp_call_return
>         jmp *%gs:bp_call_target
>
> after having filled in those per-cpu things.

Note that if you read the patch, you'll see that my explanation
glossed over the "what if an interrupt happens" part. Which is handled
by having two handlers, one for "interrupts were already disabled" and
one for "interrupts were enabled, so I disabled them before entering
the handler".

The second handler does the same push/jmp sequence, but has a "sti"
before the jmp. Because of the one-instruction sti shadow, interrupts
won't actually be enabled until after the jmp instruction has
completed, and thus the "push/jmp" is atomic wrt regular interrupts.

It's not safe wrt NMI, of course, but since NMI won't be rescheduling,
and since any SMP IPI won't be punching through that sequence anyway,
it's still atomic wrt _another_ text_poke() attempt coming in and
re-using the bp_call_return/target slots.

So yeah, it's not "one-liner" trivial, but it's not like it's
complicated either, and it actually matches the existing "call this
code to emulate the replaced instruction". So I'd much rather have a
couple of tens of lines of code here that still acts pretty much
exactly like all the other rewriting does, rather than play subtle
games with the entry stack frame.

Finally: there might be other situations where you want to have this
kind of "pseudo-atomic" replacement sequence, so I think while it's a
hack specific to emulating a "call" instruction, I don't think it is
conceptually limited to just that case.

                 Linus
Andy Lutomirski April 29, 2019, 6:42 p.m. UTC | #5
On Mon, Apr 29, 2019 at 11:29 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, Apr 29, 2019 at 11:06 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> >
> > It does *not* emulate the "call" in the BP handler itself, instead if
> > replace the %ip (the same way all the other BP handlers replace the
> > %ip) with a code sequence that just does
> >
> >         push %gs:bp_call_return
> >         jmp *%gs:bp_call_target
> >
> > after having filled in those per-cpu things.
>
> Note that if you read the patch, you'll see that my explanation
> glossed over the "what if an interrupt happens" part. Which is handled
> by having two handlers, one for "interrupts were already disabled" and
> one for "interrupts were enabled, so I disabled them before entering
> the handler".

This is quite a cute solution.

>
> The second handler does the same push/jmp sequence, but has a "sti"
> before the jmp. Because of the one-instruction sti shadow, interrupts
> won't actually be enabled until after the jmp instruction has
> completed, and thus the "push/jmp" is atomic wrt regular interrupts.
>
> It's not safe wrt NMI, of course, but since NMI won't be rescheduling,
> and since any SMP IPI won't be punching through that sequence anyway,
> it's still atomic wrt _another_ text_poke() attempt coming in and
> re-using the bp_call_return/tyarget slots.

I'm less than 100% convinced about this argument.  Sure, an NMI right
there won't cause a problem.  But an NMI followed by an interrupt will
kill us if preemption is on.  I can think of three solutions:

1. Assume that all CPUs (and all relevant hypervisors!) either mask
NMIs in the STI shadow or return from NMIs with interrupts masked for
one instruction.  Ditto for MCE.  This seems too optimistic.

2. Put a fixup in the NMI handler and MCE handler: if the return
address is one of these magic jumps, clear IF and back up IP by one
byte.  This should *work*, but it's a bit ugly.

3. Use a different magic sequence:

push %gs:bp_call_return
int3

and have the int3 handler adjust regs->ip and regs->flags as appropriate.

I think I like #3 the best, even though it's twice as slow.
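
A loose sketch of what the #3 fixup in poke_int3_handler() might look like,
reusing Linus's per-CPU bp_call_target slot; bp_call_int3_addr (the address
of the int3 following the push) and bp_call_restore_if are invented for this
sketch:

	if (regs->ip == (unsigned long)bp_call_int3_addr + 1) {
		/* the push already happened; go to the real call target */
		regs->ip = (unsigned long)this_cpu_read(bp_call_target);
		/* and re-enable interrupts if the irqoff path disabled them */
		if (this_cpu_read(bp_call_restore_if))
			regs->flags |= X86_EFLAGS_IF;
		return 1;
	}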

FWIW, kernel shadow stack patches will show up eventually, and none of
these approaches are compatible as is.  Perhaps the actual sequence
should be this, instead:

bp_int3_fixup_irqsoff:
  call 1f
1:
  int3

bp_int3_fixup_irqson:
  call 1f
1:
  int3

and the int3 handler will update the conventional return address *and*
the shadow return address.  Linus, what do you think about this
variant?

Finally, with my maintainer hat on: if anyone actually wants to merge
this thing, I want to see a test case, exercised regularly (every boot
if configured, perhaps) that actually *runs* this code.  Hitting it in
practice will be rare, and I want bugs to be caught.
Steven Rostedt April 29, 2019, 6:52 p.m. UTC | #6
On Mon, 29 Apr 2019 11:06:58 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> +{
> +	bp_int3_call_target = target;
> +	bp_int3_call_return = addr + len;
> +	bp_int3_handler_irqoff = emulate_call_irqoff;
> +	text_poke_bp(addr, opcode, len, emulate_call_irqon);
> +}

Note, the function tracer does not use text poke. It does it in batch
mode. It can update over 40,000 calls in one go:

	add int3 breakpoint to all 40,000 call sites.
	sync()
	change the last four bytes of each of those call sites
	sync()
	remove int3 from the 40,000 call site with new call.

It's a bit more intrusive than the static call updates we were
discussing before.
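
A hedged sketch of that batch sequence in C; the helper names (add_int3,
update_call_tail, remove_int3, run_sync) are made up for illustration rather
than taken from the real ftrace code:

static void batch_update_call_sites(struct dyn_ftrace *recs, int count)
{
	int i;

	for (i = 0; i < count; i++)		/* 1) int3 at every call site */
		add_int3(recs[i].ip);
	run_sync();				/* sync_core() on all CPUs */

	for (i = 0; i < count; i++)		/* 2) rewrite the call target */
		update_call_tail(recs[i].ip);	/*    (the last four bytes)   */
	run_sync();

	for (i = 0; i < count; i++)		/* 3) replace int3 with the new opcode */
		remove_int3(recs[i].ip);
	run_sync();
}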

-- Steve
Andy Lutomirski April 29, 2019, 6:56 p.m. UTC | #7
On Mon, Apr 29, 2019 at 11:53 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
>
> On Mon, Apr 29, 2019, 11:42 Andy Lutomirski <luto@kernel.org> wrote:
>>
>>
>> I'm less than 100% convinced about this argument.  Sure, an NMI right
>> there won't cause a problem.  But an NMI followed by an interrupt will
>> kill us if preemption is on.  I can think of three solutions:
>
>
> No, because either the sti shadow disables nmi too (that's the case on some CPUs at least) or the iret from nmi does.
>
> Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture.
>

Is this documented somewhere?  And do you actually believe that this
is true under KVM, Hyper-V, etc?  As I recall, Andrew Cooper dug into
the way that VMX dealt with this stuff and concluded that the SDM was
blatantly wrong in many cases, which leads me to believe that Xen
HVM/PVH is the *only* hypervisor that gets it right.

Steven's point about batched updates is quite valid, though.  My
personal favorite solution to this whole mess is to rework the whole
thing so that the int3 handler simply returns and retries and to
replace the sync_core() broadcast with an SMI broadcast.  I don't know
whether this will actually work on real CPUs and on VMs and whether
it's going to crash various BIOSes out there.
Steven Rostedt April 29, 2019, 7:07 p.m. UTC | #8
On Mon, 29 Apr 2019 11:59:04 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I really don't care. Just do what I suggested, and if you have numbers to
> show problems, then maybe I'll care.
> 

Are you suggesting that I rewrite the code to do it one function at a
time? This has always been batch mode. This is not something new. The
function tracer has been around longer than the text poke code.

> Right now you're just making excuses for this. I described the solution
> months ago, now I've written a patch, if that's not good enough then we can
> just skip this all entirely.
> 
> Honestly, if you need to rewrite tens of thousands of calls, maybe you're
> doing something wrong?
> 

 # cd /sys/kernel/debug/tracing
 # cat available_filter_functions | wc -l
 45856
 # cat enabled_functions | wc -l
 0
 # echo function > current_tracer
 # cat enabled_functions | wc -l
 45856

There, I just enabled 45,856 function call sites in one shot!

How else do you want to update them? Every function in the kernel has a
nop that turns into a call to the ftrace handler; if I add another
user of that code, it will change each one as well.

-- Steve
Andy Lutomirski April 29, 2019, 7:24 p.m. UTC | #9
On Mon, Apr 29, 2019 at 12:13 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
>
> On Mon, Apr 29, 2019, 12:02 Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>>
>>
>> If nmi were to break it, it would be a cpu bug.
>
>
> Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
>

Where?  STI; IRET would be nuts.

Before:

commit 4214a16b02971c60960afd675d03544e109e0d75
Author: Andy Lutomirski <luto@kernel.org>
Date:   Thu Apr 2 17:12:12 2015 -0700

    x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER

we did sti; sysexit, but, when we discussed this, I don't recall anyone
speaking up in favor of the safety of the old code.

Not to mention that the crash we'll get if we get an NMI and a
rescheduling interrupt in this path will be very, very hard to debug.
Linus Torvalds April 29, 2019, 8:06 p.m. UTC | #10
On Mon, Apr 29, 2019 at 12:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Are you suggesting that I rewrite the code to do it one function at a
> time? This has always been batch mode. This is not something new. The
> function tracer has been around longer than the text poke code.

Only do the 'call' instructions one at a time. Why would you change
_existing_ code?

                 Linus
Linus Torvalds April 29, 2019, 8:07 p.m. UTC | #11
On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski <luto@kernel.org> wrote:
> > Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
>
> Where?  STI; IRET would be nuts.

Sorry, not 'sti;iret' but 'sti;sysexit'

> before commit 4214a16b02971c60960afd675d03544e109e0d75
>     x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
>
> we did sti; sysxit, but, when we discussed this, I don't recall anyone
> speaking up in favor of the safely of the old code.

We still have that sti sysexit in the 32-bit code.

                     Linus
Linus Torvalds April 29, 2019, 8:16 p.m. UTC | #12
On Mon, Apr 29, 2019 at 12:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If nmi were to break it, it would be a cpu bug. I'm pretty sure I've
> seen the "shadow stops even nmi" documented for some uarch, but as
> mentioned it's not necessarily the only way to guarantee the shadow.

In fact, the documentation is simply the official Intel instruction
docs for "STI":

    The IF flag and the STI and CLI instructions do not prohibit the
    generation of exceptions and NMI interrupts. NMI interrupts (and
    SMIs) may be blocked for one macroinstruction following an STI.

note the "may be blocked". As mentioned, that's just one option for
not having NMI break the STI shadow guarantee, but it's clearly one
that Intel has done at times, and clearly even documents as having
done so.

There is absolutely no question that the sti shadow is real, and that
people have depended on it for _decades_. It would be a horrible
errata if the shadow can just be made to go away by randomly getting
an NMI or SMI.

              Linus
Linus Torvalds April 29, 2019, 8:20 p.m. UTC | #13
On Mon, Apr 29, 2019 at 1:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Only do the 'call' instructions one at a time. Why would you change
> _existing_ code?

Side note: if you want to, you can easily batch up rewriting 'call'
instructions to the same target using the exact same code. You just
need to change the int3 handler case to calculate the
bp_int3_call_return from the fixed one-time address to use sopmething
like

     this_cpu_write(bp_call_return, int3_address-1+bp_int3_call_size);

instead (and you'd need to also teach the function that there's not
just a single int3 live at a time)
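
Concretely, that tweak to the poke_int3_handler() hunk from the earlier
patch might read something like this, with bp_int3_call_size assumed to hold
the length of the patched call (5 bytes for a near call):

	if (bp_int3_handler_irqoff) {
		/* regs->ip points just past the int3 at the call site */
		this_cpu_write(bp_call_return,
			       (void *)(regs->ip - 1 + bp_int3_call_size));
		this_cpu_write(bp_call_target, bp_int3_call_target);

		if (regs->flags & X86_EFLAGS_IF) {
			regs->flags &= ~X86_EFLAGS_IF;
			regs->ip = (unsigned long)bp_int3_handler_irqoff;
		}
	}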

                Linus
Steven Rostedt April 29, 2019, 8:30 p.m. UTC | #14
On Mon, 29 Apr 2019 13:06:17 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Apr 29, 2019 at 12:07 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Are you suggesting that I rewrite the code to do it one function at a
> > time? This has always been batch mode. This is not something new. The
> > function tracer has been around longer than the text poke code.  
> 
> Only do the 'call' instructions one at a time. Why would you change
> _existing_ code?

The function tracing is a call instruction.

On boot:

<function_X>:
	nop
	blah
	blah

After a callback to function tracing is called:

<function_X>
	call custom_trampoline
	blah
	blah


If we have two functions to that function added:

<function_X>
	call iterator_trampoline
	blah
	blah

The update from "call custom_trampoline" to "call iterator_trampoline"
is where we have an issue.

We could make this a special case where we do this one at a time, but
currently the code all works the same way, looking at tables to determine
what to do. Which is one of three things:

 1) change nop to call function
 2) change call function to nop
 3) update call function to another call function

#3 is where we have an issue. But if we want this to be different, we
would need to change the code significantly, and know that we are only
updating calls to calls. Which would take a bit of accounting to see if
that's the change that is being made.

This thread started about that #3 operation causing a call to be missed
because we turn it into a nop while we make the transition, where in
reality it needs to be a call to one of the two functions in the
transition.

-- Steve
Linus Torvalds April 29, 2019, 9:38 p.m. UTC | #15
On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The update from "call custom_trampoline" to "call iterator_trampoline"
> is where we have an issue.

So it has never worked. Just tell people that they have two choices:

 - you do the careful rewriting, which takes more time

 - you do it by rewriting as nop and then back, which is what
historically has been done, and that is fast and simple, because
there's no need to be careful.

Really. I find your complaints completely incomprehensible. You've
never rewritten call instructions atomically before, and now you
complain about it being slightly more expensive to do it when I give
you the code? Yes it damn well will be slightly more expensive. Deal
with it.

Btw, once again - I several months ago also gave a suggestion on how
it could be done batch-mode by having lots of those small stubs and
just generating them dynamically.

You never wrote that code *EITHER*. It's been *months*.

So now I've written the non-batch-mode code for you, and you just
*complain* about it?

I'm done with this discussion. I'm totally fed up.

                 Linus
Linus Torvalds April 29, 2019, 10:06 p.m. UTC | #16
On Mon, Apr 29, 2019 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture.
>
> Is this documented somewhere?

Btw, if you really don't trust the sti shadow despite it going all the
way back to the 8086, then you could instead make the irqoff code do

        push %gs:bp_call_return
        push %gs:bp_call_target
        sti
        ret

which just keeps interrupts explicitly disabled over the whole use of
the percpu data.

The actual "ret" instruction doesn't matter, it's not going to change
in this model (where the code isn't dynamically generated or changed).
So I claim that it will still be protected by the sti shadow, but when
written that way it doesn't actually matter, and you could reschedule
immediately after the sti (getting an interrupt there might make the
stack frame look odd, but it doesn't really affect anything else)

                  Linus
Steven Rostedt April 29, 2019, 10:07 p.m. UTC | #17
On Mon, 29 Apr 2019 14:38:35 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > The update from "call custom_trampoline" to "call iterator_trampoline"
> > is where we have an issue.  
> 
> So it has never worked. Just tell people that they have two chocies:

The custom call to iterator translation never worked. One option is to
always call the iterator, and just take the hit. There's another
solution to make permanent updates that would handle the live
patching case, but not for the tracing case. It is more critical for
live patching than tracing (missed traced function is not as critical
as running buggy code unexpectedly). I could look to work on that
instead.

> 
>  - you do the careful rewriting, which takes more time

Which would be the method I said making the call to call a special case.

> 
>  - you do it by rewriting as nop and then back, which is what
> historically has been done, and that is fast and simple, because
> there's no need to be careful.

Except in the live kernel patching case where you just put back the
buggy function.

> 
> Really. I find your complaints completely incomprehensible. You've
> never rewritten call instructions atomically before, and now you
> complain about it being slightly more expensive to do it when I give
> you the code? Yes it damn well will be slightly more expensive. Deal
> with it.

I wasn't complaining about the expense of doing it that way. I was
stating that it would take more time to get it right, as it would
require a different implementation for the call-to-call case.


> 
> Btw, once again - I several months ago also gave a suggestion on how
> it could be done batch-mode by having lots of those small stubs and
> just generating them dynamically.
> 
> You never wrote that code *EITHER*. It's been *months*.

Note, I wasn't the one writing the code back then either. I posted a
proof of concept for a similar topic (trace events) for the purpose of
bringing back the performance lost due to the retpolines (something
like 8% hit). Josh took the initiative to do that work, but I don't
remember ever getting to a consensus on a solution to that. Yes you had
given us ideas, but I remember people (like Andy) having concerns with
it. But because it was just an optimization change, and Josh had other
things to work on, that work just stalled.

But I just found out that the function tracing code has the same issue,
but this can cause real problems with live kernel patching. This is why
this patch set started.

> 
> So now I've written the non-batch-mode code for you, and you just
> *complain* about it?

Because this is a different story. The trace event code is updated one
at a time (there's patches out there to do it in batch). But this is
not trace events. This is function tracing which only does batching.


> 
> I'm done with this discussion. I'm totally fed up.
> 

Note, I wasn't writing this code anyway as I didn't have the time to. I
was helping others who took the initiative to do this work. But I guess
they didn't have time to move it forward either.

For the live kernel patching case, I'll start working on the permanent
changes, where once a function entry is changed, it can never be put
back. Then we won't have an issue with the live kernel patching case,
but only for tracing. But the worst thing there is that you miss tracing
functions while an update is being made.

If Nicolai has time, perhaps he can try out the method you suggest and
see if we can move this forward.

-- Steve
Sean Christopherson April 29, 2019, 10:08 p.m. UTC | #18
On Mon, Apr 29, 2019 at 01:16:10PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 12:02 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > If nmi were to break it, it would be a cpu bug. I'm pretty sure I've
> > seen the "shadow stops even nmi" documented for some uarch, but as
> > mentioned it's not necessarily the only way to guarantee the shadow.
> 
> In fact, the documentation is simply the official Intel instruction
> docs for "STI":
> 
>     The IF flag and the STI and CLI instructions do not prohibit the
>     generation of exceptions and NMI interrupts. NMI interrupts (and
>     SMIs) may be blocked for one macroinstruction following an STI.
> 
> note the "may be blocked". As mentioned, that's just one option for
> not having NMI break the STI shadow guarantee, but it's clearly one
> that Intel has done at times, and clearly even documents as having
> done so.
> 
> There is absolutely no question that the sti shadow is real, and that
> people have depended on it for _decades_. It would be a horrible
> errata if the shadow can just be made to go away by randomly getting
> an NMI or SMI.

FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
not sure that counters the "horrible errata" statement ;-).  SMI+RSM saves
and restores STI blocking in that case, but AFAICT NMI has no such
protection and will effectively break the shadow on its IRET.

All other (modern) Intel uArchs block NMI in the shadow and also enforce
STI_BLOCKING==0 when injecting an NMI via VM-Enter, i.e. prevent a VMM
from breaking the shadow so long as the VMM preserves the shadow info.

KVM is generally ok with respect to STI blocking, but ancient versions
didn't migrate STI blocking and there's currently a hole where
single-stepping a guest (from host userspace) could drop STI_BLOCKING
if a different VM-Exit occurs between the single-step #DB VM-Exit and the
instruction in the shadow.  Though "don't do that" may be a reasonable
answer in that case.
Linus Torvalds April 29, 2019, 10:22 p.m. UTC | #19
On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
> not sure that counters the "horrible errata" statement ;-).  SMI+RSM saves
> and restores STI blocking in that case, but AFAICT NMI has no such
> protection and will effectively break the shadow on its IRET.

Ugh. I can't say I care deeply about Quark (ie never seemed to go
anywhere), but it's odd. I thought it was based on a Pentium core (or
i486+?). Are you saying those didn't do it either?

I have this dim memory about talking about this with some (AMD?)
engineer, and having an alternative approach for the sti shadow wrt
NMI - basically not checking interrupts in the instruction you return
to with 'iret'. I don't think it was even conditional on the "iret
from NMI", I think it was basically any iret also did the sti shadow
thing.

But I can find no actual paper to back that up, so this may be me just
making sh*t up.

> KVM is generally ok with respect to STI blocking, but ancient versions
> didn't migrate STI blocking and there's currently a hole where
> single-stepping a guest (from host userspace) could drop STI_BLOCKING
> if a different VM-Exit occurs between the single-step #DB VM-Exit and the
> instruction in the shadow.  Though "don't do that" may be a reasonable
> answer in that case.

I thought the sti shadow blocked the single-step exception too? I know
"mov->ss" does block debug interrupts too.

Or are you saying that it's some "single step by emulation" that just
miss setting the STI_BLOCKING flag?

                   Linus
Sean Christopherson April 30, 2019, 12:08 a.m. UTC | #20
On Mon, Apr 29, 2019 at 03:22:09PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
> > not sure that counters the "horrible errata" statement ;-).  SMI+RSM saves
> > and restores STI blocking in that case, but AFAICT NMI has no such
> > protection and will effectively break the shadow on its IRET.
> 
> Ugh. I can't say I care deeply about Quark (ie never seemed to go
> anywhere), but it's odd. I thought it was based on a Pentium core (or
> i486+?). Are you saying those didn't do it either?

It's 486 based, but either way I suspect the answer is "yes".  IIRC,
Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
was based on P54C, though I'm struggling to recall exactly what the
Larrabee weirdness was.

> I have this dim memory about talking about this with some (AMD?)
> engineer, and having an alternative approach for the sti shadow wrt
> NMI - basically not checking interrupts in the instruction you return
> to with 'iret'. I don't think it was even conditional on the "iret
> from NMI", I think it was basically any iret also did the sti shadow
> thing.
> 
> But I can find no actual paper to back that up, so this may be me just
> making sh*t up.

If Intel CPUs ever did anything like that on IRET it's long gone.

> > KVM is generally ok with respect to STI blocking, but ancient versions
> > didn't migrate STI blocking and there's currently a hole where
> > single-stepping a guest (from host userspace) could drop STI_BLOCKING
> > if a different VM-Exit occurs between the single-step #DB VM-Exit and the
> > instruction in the shadow.  Though "don't do that" may be a reasonable
> > answer in that case.
> 
> I thought the sti shadow blocked the single-step exception too? I know
> "mov->ss" does block debug interrupts too.

{MOV,POP}SS blocks #DBs, STI does not.

> Or are you saying that it's some "single step by emulation" that just
> miss setting the STI_BLOCKING flag?

This is the case I was talking about for KVM.  KVM supports single-stepping
the guest from userspace and uses EFLAGS.TF to do so (since it works on both
Intel and AMD).  VMX has a consistency check that fails VM-Entry if
STI_BLOCKING=1, EFLAGS.TF==1, IA32_DEBUGCTL.BTF=0 and there isn't a pending
single-step #DB, and so KVM clears STI_BLOCKING immediately before entering
the guest when single-stepping the guest.  If a VM-Exit occurs immediately
after VM-Entry, e.g. due to hardware interrupt, then KVM will see
STI_BLOCKING=0 when processing guest events in its run loop and will inject
any pending interrupts.

I *think* the KVM behavior can be fixed, e.g. I'm not entirely sure why KVM
takes this approach instead of setting PENDING_DBG.BS, but that's probably
a moot point.
Sean Christopherson April 30, 2019, 12:45 a.m. UTC | #21
On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote:
> On Mon, Apr 29, 2019 at 03:22:09PM -0700, Linus Torvalds wrote:
> > On Mon, Apr 29, 2019 at 3:08 PM Sean Christopherson
> > <sean.j.christopherson@intel.com> wrote:
> > >
> > > FWIW, Lakemont (Quark) doesn't block NMI/SMI in the STI shadow, but I'm
> > > not sure that counters the "horrible errata" statement ;-).  SMI+RSM saves
> > > and restores STI blocking in that case, but AFAICT NMI has no such
> > > protection and will effectively break the shadow on its IRET.
> > 
> > Ugh. I can't say I care deeply about Quark (ie never seemed to go
> > anywhere), but it's odd. I thought it was based on a Pentium core (or
> > i486+?). Are you saying those didn't do it either?
> 
> It's 486 based, but either way I suspect the answer is "yes".  IIRC,
> Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> was based on P54C, though I'm struggling to recall exactly what the
> Larrabee weirdness was.

Aha!  Found an ancient comment that explicitly states P5 does not block
NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
Linus Torvalds April 30, 2019, 2:26 a.m. UTC | #22
On Mon, Apr 29, 2019 at 5:45 PM Sean Christopherson
<sean.j.christopherson@intel.com> wrote:
>
> On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote:
> >
> > It's 486 based, but either way I suspect the answer is "yes".  IIRC,
> > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> > was based on P54C, though I'm struggling to recall exactly what the
> > Larrabee weirdness was.
>
> Aha!  Found an ancient comment that explicitly states P5 does not block
> NMI/SMI in the STI shadow, while P6 does block NMI/SMI.

Ok, so the STI shadow really wouldn't be reliable on those machines. Scary.

Of course, the good news is that hopefully nobody has them any more,
and if they do, they presumably don't use fancy NMI profiling etc, so
any actual NMI's are probably relegated purely to largely rare and
effectively fatal errors anyway (ie memory parity errors).

                     Linus
Nicolai Stange April 30, 2019, 9:24 a.m. UTC | #23
Steven Rostedt <rostedt@goodmis.org> writes:

> On Mon, 29 Apr 2019 14:38:35 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Mon, Apr 29, 2019 at 1:30 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>> >
>> > The update from "call custom_trampoline" to "call iterator_trampoline"
>> > is where we have an issue.  
>> 
>> So it has never worked. Just tell people that they have two chocies:
>
> The custom call to iterator translation never worked. One option is to
> always call the iterator, and just take the hit. There's another
> solution to to make permanent updates that would handle the live
> patching case, but not for the tracing case. It is more critical for
> live patching than tracing (missed traced function is not as critical
> as running buggy code unexpectedly). I could look to work on that
> instead.

Making the live patching case permanent would disable tracing of live
patched functions though?

For unconditionally calling the iterator: I don't have numbers, but
would expect that always searching through something like 50-70 live
patching ftrace_ops from some hot path will be noticeable.


> If Nicolai has time, perhaps he can try out the method you suggest and
> see if we can move this forward.

You mean making ftrace handle call->call transitions one by one in
non-batch mode, right? Sure, I can do that.

Another question though: is there anything that prevents us from calling
ftrace_ops_list_func() directly from ftrace_int3_handler()? We don't
have parent_ip, but if that is used for reporting only, maybe setting it
to some ftrace_is_in_int3_and_doesnt_now_the_parent() will work?
Graph tracing entries would still be lost, but well...

Thanks,

Nicolai
Peter Zijlstra April 30, 2019, 10:40 a.m. UTC | #24
On Mon, Apr 29, 2019 at 07:26:02PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 5:45 PM Sean Christopherson
> <sean.j.christopherson@intel.com> wrote:
> >
> > On Mon, Apr 29, 2019 at 05:08:46PM -0700, Sean Christopherson wrote:
> > >
> > > It's 486 based, but either way I suspect the answer is "yes".  IIRC,
> > > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> > > was based on P54C, though I'm struggling to recall exactly what the
> > > Larrabee weirdness was.
> >
> > Aha!  Found an ancient comment that explicitly states P5 does not block
> > NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
> 
> Ok, so the STI shadow really wouldn't be reliable on those machines. Scary.
> 
> Of course, the good news is that hopefully nobody has them any more,
> and if they do, they presumably don't use fancy NMI profiling etc, so
> any actual NMI's are probably relegated purely to largely rare and
> effectively fatal errors anyway (ie memory parity errors).

We do have KNC perf support, if that chip has 'issues'...

Outside of that, we only do perf from P6 onwards. With P4 support being
in dubious shape, because it's massively weird and 'nobody' still has
those machines.
Peter Zijlstra April 30, 2019, 10:46 a.m. UTC | #25
On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
> On Mon, 29 Apr 2019 11:06:58 -0700
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> > +{
> > +	bp_int3_call_target = target;
> > +	bp_int3_call_return = addr + len;
> > +	bp_int3_handler_irqoff = emulate_call_irqoff;
> > +	text_poke_bp(addr, opcode, len, emulate_call_irqon);
> > +}
> 
> Note, the function tracer does not use text poke. It does it in batch
> mode. It can update over 40,000 calls in one go:

Note that Daniel is adding batch stuff to text_poke(), because
jump_label/static_key stuffs also end up wanting to change more than one
site at a time and IPI spraying the machine for every single instance is
hurting.

So ideally ftrace would start to use the 'normal' code when that
happens.
Jiri Kosina April 30, 2019, 11:17 a.m. UTC | #26
On Mon, 29 Apr 2019, Linus Torvalds wrote:

> > > It's 486 based, but either way I suspect the answer is "yes".  IIRC,
> > > Knights Corner, a.k.a. Larrabee, also had funkiness around SMM and that
> > > was based on P54C, though I'm struggling to recall exactly what the
> > > Larrabee weirdness was.
> >
> > Aha!  Found an ancient comment that explicitly states P5 does not block
> > NMI/SMI in the STI shadow, while P6 does block NMI/SMI.
> 
> Ok, so the STI shadow really wouldn't be reliable on those machines. Scary.
> 
> Of course, the good news is that hopefully nobody has them any more, and 
> if they do, they presumably don't use fancy NMI profiling etc, so any 
> actual NMI's are probably relegated purely to largely rare and 
> effectively fatal errors anyway (ie memory parity errors).

FWIW, if that thing has local apic (I have no idea, I've never seen Quark 
myself), then NMIs are used to trigger all-cpu backtrace as well. Which 
still can be done in situations where the kernel is then expected to 
continue running undisrupted (softlockup, sysrq, hung task detector, ...).

Nothing to really worry about in the particular case of this HW perhaps, 
sure.
Peter Zijlstra April 30, 2019, 11:18 a.m. UTC | #27
On Mon, Apr 29, 2019 at 03:06:30PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 11:57 AM Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > Otherwise you could never trust the whole sti shadow thing - and it very much is part of the architecture.
> >
> > Is this documented somewhere?
> 
> Btw, if you really don't trust the sti shadow despite it going all the
> way back to the 8086, then you could instead make the irqoff code do
> 
>         push %gs:bp_call_return
>         push %gs:bp_call_target
>         sti
>         ret

This variant cures the RETPOLINE complaint, due to there not actually
being an indirect jump anymore. And it cures the sibling call complaint,
but trades it for "return with modified stack frame".

Something like so is clean:

+extern asmlinkage void emulate_call_irqon(void);
+extern asmlinkage void emulate_call_irqoff(void);
+
+asm(
+	".text\n"
+	".global emulate_call_irqoff\n"
+	".type emulate_call_irqoff, @function\n"
+	"emulate_call_irqoff:\n\t"
+		"push %gs:bp_call_return\n\t"
+		"push %gs:bp_call_target\n\t"
+		"sti\n\t"
+		"ret\n"
+	".size emulate_call_irqoff, .-emulate_call_irqoff\n"
+
+	".global emulate_call_irqon\n"
+	".type emulate_call_irqon, @function\n"
+	"emulate_call_irqon:\n\t"
+		"push %gs:bp_call_return\n\t"
+		"push %gs:bp_call_target\n\t"
+		"ret\n"
+	".size emulate_call_irqon, .-emulate_call_irqon\n"
+	".previous\n");
+
+STACK_FRAME_NON_STANDARD(emulate_call_irqoff);
+STACK_FRAME_NON_STANDARD(emulate_call_irqon);
Steven Rostedt April 30, 2019, 1:44 p.m. UTC | #28
On Tue, 30 Apr 2019 12:46:48 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
> > On Mon, 29 Apr 2019 11:06:58 -0700
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >   
> > > +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> > > +{
> > > +	bp_int3_call_target = target;
> > > +	bp_int3_call_return = addr + len;
> > > +	bp_int3_handler_irqoff = emulate_call_irqoff;
> > > +	text_poke_bp(addr, opcode, len, emulate_call_irqon);
> > > +}  
> > 
> > Note, the function tracer does not use text poke. It does it in batch
> > mode. It can update over 40,000 calls in one go:  
> 
> Note that Daniel is adding batch stuff to text_poke(), because
> jump_label/static_key stuffs also end up wanting to change more than one
> site at a time and IPI spraying the machine for every single instance is
> hurting.
> 
> So ideally ftrace would start to use the 'normal' code when that
> happens.

Sure, but that's a completely different issue. We would need to solve
the 'emulate' call for batch mode first.

-- Steve
Peter Zijlstra April 30, 2019, 1:56 p.m. UTC | #29
On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote:
> On Mon, Apr 29, 2019 at 12:24 PM Andy Lutomirski <luto@kernel.org> wrote:
> > > Side note: we *already* depend on sti shadow working in other parts of the kernel, namely sti->iret.
> >
> > Where?  STI; IRET would be nuts.
> 
> Sorry, not 'sti;iret' but 'sti;sysexit'
> 
> > before commit 4214a16b02971c60960afd675d03544e109e0d75
> >     x86/asm/entry/64/compat: Use SYSRETL to return from compat mode SYSENTER
> >
> > we did sti; sysxit, but, when we discussed this, I don't recall anyone
> > speaking up in favor of the safely of the old code.
> 
> We still have that sti sysexit in the 32-bit code.

We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI
shadow. Getting an NMI in between shouldn't hurt too much, but if that
in turn can lead to an actual interrupt happening, we're up some creek
without no paddle.

Most modern systems don't use either anymore though, as
mwait_idle_with_hints() relies on MWAIT ECX[0]=1 to allow MWAIT to wake
from pending interrupts.
Peter Zijlstra April 30, 2019, 2:20 p.m. UTC | #30
On Tue, Apr 30, 2019 at 09:44:45AM -0400, Steven Rostedt wrote:
> On Tue, 30 Apr 2019 12:46:48 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Mon, Apr 29, 2019 at 02:52:50PM -0400, Steven Rostedt wrote:
> > > On Mon, 29 Apr 2019 11:06:58 -0700
> > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >   
> > > > +void replace_call(void *addr, const void *opcode, size_t len, void *target)
> > > > +{
> > > > +	bp_int3_call_target = target;
> > > > +	bp_int3_call_return = addr + len;
> > > > +	bp_int3_handler_irqoff = emulate_call_irqoff;
> > > > +	text_poke_bp(addr, opcode, len, emulate_call_irqon);
> > > > +}  
> > > 
> > > Note, the function tracer does not use text poke. It does it in batch
> > > mode. It can update over 40,000 calls in one go:  
> > 
> > Note that Daniel is adding batch stuff to text_poke(), because
> > jump_label/static_key stuffs also end up wanting to change more than one
> > site at a time and IPI spraying the machine for every single instance is
> > hurting.
> > 
> > So ideally ftrace would start to use the 'normal' code when that
> > happens.
> 
> Sure, but that's a completely different issue. We would need to solve
> the 'emulate' call for batch mode first.

I don't see a problem there; when INT3 happens; you bsearch() the batch
array to find the one you hit, you frob that into the percpu variables
and do the thing. Or am I losing the plot again?
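
Something along these lines, perhaps; struct bp_batch_entry, bp_batch and
bp_batch_count are names invented for this sketch:

struct bp_batch_entry {
	void		*addr;		/* patched call site */
	void		*target;	/* new call target */
};

static int bp_batch_cmp(const void *key, const void *elt)
{
	const struct bp_batch_entry *e = elt;

	if (key < e->addr)
		return -1;
	return key > e->addr;
}

/* and in poke_int3_handler(), with ip being regs->ip - 1: */
struct bp_batch_entry *entry;

entry = bsearch((void *)ip, bp_batch, bp_batch_count,
		sizeof(*entry), bp_batch_cmp);
if (!entry)
	return 0;
this_cpu_write(bp_call_target, entry->target);
this_cpu_write(bp_call_return, entry->addr + MCOUNT_INSN_SIZE);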
Steven Rostedt April 30, 2019, 2:36 p.m. UTC | #31
On Tue, 30 Apr 2019 16:20:47 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > 
> > Sure, but that's a completely different issue. We would need to solve
> > the 'emulate' call for batch mode first.  
> 
> I don't see a problem there; when INT3 happens; you bsearch() the batch
> array to find the one you hit, you frob that into the percpu variables
> and do the thing. Or am I loosing the plot again?

I may need to tweak Linus's patch slightly, but I may be able to get it
to work with the batch mode.

-- Steve
Linus Torvalds April 30, 2019, 4:06 p.m. UTC | #32
On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Apr 29, 2019 at 01:07:33PM -0700, Linus Torvalds wrote:
> >
> > We still have that sti sysexit in the 32-bit code.
>
> We also have both: "STI; HLT" and "STI; MWAIT" where we rely on the STI
> shadow.

I guess the good news is that in all cases we really only ever protect
against a very unlikely race, and if the race happens it's not
actually fatal.

Yes, if we get an NMI and then an interrupt in between the "sti;hlt" we
might wait for the next interrupt and get a (potentially fairly
horrible) latency issue. I guess that with maximal luck it might be a
one-shot timer and not get re-armed, but it sounds very very very
unlikely.

Googling around, I actually find a patch from Avi Kivity from back in
2010 for this exact issue, apparently because kvm got this case wrong
and somebody hit it. The patch never made it upstream exactly because
kvm could be fixed and people decided that most real hardware didn't
have the issue in the first place.

In the discussion I found, Peter Anvin tried to get confirmation from
AMD engineers about this too, but I don't see any resolution.

Realistically, I don't think you can hit the problem in practice. The
only way to hit that incredibly small race of "one instruction, *both*
NMI and interrupts" is to have a lot of interrupts going all at the
same time, but that will also then solve the latency problem, so the
very act of triggering it will also fix it.

I don't see any case where it's really bad. The "sti sysexit" race is
similar, just about latency of user space signal reporting (and
perhaps any pending TIF_WORK_xyz flags).

So maybe we don't care deeply about the sti shadow. It's a potential
latecy problem when broken, but not a huge issue. And for the
instruction rewriting hack, moving to "push+sti+ret" also makes a lost
sti shadow just a "possibly odd stack frame visibility" issue rather
than anything deeply fatal.

We can probably just write it off as "some old CPU's (and a smattering
or very rare and not relevant new ones) have potential but unlikely
latency issues because of a historical CPU mis-design - don't do perf
on them".

                Linus
Andy Lutomirski April 30, 2019, 4:33 p.m. UTC | #33
On Tue, Apr 30, 2019 at 9:06 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
>
> Realistically, I don't think you can hit the problem in practice. The
> only way to hit that incredibly small race of "one instruction, *both*
> NMI and interrupts" is to have a lot of interrupts going all at the
> same time, but that will also then solve the latency problem, so the
> very act of triggering it will also fix it.
>
> I don't see any case where it's really bad. The "sti sysexit" race is
> similar, just about latency of user space signal reporting (and
> perhaps any pending TIF_WORK_xyz flags).

In the worst case, it actually kills the machine.  Last time I tracked
a bug like this down, I think the issue was that we got preempted
after the last TIF_ check, entered a VM, exited, context switched
back, and switched to user mode without noticing that there was a
pending KVM user return notifier.  This left us with bogus CPU state
and the machine exploded.

Linus, can I ask you to reconsider your opposition to Josh's other
approach of just shifting the stack on int3 entry?  I agree that it's
ugly, but the ugliness is easily manageable and fairly self-contained.
We add a little bit of complication to the entry asm (but it's not
like it's unprecedented -- the entry asm does all kinds of stack
rearrangement due to IST and PTI crap already), and we add an
int3_emulate_call(struct pt_regs *regs, unsigned long target) helper
that has appropriate assertions that the stack is okay and emulates
the call.  And that's it.

In contrast, your approach involves multiple asm trampolines, hash
tables, batching complications, and sti shadows.

As an additional argument, with the stack-shifting approach, it runs
on *every int3 from kernel mode*.  This means that we can do something
like this:

static bool int3_emulate_call_okay(struct pt_regs *regs)
{
    unsigned long available_stack = regs->sp - (unsigned long)(regs + 1);
    return available_stack >= sizeof(long);
}

void do_int3(...)
{
  WARN_ON_ONCE(!user_mode(regs) && !int3_emulate_call_okay(regs));
  ...;
}

static void int3_emulate_call(struct pt_regs *regs, unsigned long target)
{
  BUG_ON(user_mode(regs) || !int3_emulate_call_okay(regs));
  regs->sp -= sizeof(unsigned long);
  *(unsigned long *)regs->sp = target;
  /* CET SHSTK fixup goes here */
}

Obviously the CET SHSTK fixup might be rather nasty, but I suspect
it's a solvable problem.
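
Roughly, wiring such a helper into the ftrace breakpoint handler is then
a handful of lines.  A minimal sketch of just the mcount-site case,
assuming the value passed to int3_emulate_call() is what ends up pushed
(i.e. the emulated call's return address) and that the handler then
points regs->ip at the actual call target:

	int ftrace_int3_handler(struct pt_regs *regs)
	{
		unsigned long ip = regs->ip - 1;	/* address of the int3 */

		if (!ftrace_location(ip))
			return 0;

		/* push the return address, i.e. the insn after the call site */
		int3_emulate_call(regs, ip + MCOUNT_INSN_SIZE);
		/* ... and redirect execution to a generic ftrace entry point */
		regs->ip = (unsigned long)ftrace_regs_caller;
		return 1;
	}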

A major benefit of this is that the entry asm nastiness will get
exercised all the time, and, if we screw it up, the warning will fire.
This is the basic principle behind why the entry stuff *works* these
days.  I've put a lot of effort into making sure that running kernels
with CONFIG_DEBUG_ENTRY and running the selftests actually exercises
the nasty cases.

--Andy
Steven Rostedt April 30, 2019, 5:03 p.m. UTC | #34
On Tue, 30 Apr 2019 09:33:51 -0700
Andy Lutomirski <luto@kernel.org> wrote:


> Linus, can I ask you to reconsider your opposition to Josh's other
> approach of just shifting the stack on int3 entry?  I agree that it's
> ugly, but the ugliness is easily manageable and fairly self-contained.
> We add a little bit of complication to the entry asm (but it's not
> like it's unprecedented -- the entry asm does all kinds of stack
> rearrangement due to IST and PTI crap already), and we add an
> int3_emulate_call(struct pt_regs *regs, unsigned long target) helper
> that has appropriate assertions that the stack is okay and emulates
> the call.  And that's it.

I also prefer Josh's stack shift solution, as I personally believe
that's a cleaner solution. But I went ahead and implemented Linus's
version to get it working for ftrace. Here's the code, and it survived
some preliminary tests.

There are three places that use the update code. One is the start of
every function call (yes, I counted that as one, and that case is
determined by ftrace_location(ip)). The second is the call site within
the trampoline itself, which also gets updated. That could be converted
to a text poke as well, but for now it's done here as it was written
before text poke existed. The third place is actually a jump (to the
function graph code), but that one can safely be skipped while it is
being converted, as it only ever changes from jump to nop or from nop
to jump.

The trampolines reflect this. Also, as NMI code is traced by ftrace, I
had to duplicate the trampolines for the NMI case (only the
interrupts-disabled variant is needed there, as NMIs never run with
interrupts enabled).

-- Steve

diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..bf320bf791dd 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -17,6 +17,7 @@
 #include <linux/uaccess.h>
 #include <linux/ftrace.h>
 #include <linux/percpu.h>
+#include <linux/frame.h>
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/init.h>
@@ -232,6 +233,9 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 
 static unsigned long ftrace_update_func;
 
+/* Used within inline asm below */
+unsigned long ftrace_update_func_call;
+
 static int update_ftrace_func(unsigned long ip, void *new)
 {
 	unsigned char old[MCOUNT_INSN_SIZE];
@@ -259,6 +263,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned char *new;
 	int ret;
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
 
@@ -280,6 +286,70 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
 	return 0;
 }
 
+extern asmlinkage void ftrace_emulate_call_irqon(void);
+extern asmlinkage void ftrace_emulate_call_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_nmi(void);
+extern asmlinkage void ftrace_emulate_call_update_irqoff(void);
+extern asmlinkage void ftrace_emulate_call_update_irqon(void);
+extern asmlinkage void ftrace_emulate_call_update_nmi(void);
+
+static DEFINE_PER_CPU(void *, ftrace_bp_call_return);
+static DEFINE_PER_CPU(void *, ftrace_bp_call_nmi_return);
+
+asm(
+	".text\n"
+	".global ftrace_emulate_call_irqoff\n"
+	".type ftrace_emulate_call_irqoff, @function\n"
+	"ftrace_emulate_call_irqoff:\n\t"
+		"push %gs:ftrace_bp_call_return\n\t"
+		"sti\n\t"
+		"jmp ftrace_caller\n"
+	".size ftrace_emulate_call_irqoff, .-ftrace_emulate_call_irqoff\n"
+
+	".global ftrace_emulate_call_irqon\n"
+	".type ftrace_emulate_call_irqon, @function\n"
+	"ftrace_emulate_call_irqon:\n\t"
+		"push %gs:ftrace_bp_call_return\n\t"
+		"jmp ftrace_caller\n"
+	".size ftrace_emulate_call_irqon, .-ftrace_emulate_call_irqon\n"
+
+	".global ftrace_emulate_call_nmi\n"
+	".type ftrace_emulate_call_nmi, @function\n"
+	"ftrace_emulate_call_nmi:\n\t"
+		"push %gs:ftrace_bp_call_nmi_return\n\t"
+		"jmp ftrace_caller\n"
+	".size ftrace_emulate_call_nmi, .-ftrace_emulate_call_nmi\n"
+
+	".global ftrace_emulate_call_update_irqoff\n"
+	".type ftrace_emulate_call_update_irqoff, @function\n"
+	"ftrace_emulate_call_update_irqoff:\n\t"
+		"push %gs:ftrace_bp_call_return\n\t"
+		"sti\n\t"
+		"jmp *ftrace_update_func_call\n"
+	".size ftrace_emulate_call_update_irqoff, .-ftrace_emulate_call_update_irqoff\n"
+
+	".global ftrace_emulate_call_update_irqon\n"
+	".type ftrace_emulate_call_update_irqon, @function\n"
+	"ftrace_emulate_call_update_irqon:\n\t"
+		"push %gs:ftrace_bp_call_return\n\t"
+		"jmp *ftrace_update_func_call\n"
+	".size ftrace_emulate_call_update_irqon, .-ftrace_emulate_call_update_irqon\n"
+
+	".global ftrace_emulate_call_update_nmi\n"
+	".type ftrace_emulate_call_update_nmi, @function\n"
+	"ftrace_emulate_call_update_nmi:\n\t"
+		"push %gs:ftrace_bp_call_nmi_return\n\t"
+		"jmp *ftrace_update_func_call\n"
+	".size ftrace_emulate_call_update_nmi, .-ftrace_emulate_call_update_nmi\n"
+	".previous\n");
+
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_nmi);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqoff);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_irqon);
+STACK_FRAME_NON_STANDARD(ftrace_emulate_call_update_nmi);
+
 /*
  * A breakpoint was added to the code address we are about to
  * modify, and this is the handle that will just skip over it.
@@ -295,10 +365,40 @@ int ftrace_int3_handler(struct pt_regs *regs)
 		return 0;
 
 	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+	if (ftrace_location(ip)) {
+		if (in_nmi()) {
+			this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+			regs->ip = (unsigned long) ftrace_emulate_call_nmi;
+			return 1;
+		}
+		this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+		if (regs->flags & X86_EFLAGS_IF) {
+			regs->flags &= ~X86_EFLAGS_IF;
+			regs->ip = (unsigned long) ftrace_emulate_call_irqoff;
+		} else {
+			regs->ip = (unsigned long) ftrace_emulate_call_irqon;
+		}
+	} else if (is_ftrace_caller(ip)) {
+		/* If it's a jump, just need to skip it */
+		if (!ftrace_update_func_call) {
+			regs->ip += MCOUNT_INSN_SIZE -1;
+			return 1;
+		}
+		if (in_nmi()) {
+			this_cpu_write(ftrace_bp_call_nmi_return, (void *)ip + MCOUNT_INSN_SIZE);
+			regs->ip = (unsigned long) ftrace_emulate_call_update_nmi;
+			return 1;
+		}
+		this_cpu_write(ftrace_bp_call_return, (void *)ip + MCOUNT_INSN_SIZE);
+		if (regs->flags & X86_EFLAGS_IF) {
+			regs->flags &= ~X86_EFLAGS_IF;
+			regs->ip = (unsigned long) ftrace_emulate_call_update_irqoff;
+		} else {
+			regs->ip = (unsigned long) ftrace_emulate_call_update_irqon;
+		}
+	} else {
 		return 0;
-
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	}
 
 	return 1;
 }
@@ -859,6 +959,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
 
 	func = ftrace_ops_get_func(ops);
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	/* Do a safe modify in case the trampoline is executing */
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
@@ -960,6 +1062,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
 {
 	unsigned char *new;
 
+	ftrace_update_func_call = 0;
 	new = ftrace_jmp_replace(ip, (unsigned long)func);
 
 	return update_ftrace_func(ip, new);
Steven Rostedt April 30, 2019, 5:20 p.m. UTC | #35
On Tue, 30 Apr 2019 13:03:59 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I also prefer Josh's stack shift solution, as I personally believe
> that's a cleaner solution. But I went ahead and implemented Linus's
> version to get it working for ftrace. Here's the code, and it survived
> some preliminary tests.

Well, it passed the second level of tests.

If this is the way we are going, I could add comments to the code,
apply it to my queue, run it through the rest of my test suite, and
make it ready for the merge window. I may add a stable tag to it so it
goes back to where live kernel patching was added, as it fixes a
potential bug there.

-- Steve

Patch

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 00b7e27bc2b7..0b63ae02b1f3 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -93,6 +93,7 @@  obj-$(CONFIG_LIVEPATCH)	+= livepatch.o
 obj-$(CONFIG_FUNCTION_TRACER)	+= ftrace_$(BITS).o
 obj-$(CONFIG_FUNCTION_GRAPH_TRACER) += ftrace.o
 obj-$(CONFIG_FTRACE_SYSCALLS)	+= ftrace.o
+obj-$(CONFIG_DYNAMIC_FTRACE)	+= ftrace_int3_stubs.o
 obj-$(CONFIG_X86_TSC)		+= trace_clock.o
 obj-$(CONFIG_KEXEC_CORE)	+= machine_kexec_$(BITS).o
 obj-$(CONFIG_KEXEC_CORE)	+= relocate_kernel_$(BITS).o crash.o
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..917494f35cba 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -280,25 +280,84 @@  static nokprobe_inline int is_ftrace_caller(unsigned long ip)
 	return 0;
 }
 
-/*
- * A breakpoint was added to the code address we are about to
- * modify, and this is the handle that will just skip over it.
- * We are either changing a nop into a trace call, or a trace
- * call to a nop. While the change is taking place, we treat
- * it just like it was a nop.
- */
+extern void ftrace_graph_call(void);
+
+asmlinkage void ftrace_int3_stub_regs_caller(void);
+asmlinkage void ftrace_int3_stub_list_func(void);
+
+/* A breakpoint was added to the code address we are about to modify. */
 int ftrace_int3_handler(struct pt_regs *regs)
 {
 	unsigned long ip;
+	bool is_ftrace_location;
+	struct ftrace_int3_stack *stack;
+	int slot;
 
 	if (WARN_ON_ONCE(!regs))
 		return 0;
 
 	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+	is_ftrace_location = ftrace_location(ip);
+	if (!is_ftrace_location && !is_ftrace_caller(ip))
 		return 0;
 
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	ip += MCOUNT_INSN_SIZE;
+
+	if (!is_ftrace_location &&
+	    ftrace_update_func == (unsigned long)ftrace_graph_call) {
+		/*
+		 * The insn at ftrace_graph_call is being changed from a
+		 * nop to a jmp or vice versa. Treat it as a nop and
+		 * skip over it.
+		 */
+		regs->ip = ip;
+		return 1;
+	}
+
+	/*
+	 * The insn having the breakpoint on it is either some mcount
+	 * call site or one of ftrace_call, ftrace_regs_call and their
+	 * equivalents within some trampoline. The currently pending
+	 * transition is known to turn the insn from a nop to a call,
+	 * from a call to a nop or to change the target address of an
+	 * existing call. We're going to emulate a call to the most
+	 * generic implementation capable of handling any possible
+	 * configuration. For the mcount sites that would be
+	 * ftrace_regs_caller() and for the remaining calls, which all
+	 * have got some ftrace_func_t target, ftrace_ops_list_func()
+	 * will do the right thing.
+	 *
+	 * However, the call insn can't get emulated from this trap
+	 * handler here. Rewrite the iret frame's ->ip value to one of
+	 * the ftrace_int3_stub instances which will then setup
+	 * everything in the original context. The address following
+	 * the current insn will be passed to the stub via the
+	 * ftrace_int3_stack.
+	 */
+	stack = &current_thread_info()->ftrace_int3_stack;
+	if (WARN_ON_ONCE(stack->depth >= 4)) {
+		/*
+		 * This should not happen as at most one stack slot is
+		 * required per the contexts "normal", "irq",
+		 * "softirq" and "nmi" each. However, be conservative
+		 * and treat it like a nop.
+		 */
+		regs->ip = ip;
+		return 1;
+	}
+
+	/*
+	 * Make sure interrupts will see the incremented ->depth value
+	 * before writing the stack entry.
+	 */
+	slot = stack->depth;
+	WRITE_ONCE(stack->depth, slot + 1);
+	WRITE_ONCE(stack->slots[slot], ip);
+
+	if (is_ftrace_location)
+		regs->ip = (unsigned long)ftrace_int3_stub_regs_caller;
+	else
+		regs->ip = (unsigned long)ftrace_int3_stub_list_func;
 
 	return 1;
 }
@@ -949,8 +1008,6 @@  void arch_ftrace_trampoline_free(struct ftrace_ops *ops)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 
 #ifdef CONFIG_DYNAMIC_FTRACE
-extern void ftrace_graph_call(void);
-
 static unsigned char *ftrace_jmp_replace(unsigned long ip, unsigned long addr)
 {
 	return ftrace_text_replace(0xe9, ip, addr);
diff --git a/arch/x86/kernel/ftrace_int3_stubs.S b/arch/x86/kernel/ftrace_int3_stubs.S
new file mode 100644
index 000000000000..ef5f580450bb
--- /dev/null
+++ b/arch/x86/kernel/ftrace_int3_stubs.S
@@ -0,0 +1,61 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2019 SUSE Linux GmbH */
+
+#include <asm/asm.h>
+#include <asm/percpu.h>
+#include <asm/unwind_hints.h>
+#include <asm/asm-offsets.h>
+#include <linux/linkage.h>
+
+#ifdef CONFIG_X86_64
+#define WORD_SIZE 8
+#else
+#define WORD_SIZE 4
+#endif
+
+.macro ftrace_int3_stub call_target
+	/*
+	 * We got here from ftrace_int3_handler() because a breakpoint
+	 * on a call insn currently being modified has been
+	 * hit. ftrace_int3_handler() can't emulate the function call
+	 * directly, because it's running at a different position on
+	 * the stack, obviously. Hence it sets the regs->ip to this
+	 * stub so that we end up here upon the iret from the int3
+	 * handler. The stack is now in its original state and we can
+	 * emulate the function call insn by pushing the return
+	 * address onto the stack and jumping to the call target. The
+	 * desired return address has been put onto the ftrace_int3_stack
+	 * kept within struct thread_info.
+	 */
+	UNWIND_HINT_EMPTY
+	/* Reserve room for the emulated call's return address. */
+	sub	$WORD_SIZE, %_ASM_SP
+	/*
+	 * Pop the return address from ftrace_int3_ip_stack and write
+	 * it to the location just reserved. Be careful to retrieve
+	 * the address before decrementing ->depth in order to protect
+	 * against nested contexts clobbering it.
+	 */
+	push	%_ASM_AX
+	push	%_ASM_CX
+	push	%_ASM_DX
+	mov	PER_CPU_VAR(current_task), %_ASM_AX
+	mov	TASK_TI_ftrace_int3_depth(%_ASM_AX), %_ASM_CX
+	dec	%_ASM_CX
+	mov	TASK_TI_ftrace_int3_slots(%_ASM_AX, %_ASM_CX, WORD_SIZE), %_ASM_DX
+	mov	%_ASM_CX, TASK_TI_ftrace_int3_depth(%_ASM_AX)
+	mov	%_ASM_DX, 3*WORD_SIZE(%_ASM_SP)
+	pop	%_ASM_DX
+	pop	%_ASM_CX
+	pop	%_ASM_AX
+	/* Finally, transfer control to the target function. */
+	jmp	\call_target
+.endm
+
+ENTRY(ftrace_int3_stub_regs_caller)
+	ftrace_int3_stub ftrace_regs_caller
+END(ftrace_int3_stub_regs_caller)
+
+ENTRY(ftrace_int3_stub_list_func)
+	ftrace_int3_stub ftrace_ops_list_func
+END(ftrace_int3_stub_list_func)
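
For reference, the ftrace_int3_stack consumed by the handler and the
stubs above is introduced by an earlier patch of this series.  Going by
the accesses visible here (->depth, ->slots[], the limit of four slots
and the TASK_TI_ftrace_int3_* offsets), its shape is roughly the
following sketch; the authoritative definition is in that earlier patch:

	/* Sketch only -- the real definition comes from the earlier patch. */
	struct ftrace_int3_stack {
		int		depth;		/* number of occupied slots */
		unsigned long	slots[4];	/* normal, irq, softirq, nmi */
	};

	struct thread_info {
		/* ... existing members ... */
		struct ftrace_int3_stack	ftrace_int3_stack;
	};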