Message ID | 20190427100639.15074-2-nstange@suse.de (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | x86/ftrace: make ftrace_int3_handler() not to skip fops invocation | expand |
> On Apr 27, 2019, at 3:06 AM, Nicolai Stange <nstange@suse.de> wrote: > > Before actually rewriting an insn, x86' DYNAMIC_FTRACE implementation > places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply > treats the insn in question as nop and advances %rip past it. How does this not crash all the time? > An upcoming > patch will improve this by making the int3 trap handler emulate the call > insn. To this end, ftrace_int3_handler() will be made to change its iret > frame's ->ip to some stub which will then mimic the function call in the > original context. > > Somehow the trapping ->ip address will have to get communicated from > ftrace_int3_handler() to these stubs though. Cute. But this should get “ftrace” removed from the name, since it’s potentially more generally useful. Josh wanted something like this for static_call. > Note that at any given point > in time, there can be at most four such call insn emulations pending: > namely at most one per "process", "irq", "softirq" and "nmi" context. > That’s quite an assumption. I think your list should also contain exception, exceptions nested inside that exception, and machine check, at the very least. I’m also wondering why irq and softirq are treated separately. All this makes me think that one of the other solutions we came up with last time we discussed this might be better.
On Sun, 28 Apr 2019 10:41:10 -0700 Andy Lutomirski <luto@amacapital.net> wrote: > > Note that at any given point > > in time, there can be at most four such call insn emulations pending: > > namely at most one per "process", "irq", "softirq" and "nmi" context. > > > > That’s quite an assumption. I think your list should also contain > exception, exceptions nested inside that exception, and machine > check, at the very least. I’m also wondering why irq and softirq are > treated separately. 4 has usually been the context count we choose. But I guess in theory, if we get exceptions then I could potentially be more. As for irq vs softirq, an interrupt can preempt a softirq. Interrupts are enabled while softirqs are running. When sofirqs run, softirqs are disabled to prevent nested softirqs. But interrupts are enabled again, and another interrupt may come in while a softirq is executing. > > All this makes me think that one of the other solutions we came up > with last time we discussed this might be better. +100 Perhaps adding another slot into pt_regs that gets used by int3 to store a slot to emulate a call on return? -- Steve
> On Apr 28, 2019, at 10:51 AM, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Sun, 28 Apr 2019 10:41:10 -0700 > Andy Lutomirski <luto@amacapital.net> wrote: > > >>> Note that at any given point >>> in time, there can be at most four such call insn emulations pending: >>> namely at most one per "process", "irq", "softirq" and "nmi" context. >>> >> >> That’s quite an assumption. I think your list should also contain >> exception, exceptions nested inside that exception, and machine >> check, at the very least. I’m also wondering why irq and softirq are >> treated separately. > > 4 has usually been the context count we choose. But I guess in theory, > if we get exceptions then I could potentially be more. > > As for irq vs softirq, an interrupt can preempt a softirq. Interrupts > are enabled while softirqs are running. When sofirqs run, softirqs are > disabled to prevent nested softirqs. But interrupts are enabled again, > and another interrupt may come in while a softirq is executing. > >> >> All this makes me think that one of the other solutions we came up >> with last time we discussed this might be better. > > +100 > > Perhaps adding another slot into pt_regs that gets used by int3 to > store a slot to emulate a call on return? > > That’s not totally nuts, although finding pt_regs isn’t entirely trivial. I still think I prefer an approach where we just emulate the call directly.
On Sun, 28 Apr 2019 11:08:34 -0700 Andy Lutomirski <luto@amacapital.net> wrote: > > > > Perhaps adding another slot into pt_regs that gets used by int3 to > > store a slot to emulate a call on return? > > > > > > That’s not totally nuts, although finding pt_regs isn’t entirely trivial. I meant on the int3 handler (which stores the pt_regs). > > I still think I prefer an approach where we just emulate the call directly. Then, on the return of int3, if there's anything in that slot, then we could possibly shift the exception handler frame (that was added by the hardware), insert the slot data into the top of the stack, and then call iret (which the int3 handler, would add the return ip to be the function being called), which would in essence emulate the call directly. I believe the complexity comes from the exception frame added by the hardware is where we need to put the return of the call for the emulation. -- Steve
> On Apr 28, 2019, at 12:43 PM, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Sun, 28 Apr 2019 11:08:34 -0700 > Andy Lutomirski <luto@amacapital.net> wrote: > >>> >>> Perhaps adding another slot into pt_regs that gets used by int3 to >>> store a slot to emulate a call on return? >>> >>> >> >> That’s not totally nuts, although finding pt_regs isn’t entirely trivial. > > I meant on the int3 handler (which stores the pt_regs). But that’s below the stub’s RSP, so it’s toast if another interrupt happens. Or am I misunderstanding you? > >> >> I still think I prefer an approach where we just emulate the call directly. > > Then, on the return of int3, if there's anything in that slot, then we > could possibly shift the exception handler frame (that was added by the > hardware), insert the slot data into the top of the stack, and then > call iret (which the int3 handler, would add the return ip to be the > function being called), which would in essence emulate the call directly. Oh, I get it. I liked Josh’s old proposal of unconditionally shifting the #BP frame 8 bytes better. It will be interesting when kernel shadow stacks are thrown in the mix, but that’s a problem for another day.
Steven Rostedt <rostedt@goodmis.org> writes: > On Sun, 28 Apr 2019 10:41:10 -0700 > Andy Lutomirski <luto@amacapital.net> wrote: > > >> > Note that at any given point >> > in time, there can be at most four such call insn emulations pending: >> > namely at most one per "process", "irq", "softirq" and "nmi" context. >> > >> >> That’s quite an assumption. I think your list should also contain >> exception, exceptions nested inside that exception, and machine >> check, at the very least. I’m also wondering why irq and softirq are >> treated separately. You're right, I missed the machine check case. > 4 has usually been the context count we choose. But I guess in theory, > if we get exceptions then I could potentially be more. After having seen the static_call discussion, I'm in no way defending this limited approach here, but out of curiosity: can the code between the push onto the stack from ftrace_int3_handler() and the subsequent pop from the stub actually trigger an (non-#MC) exception? There's an iret inbetween, but that can fault only when returning to user space, correct? > As for irq vs softirq, an interrupt can preempt a softirq. Interrupts > are enabled while softirqs are running. When sofirqs run, softirqs are > disabled to prevent nested softirqs. But interrupts are enabled again, > and another interrupt may come in while a softirq is executing. > Thanks, Nicolai
> On Apr 28, 2019, at 2:22 PM, Nicolai Stange <nstange@suse.de> wrote: > > Steven Rostedt <rostedt@goodmis.org> writes: > >> On Sun, 28 Apr 2019 10:41:10 -0700 >> Andy Lutomirski <luto@amacapital.net> wrote: >> >> >>>> Note that at any given point >>>> in time, there can be at most four such call insn emulations pending: >>>> namely at most one per "process", "irq", "softirq" and "nmi" context. >>>> >>> >>> That’s quite an assumption. I think your list should also contain >>> exception, exceptions nested inside that exception, and machine >>> check, at the very least. I’m also wondering why irq and softirq are >>> treated separately. > > You're right, I missed the machine check case. > > >> 4 has usually been the context count we choose. But I guess in theory, >> if we get exceptions then I could potentially be more. > > After having seen the static_call discussion, I'm in no way defending > this limited approach here, but out of curiosity: can the code between > the push onto the stack from ftrace_int3_handler() and the subsequent > pop from the stub actually trigger an (non-#MC) exception? There's an > iret inbetween, but that can fault only when returning to user space, > correct? IRET doesn’t do any fancy masking, so #DB, NMI and regular IRQs should all be possible. > > >> As for irq vs softirq, an interrupt can preempt a softirq. Interrupts >> are enabled while softirqs are running. When sofirqs run, softirqs are >> disabled to prevent nested softirqs. But interrupts are enabled again, >> and another interrupt may come in while a softirq is executing. >> > > Thanks, > > Nicolai > > > -- > SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, > HRB 21284 (AG Nürnberg)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index e0eccbcb8447..83434a88cfbb 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -56,6 +56,17 @@ struct task_struct; struct thread_info { unsigned long flags; /* low level flags */ u32 status; /* thread synchronous flags */ +#ifdef CONFIG_DYNAMIC_FTRACE + struct ftrace_int3_stack { + int depth; + /* + * There can be at most one slot in use per context, + * i.e. at most one for "normal", "irq", "softirq" and + * "nmi" each. + */ + unsigned long slots[4]; + } ftrace_int3_stack; +#endif }; #define INIT_THREAD_INFO(tsk) \ diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c index 168543d077d7..ca6ee24a0c6e 100644 --- a/arch/x86/kernel/asm-offsets.c +++ b/arch/x86/kernel/asm-offsets.c @@ -105,4 +105,12 @@ static void __used common(void) OFFSET(TSS_sp0, tss_struct, x86_tss.sp0); OFFSET(TSS_sp1, tss_struct, x86_tss.sp1); OFFSET(TSS_sp2, tss_struct, x86_tss.sp2); + +#ifdef CONFIG_DYNAMIC_FTRACE + BLANK(); + OFFSET(TASK_TI_ftrace_int3_depth, task_struct, + thread_info.ftrace_int3_stack.depth); + OFFSET(TASK_TI_ftrace_int3_slots, task_struct, + thread_info.ftrace_int3_stack.slots); +#endif }
Before actually rewriting an insn, x86' DYNAMIC_FTRACE implementation places an int3 breakpoint on it. Currently, ftrace_int3_handler() simply treats the insn in question as nop and advances %rip past it. An upcoming patch will improve this by making the int3 trap handler emulate the call insn. To this end, ftrace_int3_handler() will be made to change its iret frame's ->ip to some stub which will then mimic the function call in the original context. Somehow the trapping ->ip address will have to get communicated from ftrace_int3_handler() to these stubs though. Note that at any given point in time, there can be at most four such call insn emulations pending: namely at most one per "process", "irq", "softirq" and "nmi" context. Introduce struct ftrace_int3_stack providing four entries for storing the instruction pointer. In principle, it could be made per-cpu, but this would require making ftrace_int3_handler() to return with preemption disabled and to enable it from those emulation stubs again only after the stack's top entry has been consumed. I've been told that this would "break a lot of norms" and that making this stack part of struct thread_info instead would be less fragile. Follow this advice and add a struct ftrace_int3_stack instance to x86's struct thread_info. Note that these stacks will get only rarely accessed (only during ftrace's code modifications) and thus, cache line dirtying won't have any significant impact on the neighbouring fields. Initialization will take place implicitly through INIT_THREAD_INFO as per the rules for missing elements in initializers. The memcpy() in arch_dup_task_struct() will propagate the initial state properly, because it's always run in process context and won't ever see a non-zero ->depth value. Finally, add the necessary bits to asm-offsets for making struct ftrace_int3_stack accessible from assembly. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicolai Stange <nstange@suse.de> --- arch/x86/include/asm/thread_info.h | 11 +++++++++++ arch/x86/kernel/asm-offsets.c | 8 ++++++++ 2 files changed, 19 insertions(+)