Message ID | 20240408090205.3714934-1-elver@google.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | tracing: Add new_exec tracepoint | expand |
On Mon, 8 Apr 2024 11:01:54 +0200 Marco Elver <elver@google.com> wrote: > Add "new_exec" tracepoint, which is run right after the point of no > return but before the current task assumes its new exec identity. > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > runs before flushing the old exec, i.e. while the task still has the > original state (such as original MM), but when the new exec either > succeeds or crashes (but never returns to the original exec). > > Being able to trace this event can be helpful in a number of use cases: > > * allowing tracing eBPF programs access to the original MM on exec, > before current->mm is replaced; > * counting exec in the original task (via perf event); > * profiling flush time ("new_exec" to "sched_process_exec"). > > Example of tracing output ("new_exec" and "sched_process_exec"): How common is this? And can't you just do the same with adding a kprobe? -- Steve
On Tue, 9 Apr 2024 at 16:31, Steven Rostedt <rostedt@goodmis.org> wrote: > > On Mon, 8 Apr 2024 11:01:54 +0200 > Marco Elver <elver@google.com> wrote: > > > Add "new_exec" tracepoint, which is run right after the point of no > > return but before the current task assumes its new exec identity. > > > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > > runs before flushing the old exec, i.e. while the task still has the > > original state (such as original MM), but when the new exec either > > succeeds or crashes (but never returns to the original exec). > > > > Being able to trace this event can be helpful in a number of use cases: > > > > * allowing tracing eBPF programs access to the original MM on exec, > > before current->mm is replaced; > > * counting exec in the original task (via perf event); > > * profiling flush time ("new_exec" to "sched_process_exec"). > > > > Example of tracing output ("new_exec" and "sched_process_exec"): > > How common is this? And can't you just do the same with adding a kprobe? Our main use case would be to use this in BPF programs to become exec-aware, where using the sched_process_exec hook is too late. This is particularly important where the BPF program must stop inspecting the user space's VM when the task does exec to become a new process. kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s permission check can still return an error which returns to the original task without crashing. Only at the point of no return are we guaranteed that the exec either succeeds, or the task is terminated on failure. I don't know if "common" is the right question here, because it's a chicken-egg problem: no tracepoint, we give up; we have the tracepoint, it unlocks a range of new use cases (that require robust solution to make BPF programs exec-aware, and a tracepoint is the only option IMHO). Thanks, -- Marco
On Mon, Apr 08, 2024 at 11:01:54AM +0200, Marco Elver wrote: > Add "new_exec" tracepoint, which is run right after the point of no > return but before the current task assumes its new exec identity. > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > runs before flushing the old exec, i.e. while the task still has the > original state (such as original MM), but when the new exec either > succeeds or crashes (but never returns to the original exec). > > Being able to trace this event can be helpful in a number of use cases: > > * allowing tracing eBPF programs access to the original MM on exec, > before current->mm is replaced; > * counting exec in the original task (via perf event); > * profiling flush time ("new_exec" to "sched_process_exec"). > > Example of tracing output ("new_exec" and "sched_process_exec"): > > $ cat /sys/kernel/debug/tracing/trace_pipe > <...>-379 [003] ..... 179.626921: new_exec: filename=/usr/bin/sshd pid=379 comm=sshd > <...>-379 [003] ..... 179.629131: sched_process_exec: filename=/usr/bin/sshd pid=379 old_pid=379 > <...>-381 [002] ..... 180.048580: new_exec: filename=/bin/bash pid=381 comm=sshd > <...>-381 [002] ..... 180.053122: sched_process_exec: filename=/bin/bash pid=381 old_pid=381 > <...>-385 [001] ..... 180.068277: new_exec: filename=/usr/bin/tty pid=385 comm=bash > <...>-385 [001] ..... 180.069485: sched_process_exec: filename=/usr/bin/tty pid=385 old_pid=385 > <...>-389 [006] ..... 192.020147: new_exec: filename=/usr/bin/dmesg pid=389 comm=bash > bash-389 [006] ..... 192.021377: sched_process_exec: filename=/usr/bin/dmesg pid=389 old_pid=389 > > Signed-off-by: Marco Elver <elver@google.com> > --- > fs/exec.c | 2 ++ > include/trace/events/task.h | 30 ++++++++++++++++++++++++++++++ > 2 files changed, 32 insertions(+) > > diff --git a/fs/exec.c b/fs/exec.c > index 38bf71cbdf5e..ab778ae1fc06 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1268,6 +1268,8 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > return retval; > > + trace_new_exec(current, bprm); > + All other steps in this function have explicit comments about what/why/etc. Please add some kind of comment describing why the tracepoint is where it is, etc. For example, maybe something like: /* * Before any changes to 'current', report that the exec is about to * happen (since we made it to the point of no return). On a successful * exec, the 'sched_process_exec' tracepoint will also fire. On failure, * ... [something else] */ > +TRACE_EVENT(new_exec, > + > + TP_PROTO(struct task_struct *task, struct linux_binprm *bprm), > + > + TP_ARGS(task, bprm), > + > + TP_STRUCT__entry( > + __string( filename, bprm->filename ) > + __field( pid_t, pid ) > + __string( comm, task->comm ) > + ), > + > + TP_fast_assign( > + __assign_str(filename, bprm->filename); What about binfmt_misc, and binfmt_script? You may want bprm->interp too? -Kees > + __entry->pid = task->pid; > + __assign_str(comm, task->comm); > + ), > + > + TP_printk("filename=%s pid=%d comm=%s", > + __get_str(filename), __entry->pid, __get_str(comm)) > +); > + > #endif > > /* This part must be outside protection */ > -- > 2.44.0.478.gd926399ef9-goog >
On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote: [...] > > + trace_new_exec(current, bprm); > > + > > All other steps in this function have explicit comments about > what/why/etc. Please add some kind of comment describing why the > tracepoint is where it is, etc. I beefed up the tracepoint documentation, and wrote a little paragraph above where it's called to reinforce what we want. [...] > What about binfmt_misc, and binfmt_script? You may want bprm->interp > too? Good points. I'll make the below changes for v2: diff --git a/fs/exec.c b/fs/exec.c index ab778ae1fc06..472b9f7b40e8 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) return retval; + /* + * This tracepoint marks the point before flushing the old exec where + * the current task is still unchanged, but errors are fatal (point of + * no return). The later "sched_process_exec" tracepoint is called after + * the current task has successfully switched to the new exec. + */ trace_new_exec(current, bprm); /* diff --git a/include/trace/events/task.h b/include/trace/events/task.h index 8853dc44783d..623d9af777c1 100644 --- a/include/trace/events/task.h +++ b/include/trace/events/task.h @@ -61,8 +61,11 @@ TRACE_EVENT(task_rename, * @task: pointer to the current task * @bprm: pointer to linux_binprm used for new exec * - * Called before flushing the old exec, but at the point of no return during - * switching to the new exec. + * Called before flushing the old exec, where @task is still unchanged, but at + * the point of no return during switching to the new exec. At the point it is + * called the exec will either succeed, or on failure terminate the task. Also + * see the "sched_process_exec" tracepoint, which is called right after @task + * has successfully switched to the new exec. */ TRACE_EVENT(new_exec, @@ -71,19 +74,22 @@ TRACE_EVENT(new_exec, TP_ARGS(task, bprm), TP_STRUCT__entry( + __string( interp, bprm->interp ) __string( filename, bprm->filename ) __field( pid_t, pid ) __string( comm, task->comm ) ), TP_fast_assign( + __assign_str(interp, bprm->interp); __assign_str(filename, bprm->filename); __entry->pid = task->pid; __assign_str(comm, task->comm); ), - TP_printk("filename=%s pid=%d comm=%s", - __get_str(filename), __entry->pid, __get_str(comm)) + TP_printk("interp=%s filename=%s pid=%d comm=%s", + __get_str(interp), __get_str(filename), + __entry->pid, __get_str(comm)) ); #endif
On Tue, Apr 09, 2024 at 08:25:45PM +0200, Marco Elver wrote: > On Tue, Apr 09, 2024 at 08:46AM -0700, Kees Cook wrote: > [...] > > > + trace_new_exec(current, bprm); > > > + > > > > All other steps in this function have explicit comments about > > what/why/etc. Please add some kind of comment describing why the > > tracepoint is where it is, etc. > > I beefed up the tracepoint documentation, and wrote a little paragraph > above where it's called to reinforce what we want. > > [...] > > What about binfmt_misc, and binfmt_script? You may want bprm->interp > > too? > > Good points. I'll make the below changes for v2: > > diff --git a/fs/exec.c b/fs/exec.c > index ab778ae1fc06..472b9f7b40e8 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1268,6 +1268,12 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > return retval; > > + /* > + * This tracepoint marks the point before flushing the old exec where > + * the current task is still unchanged, but errors are fatal (point of > + * no return). The later "sched_process_exec" tracepoint is called after > + * the current task has successfully switched to the new exec. > + */ > trace_new_exec(current, bprm); > > /* > diff --git a/include/trace/events/task.h b/include/trace/events/task.h > index 8853dc44783d..623d9af777c1 100644 > --- a/include/trace/events/task.h > +++ b/include/trace/events/task.h > @@ -61,8 +61,11 @@ TRACE_EVENT(task_rename, > * @task: pointer to the current task > * @bprm: pointer to linux_binprm used for new exec > * > - * Called before flushing the old exec, but at the point of no return during > - * switching to the new exec. > + * Called before flushing the old exec, where @task is still unchanged, but at > + * the point of no return during switching to the new exec. At the point it is > + * called the exec will either succeed, or on failure terminate the task. Also > + * see the "sched_process_exec" tracepoint, which is called right after @task > + * has successfully switched to the new exec. > */ > TRACE_EVENT(new_exec, > > @@ -71,19 +74,22 @@ TRACE_EVENT(new_exec, > TP_ARGS(task, bprm), > > TP_STRUCT__entry( > + __string( interp, bprm->interp ) > __string( filename, bprm->filename ) > __field( pid_t, pid ) > __string( comm, task->comm ) > ), > > TP_fast_assign( > + __assign_str(interp, bprm->interp); > __assign_str(filename, bprm->filename); > __entry->pid = task->pid; > __assign_str(comm, task->comm); > ), > > - TP_printk("filename=%s pid=%d comm=%s", > - __get_str(filename), __entry->pid, __get_str(comm)) > + TP_printk("interp=%s filename=%s pid=%d comm=%s", > + __get_str(interp), __get_str(filename), > + __entry->pid, __get_str(comm)) > ); > > #endif Looks good; I await v2, and Steven's Ack. :)
On Tue, 9 Apr 2024 16:45:47 +0200 Marco Elver <elver@google.com> wrote: > On Tue, 9 Apr 2024 at 16:31, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > On Mon, 8 Apr 2024 11:01:54 +0200 > > Marco Elver <elver@google.com> wrote: > > > > > Add "new_exec" tracepoint, which is run right after the point of no > > > return but before the current task assumes its new exec identity. > > > > > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > > > runs before flushing the old exec, i.e. while the task still has the > > > original state (such as original MM), but when the new exec either > > > succeeds or crashes (but never returns to the original exec). > > > > > > Being able to trace this event can be helpful in a number of use cases: > > > > > > * allowing tracing eBPF programs access to the original MM on exec, > > > before current->mm is replaced; > > > * counting exec in the original task (via perf event); > > > * profiling flush time ("new_exec" to "sched_process_exec"). > > > > > > Example of tracing output ("new_exec" and "sched_process_exec"): > > > > How common is this? And can't you just do the same with adding a kprobe? > > Our main use case would be to use this in BPF programs to become > exec-aware, where using the sched_process_exec hook is too late. This > is particularly important where the BPF program must stop inspecting > the user space's VM when the task does exec to become a new process. Just out of curiousity, would you like to audit that the user-program is not malformed? (security tracepoint?) I think that is an interesting idea. What kind of information you need? > > kprobe (or BPF's fentry) is brittle here, because begin_new_exec()'s > permission check can still return an error which returns to the > original task without crashing. Only at the point of no return are we > guaranteed that the exec either succeeds, or the task is terminated on > failure. Just a note: That is BPF limitation, kprobe and kprobe events can put a probe in the function body, but that is not supported on BPF (I guess because it depends on kernel debuginfo.) You can add kprobe-event using "perf probe" tool. Thank you, > > I don't know if "common" is the right question here, because it's a > chicken-egg problem: no tracepoint, we give up; we have the > tracepoint, it unlocks a range of new use cases (that require robust > solution to make BPF programs exec-aware, and a tracepoint is the only > option IMHO). > > Thanks, > -- Marco
On Wed, 10 Apr 2024 at 01:54, Masami Hiramatsu <mhiramat@kernel.org> wrote: > > On Tue, 9 Apr 2024 16:45:47 +0200 > Marco Elver <elver@google.com> wrote: > > > On Tue, 9 Apr 2024 at 16:31, Steven Rostedt <rostedt@goodmis.org> wrote: > > > > > > On Mon, 8 Apr 2024 11:01:54 +0200 > > > Marco Elver <elver@google.com> wrote: > > > > > > > Add "new_exec" tracepoint, which is run right after the point of no > > > > return but before the current task assumes its new exec identity. > > > > > > > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > > > > runs before flushing the old exec, i.e. while the task still has the > > > > original state (such as original MM), but when the new exec either > > > > succeeds or crashes (but never returns to the original exec). > > > > > > > > Being able to trace this event can be helpful in a number of use cases: > > > > > > > > * allowing tracing eBPF programs access to the original MM on exec, > > > > before current->mm is replaced; > > > > * counting exec in the original task (via perf event); > > > > * profiling flush time ("new_exec" to "sched_process_exec"). > > > > > > > > Example of tracing output ("new_exec" and "sched_process_exec"): > > > > > > How common is this? And can't you just do the same with adding a kprobe? > > > > Our main use case would be to use this in BPF programs to become > > exec-aware, where using the sched_process_exec hook is too late. This > > is particularly important where the BPF program must stop inspecting > > the user space's VM when the task does exec to become a new process. > > Just out of curiousity, would you like to audit that the user-program > is not malformed? (security tracepoint?) I think that is an interesting > idea. What kind of information you need? I didn't have that in mind. If the BPF program reads (or even writes) to user space memory, it must stop doing so before current->mm is switched, otherwise it will lead to random results or memory corruption. The new process may reallocate the memory that we want to inspect, but the user space process must explicitly opt in to being inspected or being manipulated. Just like the kernel "flushes" various old state on exec since it's becoming a new process, a BPF program that has per-process state needs to do the same.
On Mon, 8 Apr 2024 11:01:54 +0200 Marco Elver <elver@google.com> wrote: > Add "new_exec" tracepoint, which is run right after the point of no > return but before the current task assumes its new exec identity. > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > runs before flushing the old exec, i.e. while the task still has the > original state (such as original MM), but when the new exec either > succeeds or crashes (but never returns to the original exec). > > Being able to trace this event can be helpful in a number of use cases: > > * allowing tracing eBPF programs access to the original MM on exec, > before current->mm is replaced; > * counting exec in the original task (via perf event); > * profiling flush time ("new_exec" to "sched_process_exec"). > > Example of tracing output ("new_exec" and "sched_process_exec"): nit: "new_exec" name a bit stands out compared to other events, and hard to expect it comes before or after "sched_process_exec". Since "begin_new_exec" is internal implementation name, IMHO, it should not exposed to user. What do you think about calling this "sched_prepare_exec" ? Thank you, > > $ cat /sys/kernel/debug/tracing/trace_pipe > <...>-379 [003] ..... 179.626921: new_exec: filename=/usr/bin/sshd pid=379 comm=sshd > <...>-379 [003] ..... 179.629131: sched_process_exec: filename=/usr/bin/sshd pid=379 old_pid=379 > <...>-381 [002] ..... 180.048580: new_exec: filename=/bin/bash pid=381 comm=sshd > <...>-381 [002] ..... 180.053122: sched_process_exec: filename=/bin/bash pid=381 old_pid=381 > <...>-385 [001] ..... 180.068277: new_exec: filename=/usr/bin/tty pid=385 comm=bash > <...>-385 [001] ..... 180.069485: sched_process_exec: filename=/usr/bin/tty pid=385 old_pid=385 > <...>-389 [006] ..... 192.020147: new_exec: filename=/usr/bin/dmesg pid=389 comm=bash > bash-389 [006] ..... 192.021377: sched_process_exec: filename=/usr/bin/dmesg pid=389 old_pid=389 > > Signed-off-by: Marco Elver <elver@google.com> > --- > fs/exec.c | 2 ++ > include/trace/events/task.h | 30 ++++++++++++++++++++++++++++++ > 2 files changed, 32 insertions(+) > > diff --git a/fs/exec.c b/fs/exec.c > index 38bf71cbdf5e..ab778ae1fc06 100644 > --- a/fs/exec.c > +++ b/fs/exec.c > @@ -1268,6 +1268,8 @@ int begin_new_exec(struct linux_binprm * bprm) > if (retval) > return retval; > > + trace_new_exec(current, bprm); > + > /* > * Ensure all future errors are fatal. > */ > diff --git a/include/trace/events/task.h b/include/trace/events/task.h > index 47b527464d1a..8853dc44783d 100644 > --- a/include/trace/events/task.h > +++ b/include/trace/events/task.h > @@ -56,6 +56,36 @@ TRACE_EVENT(task_rename, > __entry->newcomm, __entry->oom_score_adj) > ); > > +/** > + * new_exec - called before setting up new exec > + * @task: pointer to the current task > + * @bprm: pointer to linux_binprm used for new exec > + * > + * Called before flushing the old exec, but at the point of no return during > + * switching to the new exec. > + */ > +TRACE_EVENT(new_exec, > + > + TP_PROTO(struct task_struct *task, struct linux_binprm *bprm), > + > + TP_ARGS(task, bprm), > + > + TP_STRUCT__entry( > + __string( filename, bprm->filename ) > + __field( pid_t, pid ) > + __string( comm, task->comm ) > + ), > + > + TP_fast_assign( > + __assign_str(filename, bprm->filename); > + __entry->pid = task->pid; > + __assign_str(comm, task->comm); > + ), > + > + TP_printk("filename=%s pid=%d comm=%s", > + __get_str(filename), __entry->pid, __get_str(comm)) > +); > + > #endif > > /* This part must be outside protection */ > -- > 2.44.0.478.gd926399ef9-goog >
On Wed, 10 Apr 2024 at 15:56, Masami Hiramatsu <mhiramat@kernel.org> wrote: > > On Mon, 8 Apr 2024 11:01:54 +0200 > Marco Elver <elver@google.com> wrote: > > > Add "new_exec" tracepoint, which is run right after the point of no > > return but before the current task assumes its new exec identity. > > > > Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint > > runs before flushing the old exec, i.e. while the task still has the > > original state (such as original MM), but when the new exec either > > succeeds or crashes (but never returns to the original exec). > > > > Being able to trace this event can be helpful in a number of use cases: > > > > * allowing tracing eBPF programs access to the original MM on exec, > > before current->mm is replaced; > > * counting exec in the original task (via perf event); > > * profiling flush time ("new_exec" to "sched_process_exec"). > > > > Example of tracing output ("new_exec" and "sched_process_exec"): > > nit: "new_exec" name a bit stands out compared to other events, and hard to > expect it comes before or after "sched_process_exec". Since "begin_new_exec" > is internal implementation name, IMHO, it should not exposed to user. > What do you think about calling this "sched_prepare_exec" ? I like it, I'll rename it to sched_prepare_exec. Thanks!
diff --git a/fs/exec.c b/fs/exec.c index 38bf71cbdf5e..ab778ae1fc06 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -1268,6 +1268,8 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) return retval; + trace_new_exec(current, bprm); + /* * Ensure all future errors are fatal. */ diff --git a/include/trace/events/task.h b/include/trace/events/task.h index 47b527464d1a..8853dc44783d 100644 --- a/include/trace/events/task.h +++ b/include/trace/events/task.h @@ -56,6 +56,36 @@ TRACE_EVENT(task_rename, __entry->newcomm, __entry->oom_score_adj) ); +/** + * new_exec - called before setting up new exec + * @task: pointer to the current task + * @bprm: pointer to linux_binprm used for new exec + * + * Called before flushing the old exec, but at the point of no return during + * switching to the new exec. + */ +TRACE_EVENT(new_exec, + + TP_PROTO(struct task_struct *task, struct linux_binprm *bprm), + + TP_ARGS(task, bprm), + + TP_STRUCT__entry( + __string( filename, bprm->filename ) + __field( pid_t, pid ) + __string( comm, task->comm ) + ), + + TP_fast_assign( + __assign_str(filename, bprm->filename); + __entry->pid = task->pid; + __assign_str(comm, task->comm); + ), + + TP_printk("filename=%s pid=%d comm=%s", + __get_str(filename), __entry->pid, __get_str(comm)) +); + #endif /* This part must be outside protection */
Add "new_exec" tracepoint, which is run right after the point of no return but before the current task assumes its new exec identity. Unlike the tracepoint "sched_process_exec", the "new_exec" tracepoint runs before flushing the old exec, i.e. while the task still has the original state (such as original MM), but when the new exec either succeeds or crashes (but never returns to the original exec). Being able to trace this event can be helpful in a number of use cases: * allowing tracing eBPF programs access to the original MM on exec, before current->mm is replaced; * counting exec in the original task (via perf event); * profiling flush time ("new_exec" to "sched_process_exec"). Example of tracing output ("new_exec" and "sched_process_exec"): $ cat /sys/kernel/debug/tracing/trace_pipe <...>-379 [003] ..... 179.626921: new_exec: filename=/usr/bin/sshd pid=379 comm=sshd <...>-379 [003] ..... 179.629131: sched_process_exec: filename=/usr/bin/sshd pid=379 old_pid=379 <...>-381 [002] ..... 180.048580: new_exec: filename=/bin/bash pid=381 comm=sshd <...>-381 [002] ..... 180.053122: sched_process_exec: filename=/bin/bash pid=381 old_pid=381 <...>-385 [001] ..... 180.068277: new_exec: filename=/usr/bin/tty pid=385 comm=bash <...>-385 [001] ..... 180.069485: sched_process_exec: filename=/usr/bin/tty pid=385 old_pid=385 <...>-389 [006] ..... 192.020147: new_exec: filename=/usr/bin/dmesg pid=389 comm=bash bash-389 [006] ..... 192.021377: sched_process_exec: filename=/usr/bin/dmesg pid=389 old_pid=389 Signed-off-by: Marco Elver <elver@google.com> --- fs/exec.c | 2 ++ include/trace/events/task.h | 30 ++++++++++++++++++++++++++++++ 2 files changed, 32 insertions(+)