Message ID | 20241107122648.2504368-1-elver@google.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v2,1/2] tracing: Add task_prctl_unknown tracepoint | expand |
On Thu, 7 Nov 2024 13:25:47 +0100 Marco Elver <elver@google.com> wrote: > +/** > + * task_prctl_unknown - called on unknown prctl() option > + * @task: pointer to the current task > + * @option: option passed > + * @arg2: arg2 passed > + * @arg3: arg3 passed > + * @arg4: arg4 passed > + * @arg5: arg5 passed > + * > + * Called on an unknown prctl() option. > + */ > +TRACE_EVENT(task_prctl_unknown, > + > + TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3, > + unsigned long arg4, unsigned long arg5), > + > + TP_ARGS(task, option, arg2, arg3, arg4, arg5), > + > + TP_STRUCT__entry( > + __string( comm, task->comm ) The question is, do we really need comm? From your example, it's redundant: test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104 ^^^^ ^^^^ -- Steve > + __field( int, option) > + __field( unsigned long, arg2) > + __field( unsigned long, arg3) > + __field( unsigned long, arg4) > + __field( unsigned long, arg5) > + ), > + > + TP_fast_assign( > + __assign_str(comm); > + __entry->option = option; > + __entry->arg2 = arg2; > + __entry->arg3 = arg3; > + __entry->arg4 = arg4; > + __entry->arg5 = arg5; > + ), > + > + TP_printk("comm=%s option=%d arg2=%ld arg3=%ld arg4=%ld arg5=%ld", > + __get_str(comm), __entry->option, > + __entry->arg2, __entry->arg3, __entry->arg4, __entry->arg5) > +); > +
On 2024-11-07 07:25, Marco Elver wrote: > prctl() is a complex syscall which multiplexes its functionality based > on a large set of PR_* options. Currently we count 64 such options. The > return value of unknown options is -EINVAL, and doesn't distinguish from > known options that were passed invalid args that also return -EINVAL. > > To understand if programs are attempting to use prctl() options not yet > available on the running kernel, provide the task_prctl_unknown > tracepoint. > > Note, this tracepoint is in an unlikely cold path, and would therefore > be suitable for continuous monitoring (e.g. via perf_event_open). > > While the above is likely the simplest usecase, additionally this > tracepoint can help unlock some testing scenarios (where probing > sys_enter or sys_exit causes undesirable performance overheads): > > a. unprivileged triggering of a test module: test modules may register a > probe to be called back on task_prctl_unknown, and pick a very large > unknown prctl() option upon which they perform a test function for an > unprivileged user; > > b. unprivileged triggering of an eBPF program function: similar > as idea (a). > > Example trace_pipe output: > > test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104 > My concern is that we start adding tons of special-case tracepoints to the implementation of system calls which are redundant with the sys_enter/exit tracepoints. Why favor this approach rather than hooking on sys_enter/exit ? Thanks, Mathieu > Signed-off-by: Marco Elver <elver@google.com> > --- > v2: > * Remove "pid" in trace output (suggested by Steven). > --- > include/trace/events/task.h | 41 +++++++++++++++++++++++++++++++++++++ > kernel/sys.c | 3 +++ > 2 files changed, 44 insertions(+) > > diff --git a/include/trace/events/task.h b/include/trace/events/task.h > index 47b527464d1a..9202cb2524c4 100644 > --- a/include/trace/events/task.h > +++ b/include/trace/events/task.h > @@ -56,6 +56,47 @@ TRACE_EVENT(task_rename, > __entry->newcomm, __entry->oom_score_adj) > ); > > +/** > + * task_prctl_unknown - called on unknown prctl() option > + * @task: pointer to the current task > + * @option: option passed > + * @arg2: arg2 passed > + * @arg3: arg3 passed > + * @arg4: arg4 passed > + * @arg5: arg5 passed > + * > + * Called on an unknown prctl() option. > + */ > +TRACE_EVENT(task_prctl_unknown, > + > + TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3, > + unsigned long arg4, unsigned long arg5), > + > + TP_ARGS(task, option, arg2, arg3, arg4, arg5), > + > + TP_STRUCT__entry( > + __string( comm, task->comm ) > + __field( int, option) > + __field( unsigned long, arg2) > + __field( unsigned long, arg3) > + __field( unsigned long, arg4) > + __field( unsigned long, arg5) > + ), > + > + TP_fast_assign( > + __assign_str(comm); > + __entry->option = option; > + __entry->arg2 = arg2; > + __entry->arg3 = arg3; > + __entry->arg4 = arg4; > + __entry->arg5 = arg5; > + ), > + > + TP_printk("comm=%s option=%d arg2=%ld arg3=%ld arg4=%ld arg5=%ld", > + __get_str(comm), __entry->option, > + __entry->arg2, __entry->arg3, __entry->arg4, __entry->arg5) > +); > + > #endif > > /* This part must be outside protection */ > diff --git a/kernel/sys.c b/kernel/sys.c > index 4da31f28fda8..dd0a71b68558 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -75,6 +75,8 @@ > #include <asm/io.h> > #include <asm/unistd.h> > > +#include <trace/events/task.h> > + > #include "uid16.h" > > #ifndef SET_UNALIGN_CTL > @@ -2785,6 +2787,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, > error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3); > break; > default: > + trace_task_prctl_unknown(me, option, arg2, arg3, arg4, arg5); > error = -EINVAL; > break; > }
On Thu, 7 Nov 2024 at 16:45, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2024-11-07 07:25, Marco Elver wrote: > > prctl() is a complex syscall which multiplexes its functionality based > > on a large set of PR_* options. Currently we count 64 such options. The > > return value of unknown options is -EINVAL, and doesn't distinguish from > > known options that were passed invalid args that also return -EINVAL. > > > > To understand if programs are attempting to use prctl() options not yet > > available on the running kernel, provide the task_prctl_unknown > > tracepoint. > > > > Note, this tracepoint is in an unlikely cold path, and would therefore > > be suitable for continuous monitoring (e.g. via perf_event_open). > > > > While the above is likely the simplest usecase, additionally this > > tracepoint can help unlock some testing scenarios (where probing > > sys_enter or sys_exit causes undesirable performance overheads): > > > > a. unprivileged triggering of a test module: test modules may register a > > probe to be called back on task_prctl_unknown, and pick a very large > > unknown prctl() option upon which they perform a test function for an > > unprivileged user; > > > > b. unprivileged triggering of an eBPF program function: similar > > as idea (a). > > > > Example trace_pipe output: > > > > test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104 > > > > My concern is that we start adding tons of special-case > tracepoints to the implementation of system calls which > are redundant with the sys_enter/exit tracepoints. > > Why favor this approach rather than hooking on sys_enter/exit ? It's __extremely__ expensive when deployed at scale. See note in commit description above.
On Thu, 7 Nov 2024 at 16:34, Steven Rostedt <rostedt@goodmis.org> wrote: ... > > + TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3, > > + unsigned long arg4, unsigned long arg5), > > + > > + TP_ARGS(task, option, arg2, arg3, arg4, arg5), > > + > > + TP_STRUCT__entry( > > + __string( comm, task->comm ) > > The question is, do we really need comm? From your example, it's redundant: > > test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104 > ^^^^ ^^^^ Ack, let's remove it. I will also remove the "task" argument.
On Thu, 7 Nov 2024 16:46:47 +0100 Marco Elver <elver@google.com> wrote: > > My concern is that we start adding tons of special-case > > tracepoints to the implementation of system calls which > > are redundant with the sys_enter/exit tracepoints. > > > > Why favor this approach rather than hooking on sys_enter/exit ? > > It's __extremely__ expensive when deployed at scale. See note in > commit description above. Agreed. The sys_enter/exit trace events make all syscalls go the slow path, which can be quite expensive. -- Steve
On 2024-11-07 10:46, Marco Elver wrote: > On Thu, 7 Nov 2024 at 16:45, Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2024-11-07 07:25, Marco Elver wrote: >>> prctl() is a complex syscall which multiplexes its functionality based >>> on a large set of PR_* options. Currently we count 64 such options. The >>> return value of unknown options is -EINVAL, and doesn't distinguish from >>> known options that were passed invalid args that also return -EINVAL. >>> >>> To understand if programs are attempting to use prctl() options not yet >>> available on the running kernel, provide the task_prctl_unknown >>> tracepoint. >>> >>> Note, this tracepoint is in an unlikely cold path, and would therefore >>> be suitable for continuous monitoring (e.g. via perf_event_open). >>> >>> While the above is likely the simplest usecase, additionally this >>> tracepoint can help unlock some testing scenarios (where probing >>> sys_enter or sys_exit causes undesirable performance overheads): >>> >>> a. unprivileged triggering of a test module: test modules may register a >>> probe to be called back on task_prctl_unknown, and pick a very large >>> unknown prctl() option upon which they perform a test function for an >>> unprivileged user; >>> >>> b. unprivileged triggering of an eBPF program function: similar >>> as idea (a). >>> >>> Example trace_pipe output: >>> >>> test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104 >>> >> >> My concern is that we start adding tons of special-case >> tracepoints to the implementation of system calls which >> are redundant with the sys_enter/exit tracepoints. >> >> Why favor this approach rather than hooking on sys_enter/exit ? > > It's __extremely__ expensive when deployed at scale. See note in > commit description above. I suspect you base the overhead analysis on the x86-64 implementation of sys_enter/exit tracepoint and especially the overhead caused by the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct ? If that is causing a too large overhead, we should investigate if those can be improved instead of adding tracepoints in the implementation of system calls. Thanks, Mathieu
On Thu, 7 Nov 2024 at 16:54, Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2024-11-07 10:46, Marco Elver wrote: > > On Thu, 7 Nov 2024 at 16:45, Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 2024-11-07 07:25, Marco Elver wrote: > >>> prctl() is a complex syscall which multiplexes its functionality based > >>> on a large set of PR_* options. Currently we count 64 such options. The > >>> return value of unknown options is -EINVAL, and doesn't distinguish from > >>> known options that were passed invalid args that also return -EINVAL. > >>> > >>> To understand if programs are attempting to use prctl() options not yet > >>> available on the running kernel, provide the task_prctl_unknown > >>> tracepoint. > >>> > >>> Note, this tracepoint is in an unlikely cold path, and would therefore > >>> be suitable for continuous monitoring (e.g. via perf_event_open). > >>> > >>> While the above is likely the simplest usecase, additionally this > >>> tracepoint can help unlock some testing scenarios (where probing > >>> sys_enter or sys_exit causes undesirable performance overheads): > >>> > >>> a. unprivileged triggering of a test module: test modules may register a > >>> probe to be called back on task_prctl_unknown, and pick a very large > >>> unknown prctl() option upon which they perform a test function for an > >>> unprivileged user; > >>> > >>> b. unprivileged triggering of an eBPF program function: similar > >>> as idea (a). > >>> > >>> Example trace_pipe output: > >>> > >>> test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104 > >>> > >> > >> My concern is that we start adding tons of special-case > >> tracepoints to the implementation of system calls which > >> are redundant with the sys_enter/exit tracepoints. > >> > >> Why favor this approach rather than hooking on sys_enter/exit ? > > > > It's __extremely__ expensive when deployed at scale. See note in > > commit description above. > > I suspect you base the overhead analysis on the x86-64 implementation > of sys_enter/exit tracepoint and especially the overhead caused by > the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct ? > > If that is causing a too large overhead, we should investigate if > those can be improved instead of adding tracepoints in the > implementation of system calls. Doing that may be generally useful, but even if you improve it somehow, there's always some additional bit of work needed on sys_enter/exit as soon as a tracepoint is attached. Even if that's just a few cycles, it's too much (for me at least). Also: if you just hook sys_enter/exit, you don't know if the prctl was handled or not by inspecting the return code (-EINVAL). I want the kernel to tell me if it handled the prctl() or not, and I also think it's very bad design to copy-paste the prctl() option checking of the running kernel in a sys_enter/exit hook. This doesn't scale in terms of performance nor maintainability.
On Thu, 7 Nov 2024 10:52:37 -0500 Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > I suspect you base the overhead analysis on the x86-64 implementation > of sys_enter/exit tracepoint and especially the overhead caused by > the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct ? > > If that is causing a too large overhead, we should investigate if > those can be improved instead of adding tracepoints in the > implementation of system calls. That would be great to get better, but the reason I'm not against this patch is because prctl() is not a normal system call. It's basically an ioctl() for Linux, and very vague. It's basically the garbage system call when you don't know what to do. It's even being proposed for the sframe work. I understand your sentiment and agree. I don't want any random system call to get a tracepoint attached to it. But here I'd make an exception. -- Steve
On 2024-11-07 11:04, Steven Rostedt wrote: > On Thu, 7 Nov 2024 10:52:37 -0500 > Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > >> I suspect you base the overhead analysis on the x86-64 implementation >> of sys_enter/exit tracepoint and especially the overhead caused by >> the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct ? >> >> If that is causing a too large overhead, we should investigate if >> those can be improved instead of adding tracepoints in the >> implementation of system calls. > > That would be great to get better, but the reason I'm not against this > patch is because prctl() is not a normal system call. It's basically an > ioctl() for Linux, and very vague. It's basically the garbage system call > when you don't know what to do. It's even being proposed for the sframe > work. > > I understand your sentiment and agree. I don't want any random system call > to get a tracepoint attached to it. But here I'd make an exception. Should we document this as an "instrumentation good practice" then ? When the system call is a multiplexor such as ioctl(2) and prctl(2), then instrumenting it with tracepoints within each of the "op" case makes sense for overall maintainability. For non-multiplexor system calls, using the existing sys_enter/exit tracepoints should be favored. This opens the following question for non-multiplexors system calls: considering that the overhead of the current sys_enter/exit instrumentation is deemed to large to use in production, perhaps we should consider a few alternatives, namely: A) Modify SYSCALL_DEFINE so it emits a function wrapper with tracepoints for each system call enter/exit, except for multiplexors, or B) Add the plumbing required to allow system call tracing to be activated for specific system calls only, more fine-grained than the current system-wide for_each_process_thread() SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag big hammer. Another scenario to consider is system calls that have iovec arguments. Should we add tracepoint within the iovec iteration, or should it target the entire iovec as input/output at system call enter/exit ? Thanks, Mathieu
diff --git a/include/trace/events/task.h b/include/trace/events/task.h index 47b527464d1a..9202cb2524c4 100644 --- a/include/trace/events/task.h +++ b/include/trace/events/task.h @@ -56,6 +56,47 @@ TRACE_EVENT(task_rename, __entry->newcomm, __entry->oom_score_adj) ); +/** + * task_prctl_unknown - called on unknown prctl() option + * @task: pointer to the current task + * @option: option passed + * @arg2: arg2 passed + * @arg3: arg3 passed + * @arg4: arg4 passed + * @arg5: arg5 passed + * + * Called on an unknown prctl() option. + */ +TRACE_EVENT(task_prctl_unknown, + + TP_PROTO(struct task_struct *task, int option, unsigned long arg2, unsigned long arg3, + unsigned long arg4, unsigned long arg5), + + TP_ARGS(task, option, arg2, arg3, arg4, arg5), + + TP_STRUCT__entry( + __string( comm, task->comm ) + __field( int, option) + __field( unsigned long, arg2) + __field( unsigned long, arg3) + __field( unsigned long, arg4) + __field( unsigned long, arg5) + ), + + TP_fast_assign( + __assign_str(comm); + __entry->option = option; + __entry->arg2 = arg2; + __entry->arg3 = arg3; + __entry->arg4 = arg4; + __entry->arg5 = arg5; + ), + + TP_printk("comm=%s option=%d arg2=%ld arg3=%ld arg4=%ld arg5=%ld", + __get_str(comm), __entry->option, + __entry->arg2, __entry->arg3, __entry->arg4, __entry->arg5) +); + #endif /* This part must be outside protection */ diff --git a/kernel/sys.c b/kernel/sys.c index 4da31f28fda8..dd0a71b68558 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -75,6 +75,8 @@ #include <asm/io.h> #include <asm/unistd.h> +#include <trace/events/task.h> + #include "uid16.h" #ifndef SET_UNALIGN_CTL @@ -2785,6 +2787,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = RISCV_SET_ICACHE_FLUSH_CTX(arg2, arg3); break; default: + trace_task_prctl_unknown(me, option, arg2, arg3, arg4, arg5); error = -EINVAL; break; }
prctl() is a complex syscall which multiplexes its functionality based on a large set of PR_* options. Currently we count 64 such options. The return value of unknown options is -EINVAL, and doesn't distinguish from known options that were passed invalid args that also return -EINVAL. To understand if programs are attempting to use prctl() options not yet available on the running kernel, provide the task_prctl_unknown tracepoint. Note, this tracepoint is in an unlikely cold path, and would therefore be suitable for continuous monitoring (e.g. via perf_event_open). While the above is likely the simplest usecase, additionally this tracepoint can help unlock some testing scenarios (where probing sys_enter or sys_exit causes undesirable performance overheads): a. unprivileged triggering of a test module: test modules may register a probe to be called back on task_prctl_unknown, and pick a very large unknown prctl() option upon which they perform a test function for an unprivileged user; b. unprivileged triggering of an eBPF program function: similar as idea (a). Example trace_pipe output: test-484 [000] ..... 631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104 Signed-off-by: Marco Elver <elver@google.com> --- v2: * Remove "pid" in trace output (suggested by Steven). --- include/trace/events/task.h | 41 +++++++++++++++++++++++++++++++++++++ kernel/sys.c | 3 +++ 2 files changed, 44 insertions(+)