[v1,4/8] tracing/bpf: guard syscall probe with preempt_notrace

Message ID 20241003151638.1608537-5-mathieu.desnoyers@efficios.com (mailing list archive)
State Superseded
Series tracing: Allow system call tracepoints to handle page faults

Checks

netdev/tree_selection: success (not a local patch)

Commit Message

Mathieu Desnoyers Oct. 3, 2024, 3:16 p.m. UTC
In preparation for allowing system call enter/exit instrumentation to
handle page faults, make sure that bpf can handle this change by
explicitly disabling preemption within the bpf system call tracepoint
probes to respect the current expectations within bpf tracing code.

This change does not yet allow bpf to take page faults per se within its
probe, but allows its existing probes to adapt to the upcoming change.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
---
 include/trace/bpf_probe.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

Comments

Steven Rostedt Oct. 3, 2024, 10:26 p.m. UTC | #1
On Thu,  3 Oct 2024 11:16:34 -0400
Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> In preparation for allowing system call enter/exit instrumentation to
> handle page faults, make sure that bpf can handle this change by
> explicitly disabling preemption within the bpf system call tracepoint
> probes to respect the current expectations within bpf tracing code.
> 
> This change does not yet allow bpf to take page faults per se within its
> probe, but allows its existing probes to adapt to the upcoming change.
> 

I guess the BPF folks should state whether this is needed or not?

Do the BPF hooks into the tracepoints expect preemption to be disabled
when called?

-- Steve


> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
> Tested-by: Andrii Nakryiko <andrii@kernel.org> # BPF parts
> Cc: Michael Jeanson <mjeanson@efficios.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Alexei Starovoitov <ast@kernel.org>
> Cc: Yonghong Song <yhs@fb.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> Cc: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
> Cc: Namhyung Kim <namhyung@kernel.org>
> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
> Cc: bpf@vger.kernel.org
> Cc: Joel Fernandes <joel@joelfernandes.org>
> ---
>  include/trace/bpf_probe.h | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> index c85bbce5aaa5..211b98d45fc6 100644
> --- a/include/trace/bpf_probe.h
> +++ b/include/trace/bpf_probe.h
> @@ -53,8 +53,17 @@ __bpf_trace_##call(void *__data, proto)					\
>  #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
>  	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
>  
> +#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)			\
> +static notrace void							\
> +__bpf_trace_##call(void *__data, proto)					\
> +{									\
> +	guard(preempt_notrace)();					\
> +	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));	\
> +}
> +
>  #undef DECLARE_EVENT_SYSCALL_CLASS
> -#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print)	\
> +	__BPF_DECLARE_TRACE_SYSCALL(call, PARAMS(proto), PARAMS(args))
>  
>  /*
>   * This part is compiled out, it is only here as a build time check
Alexei Starovoitov Oct. 3, 2024, 11:05 p.m. UTC | #2
On Thu, Oct 3, 2024 at 3:25 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Thu,  3 Oct 2024 11:16:34 -0400
> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>
> > In preparation for allowing system call enter/exit instrumentation to
> > handle page faults, make sure that bpf can handle this change by
> > explicitly disabling preemption within the bpf system call tracepoint
> > probes to respect the current expectations within bpf tracing code.
> >
> > This change does not yet allow bpf to take page faults per se within its
> > probe, but allows its existing probes to adapt to the upcoming change.
> >
>
> I guess the BPF folks should state whether this is needed or not?
>
> Do the BPF hooks into the tracepoints expect preemption to be disabled
> when called?

Andrii pointed it out already.
bpf doesn't need preemption to be disabled.
Only migration needs to be disabled.
Mathieu Desnoyers Oct. 4, 2024, 12:30 a.m. UTC | #3
On 2024-10-04 01:05, Alexei Starovoitov wrote:
> On Thu, Oct 3, 2024 at 3:25 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>>
>> On Thu,  3 Oct 2024 11:16:34 -0400
>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>
>>> In preparation for allowing system call enter/exit instrumentation to
>>> handle page faults, make sure that bpf can handle this change by
>>> explicitly disabling preemption within the bpf system call tracepoint
>>> probes to respect the current expectations within bpf tracing code.
>>>
>>> This change does not yet allow bpf to take page faults per se within its
>>> probe, but allows its existing probes to adapt to the upcoming change.
>>>
>>
>> I guess the BPF folks should state whether this is needed or not?
>>
>> Do the BPF hooks into the tracepoints expect preemption to be disabled
>> when called?
> 
> Andrii pointed it out already.
> bpf doesn't need preemption to be disabled.
> Only migration needs to be disabled.

I'm well aware of this. Feel free to relax those constraints in
follow up patches in your own tracers. I'm simply not introducing
any behavior change in the "big switch" patch introducing faultable
syscall tracepoints. It's just too easy to overlook a dependency on
preempt off deep inside some tracer code for me to make assumptions
at the tracepoint level.

If a regression happens, it will be caused by the tracer-specific
patch that relaxes the constraints, not by the tracepoint change
that affects multiple tracers at once.

Thanks,

Mathieu
Mathieu Desnoyers Oct. 4, 2024, 1:28 a.m. UTC | #4
On 2024-10-04 02:30, Mathieu Desnoyers wrote:
> On 2024-10-04 01:05, Alexei Starovoitov wrote:
>> On Thu, Oct 3, 2024 at 3:25 PM Steven Rostedt <rostedt@goodmis.org> 
>> wrote:
>>>
>>> On Thu,  3 Oct 2024 11:16:34 -0400
>>> Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
>>>
>>>> In preparation for allowing system call enter/exit instrumentation to
>>>> handle page faults, make sure that bpf can handle this change by
>>>> explicitly disabling preemption within the bpf system call tracepoint
>>>> probes to respect the current expectations within bpf tracing code.
>>>>
>>>> This change does not yet allow bpf to take page faults per se within 
>>>> its
>>>> probe, but allows its existing probes to adapt to the upcoming change.
>>>>
>>>
>>> I guess the BPF folks should state whether this is needed or not?
>>>
>>> Do the BPF hooks into the tracepoints expect preemption to be disabled
>>> when called?
>>
>> Andrii pointed it out already.
>> bpf doesn't need preemption to be disabled.
>> Only migration needs to be disabled.
> 
> I'm well aware of this. Feel free to relax those constraints in
> follow up patches in your own tracers. I'm simply not introducing
> any behavior change in the "big switch" patch introducing faultable
> syscall tracepoints. It's just too easy to overlook a dependency on
> preempt off deep inside some tracer code for me to make assumptions
> at the tracepoint level.
> 
> If a regression happens, it will be caused by the tracer-specific
> patch that relaxes the constraints, not by the tracepoint change
> that affects multiple tracers at once.

I also notice that the bpf verifier checks an "active_preempt_lock"
state to make sure sleepable functions are not called while within a
preempt-off region. So I would expect that the verifier has some
knowledge about the fact that tracepoint probes are called with
preempt off already.

Likewise in reverse for functions that deal with per-CPU data: those
would expect to be used with preemption off if multiple functions need
to touch the same CPU's data.

So if we make the syscall tracepoint constraints more relaxed (migrate
off rather than preempt off), I suspect we may have to update the
verifier.

This contributes to my uneasiness towards introducing this kind of
side-effect in a tracepoint change that affects all tracers.

Thanks,

Mathieu
Patch

diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
index c85bbce5aaa5..211b98d45fc6 100644
--- a/include/trace/bpf_probe.h
+++ b/include/trace/bpf_probe.h
@@ -53,8 +53,17 @@  __bpf_trace_##call(void *__data, proto)					\
 #define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print)	\
 	__BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
 
+#define __BPF_DECLARE_TRACE_SYSCALL(call, proto, args)			\
+static notrace void							\
+__bpf_trace_##call(void *__data, proto)					\
+{									\
+	guard(preempt_notrace)();					\
+	CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(__data, CAST_TO_U64(args));	\
+}
+
 #undef DECLARE_EVENT_SYSCALL_CLASS
-#define DECLARE_EVENT_SYSCALL_CLASS DECLARE_EVENT_CLASS
+#define DECLARE_EVENT_SYSCALL_CLASS(call, proto, args, tstruct, assign, print)	\
+	__BPF_DECLARE_TRACE_SYSCALL(call, PARAMS(proto), PARAMS(args))
 
 /*
  * This part is compiled out, it is only here as a build time check