Message ID | 20241022151804.284424-1-mathieu.desnoyers@efficios.com (mailing list archive) |
---|---|
State | RFC |
Headers | show |
Series | [RFC] tracing: Fix syscall tracepoint use-after-free | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
I assume this patch isn't meant to fix the related issues with freeing BPF programs/links with call_rcu? On the BPF side I think there needs to be some smarter handling of when to use call_rcu or call_rcu_tasks_trace to free links/programs based on whether or not the program type can be executed in this context. Right now call_rcu_tasks_trace is used if the program is sleepable, but that isn't necessarily the case here. Off the top of my head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in here, as I'm not entirely sure. -Jordan
On 2024-10-22 12:14, Jordan Rife wrote: > I assume this patch isn't meant to fix the related issues with freeing > BPF programs/links with call_rcu? No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to choose between: if (sleepable) call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp); else call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp); But the faultable syscall tracepoint series does not require syscall programs to be sleepable. So some changes may be needed on the ebpf side there. > > On the BPF side I think there needs to be some smarter handling of > when to use call_rcu or call_rcu_tasks_trace to free links/programs > based on whether or not the program type can be executed in this > context. Right now call_rcu_tasks_trace is used if the program is > sleepable, but that isn't necessarily the case here. Off the top of my > head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and > BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to > BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in > here, as I'm not entirely sure. A big hammer solution would be to make all grace periods waited for after a bpf tracepoint probe unregister chain call_rcu and call_rcu_tasks_trace. Else, if we properly tag all programs attached to syscall tracepoints as sleepable, then keeping the call_rcu_tasks_trace() only for those would work. Thanks, Mathieu
On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2024-10-22 12:14, Jordan Rife wrote: > > I assume this patch isn't meant to fix the related issues with freeing > > BPF programs/links with call_rcu? > > No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to > choose between: > > if (sleepable) > call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp); > else > call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp); > > But the faultable syscall tracepoint series does not require syscall programs > to be sleepable. So some changes may be needed on the ebpf side there. Your fix now adds a chain of call_rcu -> call_rcu_tasks_trace -> kfree, which should work regardless of sleepable/non-sleepable. For the BPF-side, yes, we do different things depending on prog->sleepable (adding extra call_rcu_tasks_trace for sleepable, while still keeping call_rcu in the chain), so the BPF side should be good, I think. > > > > > On the BPF side I think there needs to be some smarter handling of > > when to use call_rcu or call_rcu_tasks_trace to free links/programs > > based on whether or not the program type can be executed in this > > context. Right now call_rcu_tasks_trace is used if the program is > > sleepable, but that isn't necessarily the case here. Off the top of my > > head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and > > BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to > > BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in > > here, as I'm not entirely sure. > From the BPF standpoint, as of right now, neither of RAW_TRACEPOINT or TRACEPOINT programs are sleepable. So a single RCU grace period is fine. But even if they were (and we'll allow that later on), we handle sleepable programs with the same call_rcu_tasks_trace -> call_rcu chain. That's just to say that I don't think that we need any BPF-specific fix beyond what Mathieu is doing in this patch, so: Acked-by: Andrii Nakryiko <andrii@kernel.org> > A big hammer solution would be to make all grace periods waited for after > a bpf tracepoint probe unregister chain call_rcu and call_rcu_tasks_trace. > > Else, if we properly tag all programs attached to syscall tracepoints as > sleepable, then keeping the call_rcu_tasks_trace() only for those would > work. > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On 2024-10-22 15:53, Andrii Nakryiko wrote: > On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2024-10-22 12:14, Jordan Rife wrote: >>> I assume this patch isn't meant to fix the related issues with freeing >>> BPF programs/links with call_rcu? >> >> No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to >> choose between: >> >> if (sleepable) >> call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp); >> else >> call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp); >> >> But the faultable syscall tracepoint series does not require syscall programs >> to be sleepable. So some changes may be needed on the ebpf side there. > > Your fix now adds a chain of call_rcu -> call_rcu_tasks_trace -> > kfree, which should work regardless of sleepable/non-sleepable. For > the BPF-side, yes, we do different things depending on prog->sleepable > (adding extra call_rcu_tasks_trace for sleepable, while still keeping > call_rcu in the chain), so the BPF side should be good, I think. > >> >>> >>> On the BPF side I think there needs to be some smarter handling of >>> when to use call_rcu or call_rcu_tasks_trace to free links/programs >>> based on whether or not the program type can be executed in this >>> context. Right now call_rcu_tasks_trace is used if the program is >>> sleepable, but that isn't necessarily the case here. Off the top of my >>> head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and >>> BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to >>> BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in >>> here, as I'm not entirely sure. >> > > From the BPF standpoint, as of right now, neither of RAW_TRACEPOINT or > TRACEPOINT programs are sleepable. So a single RCU grace period is > fine. But even if they were (and we'll allow that later on), we handle > sleepable programs with the same call_rcu_tasks_trace -> call_rcu > chain. Good points, in this commit: commit 4aadde89d8 ("tracing/bpf: disable preemption in syscall probe") I took care to disable preemption around use of the bpf program attached to a syscall tracepoint, which makes this change a no-op from the tracers' perspective. It's only when you'll decide to remove this preempt-off and allow syscall tracepoints to sleep in bpf that you'll need to tweak that. > > That's just to say that I don't think that we need any BPF-specific > fix beyond what Mathieu is doing in this patch, so: > > Acked-by: Andrii Nakryiko <andrii@kernel.org> Thanks! Mathieu > > >> A big hammer solution would be to make all grace periods waited for after >> a bpf tracepoint probe unregister chain call_rcu and call_rcu_tasks_trace. >> >> Else, if we properly tag all programs attached to syscall tracepoints as >> sleepable, then keeping the call_rcu_tasks_trace() only for those would >> work. >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c index 6474e2cf22c9..33f6fa94d383 100644 --- a/kernel/tracepoint.c +++ b/kernel/tracepoint.c @@ -101,11 +101,16 @@ static inline void *allocate_probes(int count) return p == NULL ? NULL : p->probes; } -static void rcu_free_old_probes(struct rcu_head *head) +static void rcu_tasks_trace_free_old_probes(struct rcu_head *head) { kfree(container_of(head, struct tp_probes, rcu)); } +static void rcu_free_old_probes(struct rcu_head *head) +{ + call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes); +} + static inline void release_probes(struct tracepoint_func *old) { if (old) {
The grace period used internally within tracepoint.c:release_probes() uses call_rcu() to batch waiting for quiescence of old probe arrays, rather than using the tracepoint_synchronize_unregister() which blocks while waiting for quiescence. This causes use-after-free issues reproduced with syzkaller. Fix this by introducing the following call_rcu chaining: call_rcu() -> rcu_free_old_probes -> call_rcu_tasks_trace() -> rcu_tasks_trace_free_old_probes. Just like it was done when SRCU was used. Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michael Jeanson <mjeanson@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Yonghong Song <yhs@fb.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com> Cc: bpf@vger.kernel.org Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Jordan Rife <jrife@google.com> --- kernel/tracepoint.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-)