diff mbox series

[RFC] tracing: Fix syscall tracepoint use-after-free

Message ID 20241022151804.284424-1-mathieu.desnoyers@efficios.com (mailing list archive)
State RFC
Headers show
Series [RFC] tracing: Fix syscall tracepoint use-after-free | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Mathieu Desnoyers Oct. 22, 2024, 3:18 p.m. UTC
The grace period used internally within tracepoint.c:release_probes()
uses call_rcu() to batch waiting for quiescence of old probe arrays,
rather than using the tracepoint_synchronize_unregister() which blocks
while waiting for quiescence.

This causes use-after-free issues reproduced with syzkaller.

Fix this by introducing the following call_rcu chaining:

   call_rcu() -> rcu_free_old_probes -> call_rcu_tasks_trace() -> rcu_tasks_trace_free_old_probes.

Just like it was done when SRCU was used.

Reported-by: syzbot+b390c8062d8387b6272a@syzkaller.appspotmail.com
Fixes: a363d27cdbc2 ("tracing: Allow system call tracepoints to handle page faults")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michael Jeanson <mjeanson@efficios.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Yonghong Song <yhs@fb.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: bpf@vger.kernel.org
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Jordan Rife <jrife@google.com>
---
 kernel/tracepoint.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Jordan Rife Oct. 22, 2024, 4:14 p.m. UTC | #1
I assume this patch isn't meant to fix the related issues with freeing
BPF programs/links with call_rcu?

On the BPF side I think there needs to be some smarter handling of
when to use call_rcu or call_rcu_tasks_trace to free links/programs
based on whether or not the program type can be executed in this
context. Right now call_rcu_tasks_trace is used if the program is
sleepable, but that isn't necessarily the case here. Off the top of my
head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
here, as I'm not entirely sure.

-Jordan
Mathieu Desnoyers Oct. 22, 2024, 5:54 p.m. UTC | #2
On 2024-10-22 12:14, Jordan Rife wrote:
> I assume this patch isn't meant to fix the related issues with freeing
> BPF programs/links with call_rcu?

No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to
choose between:

                 if (sleepable)
                         call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
                 else
                         call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);

But the faultable syscall tracepoint series does not require syscall programs
to be sleepable. So some changes may be needed on the ebpf side there.

> 
> On the BPF side I think there needs to be some smarter handling of
> when to use call_rcu or call_rcu_tasks_trace to free links/programs
> based on whether or not the program type can be executed in this
> context. Right now call_rcu_tasks_trace is used if the program is
> sleepable, but that isn't necessarily the case here. Off the top of my
> head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
> BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
> BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
> here, as I'm not entirely sure.

A big hammer solution would be to make all grace periods waited for after
a bpf tracepoint probe unregister chain call_rcu and call_rcu_tasks_trace.

Else, if we properly tag all programs attached to syscall tracepoints as
sleepable, then keeping the call_rcu_tasks_trace() only for those would
work.

Thanks,

Mathieu
Andrii Nakryiko Oct. 22, 2024, 7:53 p.m. UTC | #3
On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers
<mathieu.desnoyers@efficios.com> wrote:
>
> On 2024-10-22 12:14, Jordan Rife wrote:
> > I assume this patch isn't meant to fix the related issues with freeing
> > BPF programs/links with call_rcu?
>
> No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to
> choose between:
>
>                  if (sleepable)
>                          call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>                  else
>                          call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
>
> But the faultable syscall tracepoint series does not require syscall programs
> to be sleepable. So some changes may be needed on the ebpf side there.

Your fix now adds a chain of call_rcu -> call_rcu_tasks_trace ->
kfree, which should work regardless of sleepable/non-sleepable. For
the BPF-side, yes, we do different things depending on prog->sleepable
(adding extra call_rcu_tasks_trace for sleepable, while still keeping
call_rcu in the chain), so the BPF side should be good, I think.

>
> >
> > On the BPF side I think there needs to be some smarter handling of
> > when to use call_rcu or call_rcu_tasks_trace to free links/programs
> > based on whether or not the program type can be executed in this
> > context. Right now call_rcu_tasks_trace is used if the program is
> > sleepable, but that isn't necessarily the case here. Off the top of my
> > head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
> > BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
> > BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
> > here, as I'm not entirely sure.
>

From the BPF standpoint, as of right now, neither of RAW_TRACEPOINT or
TRACEPOINT programs are sleepable. So a single RCU grace period is
fine. But even if they were (and we'll allow that later on), we handle
sleepable programs with the same call_rcu_tasks_trace -> call_rcu
chain.

That's just to say that I don't think that we need any BPF-specific
fix beyond what Mathieu is doing in this patch, so:

Acked-by: Andrii Nakryiko <andrii@kernel.org>


> A big hammer solution would be to make all grace periods waited for after
> a bpf tracepoint probe unregister chain call_rcu and call_rcu_tasks_trace.
>
> Else, if we properly tag all programs attached to syscall tracepoints as
> sleepable, then keeping the call_rcu_tasks_trace() only for those would
> work.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
>
Mathieu Desnoyers Oct. 22, 2024, 8:04 p.m. UTC | #4
On 2024-10-22 15:53, Andrii Nakryiko wrote:
> On Tue, Oct 22, 2024 at 10:55 AM Mathieu Desnoyers
> <mathieu.desnoyers@efficios.com> wrote:
>>
>> On 2024-10-22 12:14, Jordan Rife wrote:
>>> I assume this patch isn't meant to fix the related issues with freeing
>>> BPF programs/links with call_rcu?
>>
>> No, indeed. I notice that bpf_link_free() uses a prog->sleepable flag to
>> choose between:
>>
>>                   if (sleepable)
>>                           call_rcu_tasks_trace(&link->rcu, bpf_link_defer_dealloc_mult_rcu_gp);
>>                   else
>>                           call_rcu(&link->rcu, bpf_link_defer_dealloc_rcu_gp);
>>
>> But the faultable syscall tracepoint series does not require syscall programs
>> to be sleepable. So some changes may be needed on the ebpf side there.
> 
> Your fix now adds a chain of call_rcu -> call_rcu_tasks_trace ->
> kfree, which should work regardless of sleepable/non-sleepable. For
> the BPF-side, yes, we do different things depending on prog->sleepable
> (adding extra call_rcu_tasks_trace for sleepable, while still keeping
> call_rcu in the chain), so the BPF side should be good, I think.
> 
>>
>>>
>>> On the BPF side I think there needs to be some smarter handling of
>>> when to use call_rcu or call_rcu_tasks_trace to free links/programs
>>> based on whether or not the program type can be executed in this
>>> context. Right now call_rcu_tasks_trace is used if the program is
>>> sleepable, but that isn't necessarily the case here. Off the top of my
>>> head this would be BPF_PROG_TYPE_RAW_TRACEPOINT and
>>> BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, but may extend to
>>> BPF_PROG_TYPE_TRACEPOINT? I'll let some of the BPF folks chime in
>>> here, as I'm not entirely sure.
>>
> 
>  From the BPF standpoint, as of right now, neither of RAW_TRACEPOINT or
> TRACEPOINT programs are sleepable. So a single RCU grace period is
> fine. But even if they were (and we'll allow that later on), we handle
> sleepable programs with the same call_rcu_tasks_trace -> call_rcu
> chain.

Good points, in this commit:

commit 4aadde89d8 ("tracing/bpf: disable preemption in syscall probe")
I took care to disable preemption around use of the bpf program attached
to a syscall tracepoint, which makes this change a no-op from the
tracers' perspective.

It's only when you'll decide to remove this preempt-off and allow
syscall tracepoints to sleep in bpf that you'll need to tweak that.

> 
> That's just to say that I don't think that we need any BPF-specific
> fix beyond what Mathieu is doing in this patch, so:
> 
> Acked-by: Andrii Nakryiko <andrii@kernel.org>

Thanks!

Mathieu

> 
> 
>> A big hammer solution would be to make all grace periods waited for after
>> a bpf tracepoint probe unregister chain call_rcu and call_rcu_tasks_trace.
>>
>> Else, if we properly tag all programs attached to syscall tracepoints as
>> sleepable, then keeping the call_rcu_tasks_trace() only for those would
>> work.
>>
>> Thanks,
>>
>> Mathieu
>>
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>
diff mbox series

Patch

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 6474e2cf22c9..33f6fa94d383 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -101,11 +101,16 @@  static inline void *allocate_probes(int count)
 	return p == NULL ? NULL : p->probes;
 }
 
-static void rcu_free_old_probes(struct rcu_head *head)
+static void rcu_tasks_trace_free_old_probes(struct rcu_head *head)
 {
 	kfree(container_of(head, struct tp_probes, rcu));
 }
 
+static void rcu_free_old_probes(struct rcu_head *head)
+{
+	call_rcu_tasks_trace(head, rcu_tasks_trace_free_old_probes);
+}
+
 static inline void release_probes(struct tracepoint_func *old)
 {
 	if (old) {