Message ID | 20240729114608.1792954-2-radoslaw.zielonek@gmail.com (mailing list archive) |
---|---|
State | RFC |
Delegated to: | BPF |
Headers | show |
Series | [RFC] Printk deadlock in bpf trace called from scheduler context | expand |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
bpf/vmtest-bpf-PR | fail | merge-conflict |
On Mon, Jul 29, 2024 at 01:46:09PM +0200, Radoslaw Zielonek wrote: > I am currently working on a syzbot-reported bug where bpf > is called from trace_sched_switch. In this scenario, we are still within > the scheduler context, and calling printk can create a deadlock. > > I am uncertain about the best approach to fix this issue. It's been like this forever, it doesn't need fixing, because tracepoints shouldn't be doing printk() in the first place. > Should we simply forbid such calls, or perhaps we should replace printk > with printk_deferred in the bpf where we are still in scheduler context? Not doing printk() is best.
On Mon, 29 Jul 2024 at 14:27, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, Jul 29, 2024 at 01:46:09PM +0200, Radoslaw Zielonek wrote: > > I am currently working on a syzbot-reported bug where bpf > > is called from trace_sched_switch. In this scenario, we are still within > > the scheduler context, and calling printk can create a deadlock. > > > > I am uncertain about the best approach to fix this issue. > > It's been like this forever, it doesn't need fixing, because tracepoints > shouldn't be doing printk() in the first place. > > > Should we simply forbid such calls, or perhaps we should replace printk > > with printk_deferred in the bpf where we are still in scheduler context? > > Not doing printk() is best. And teaching more debugging tools to behave. This particular case originates from fault injection: > [ 60.265518][ T8343] should_fail_ex+0x383/0x4d0 > [ 60.265547][ T8343] strncpy_from_user+0x36/0x2d0 > [ 60.265601][ T8343] strncpy_from_user_nofault+0x70/0x140 > [ 60.265637][ T8343] bpf_probe_read_user_str+0x2a/0x70 Probably the fail_dump() function in lib/fault-inject.c being a little too verbose in this case. Radoslaw, the fix should be in lib/fault-inject.c. Similar to other debugging tools (like KFENCE, which you discovered) adding lockdep_off()/lockdep_on(), prink_deferred, or not being as verbose in this context may be more appropriate. Fault injection does not need to print a message to inject a fault - the message is for debugging purposes. Probably a reasonable compromise is to use printk_deferred() in fail_dump() if in this context to still help with debugging on a best effort basis. You also need to take care to avoid dumping the stack in fail_dump().
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h index 9ea4c404bd4e..a5eb2bdd832c 100644 --- a/include/trace/events/sched.h +++ b/include/trace/events/sched.h @@ -10,6 +10,8 @@ #include <linux/tracepoint.h> #include <linux/binfmts.h> +extern DEFINE_PER_CPU(bool, is_sched_trace_context); + /* * Tracepoint for calling kthread_stop, performed to end a kthread: */ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index ae5ef3013a55..f990b1aa0169 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -119,6 +119,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(sched_compute_energy_tp); DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues); +DEFINE_PER_CPU(bool, is_sched_trace_context) = false; + #ifdef CONFIG_SCHED_DEBUG /* * Debugging: various feature bits @@ -6523,7 +6525,9 @@ static void __sched notrace __schedule(unsigned int sched_mode) psi_account_irqtime(rq, prev, next); psi_sched_switch(prev, next, !task_on_rq_queued(prev)); + this_cpu_write(is_sched_trace_context, true); trace_sched_switch(sched_mode & SM_MASK_PREEMPT, prev, next, prev_state); + this_cpu_write(is_sched_trace_context, false); /* Also unlocks the rq: */ rq = context_switch(rq, prev, next, &rf); diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index cd098846e251..35a9b401d28a 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -36,6 +36,8 @@ #include "trace_probe.h" #include "trace.h" +extern DEFINE_PER_CPU(bool, is_sched_trace_context); + #define CREATE_TRACE_POINTS #include "bpf_trace.h" @@ -2402,10 +2404,27 @@ void __bpf_trace_run(struct bpf_raw_tp_link *link, u64 *args) run_ctx.bpf_cookie = link->cookie; old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx); + + /* + * STOLEN from kfence_report_error + * Because we may generate reports in printk-unfriendly parts of the + * kernel, such as scheduler code, the use of printk() could deadlock. + * Until such time that all printing code here is safe in all parts of + * the kernel, accept the risk, and just get our message out (given the + * system might already behave unpredictably due to the memory error). + * As such, also disable lockdep to hide warnings, and avoid disabling + * lockdep for the rest of the kernel. + */ + if (this_cpu_read(is_sched_trace_context)) + lockdep_off(); + rcu_read_lock(); (void) bpf_prog_run(prog, args); rcu_read_unlock(); + if (this_cpu_read(is_sched_trace_context)) + lockdep_on(); + bpf_reset_run_ctx(old_run_ctx); out: this_cpu_dec(*(prog->active));