Message ID | 20210324112503.623833-2-elver@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Add support for synchronous signals on perf events | expand |
On Wed, Mar 24, 2021 at 12:24PM +0100, Marco Elver wrote: > From: Peter Zijlstra <peterz@infradead.org> > > Make perf_event_exit_event() more robust, such that we can use it from > other contexts. Specifically the up and coming remove_on_exec. > > For this to work we need to address a few issues. Remove_on_exec will > not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to > disable event_function_call() and we thus have to use > perf_remove_from_context(). > > When using perf_remove_from_context(), there's two races to consider. > The first is against close(), where we can have concurrent tear-down > of the event. The second is against child_list iteration, which should > not find a half baked event. > > To address this, teach perf_remove_from_context() to special case > !ctx->is_active and about DETACH_CHILD. > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > Signed-off-by: Marco Elver <elver@google.com> > --- > v3: > * New dependency for series: > https://lkml.kernel.org/r/YFn/I3aKF+TOjGcl@hirez.programming.kicks-ass.net > --- syzkaller found a crash with stack trace pointing at changes in this patch. Can't tell if this is an old issue or introduced in this series. It looks like task_pid_ptr() wants to access task_struct::signal, but the task_struct pointer is NULL. Any ideas? general protection fault, probably for non-canonical address 0xdffffc0000000103: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000818-0x000000000000081f] CPU: 2 PID: 15084 Comm: syz-executor.1 Not tainted 5.12.0-rc4+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 RIP: 0010:task_pid_ptr kernel/pid.c:325 [inline] RIP: 0010:__task_pid_nr_ns+0x137/0x3e0 kernel/pid.c:500 Code: 8b 75 00 eb 08 e8 59 28 29 00 45 31 f6 31 ff 44 89 fe e8 5c 2c 29 00 45 85 ff 74 49 48 81 c3 20 08 00 00 48 89 d8 48 c1 e8 03 <42> 80 3c 20 00 74 08 48 89 df e8 aa 03 6d 00 48 8b 2b 44 89 fb bf RSP: 0018:ffffc9000c76f6d0 EFLAGS: 00010007 RAX: 0000000000000103 RBX: 000000000000081f RCX: ffff8880717d8000 RDX: ffff8880717d8000 RSI: 0000000000000001 RDI: 0000000000000000 RBP: 0000000000000001 R08: ffffffff814fe814 R09: fffffbfff1f296b1 R10: fffffbfff1f296b1 R11: 0000000000000000 R12: dffffc0000000000 R13: 1ffff1100e6dfc5c R14: ffff888057fba108 R15: 0000000000000001 FS: 0000000000000000(0000) GS:ffff88802cf00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffcc3b05bc0 CR3: 0000000040ac0000 CR4: 0000000000750ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600 PKRU: 55555554 Call Trace: perf_event_pid_type kernel/events/core.c:1412 [inline] perf_event_pid kernel/events/core.c:1421 [inline] perf_event_read_event kernel/events/core.c:7511 [inline] sync_child_event kernel/events/core.c:12521 [inline] perf_child_detach kernel/events/core.c:2223 [inline] __perf_remove_from_context+0x569/0xd30 kernel/events/core.c:2359 perf_remove_from_context+0x19d/0x220 kernel/events/core.c:2395 perf_event_exit_event+0x76/0x950 kernel/events/core.c:12559 perf_event_exit_task_context kernel/events/core.c:12640 [inline] perf_event_exit_task+0x715/0xa40 kernel/events/core.c:12673 do_exit+0x6c2/0x2290 kernel/exit.c:834 do_group_exit+0x168/0x2d0 kernel/exit.c:922 get_signal+0x1734/0x1ef0 kernel/signal.c:2779 arch_do_signal_or_restart+0x41/0x620 arch/x86/kernel/signal.c:789 handle_signal_work kernel/entry/common.c:147 [inline] exit_to_user_mode_loop kernel/entry/common.c:171 [inline] exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:208 irqentry_exit_to_user_mode+0x6/0x40 kernel/entry/common.c:314 exc_general_protection+0x222/0x370 arch/x86/kernel/traps.c:530 asm_exc_general_protection+0x1e/0x30 arch/x86/include/asm/idtentry.h:571
On Thu, Mar 25, 2021 at 11:17AM +0100, Marco Elver wrote: > On Wed, Mar 24, 2021 at 12:24PM +0100, Marco Elver wrote: > > From: Peter Zijlstra <peterz@infradead.org> > > > > Make perf_event_exit_event() more robust, such that we can use it from > > other contexts. Specifically the up and coming remove_on_exec. > > > > For this to work we need to address a few issues. Remove_on_exec will > > not destroy the entire context, so we cannot rely on TASK_TOMBSTONE to > > disable event_function_call() and we thus have to use > > perf_remove_from_context(). > > > > When using perf_remove_from_context(), there's two races to consider. > > The first is against close(), where we can have concurrent tear-down > > of the event. The second is against child_list iteration, which should > > not find a half baked event. > > > > To address this, teach perf_remove_from_context() to special case > > !ctx->is_active and about DETACH_CHILD. > > > > Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> > > Signed-off-by: Marco Elver <elver@google.com> > > --- > > v3: > > * New dependency for series: > > https://lkml.kernel.org/r/YFn/I3aKF+TOjGcl@hirez.programming.kicks-ass.net > > --- > > syzkaller found a crash with stack trace pointing at changes in this > patch. Can't tell if this is an old issue or introduced in this series. Yay, I found a reproducer. v5.12-rc4 is good, and sadly with this patch only we crash. :-/ Here's a stacktrace with just this patch applied: | BUG: kernel NULL pointer dereference, address: 00000000000007af | #PF: supervisor read access in kernel mode | #PF: error_code(0x0000) - not-present page | PGD 0 P4D 0 | Oops: 0000 [#1] PREEMPT SMP PTI | CPU: 7 PID: 465 Comm: a.out Not tainted 5.12.0-rc4+ #25 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 | RIP: 0010:task_pid_ptr kernel/pid.c:324 [inline] | RIP: 0010:__task_pid_nr_ns+0x112/0x240 kernel/pid.c:500 | Code: e8 13 55 07 00 e8 1e a6 0e 00 48 c7 c6 83 1e 0b 81 48 c7 c7 a0 2e d5 82 e8 4b 08 04 00 44 89 e0 5b 5d 41 5c c3 e8 fe a5 0e 00 <48> 8b 85 b0 07 00 00 4a 8d ac e0 98 01 00 00 e9 5a ff ff ff e8 e5 | RSP: 0000:ffffc90001b73a60 EFLAGS: 00010093 | RAX: 0000000000000000 RBX: ffffffff82c69820 RCX: ffffffff810b1eb2 | RDX: ffff888108d143c0 RSI: 0000000000000000 RDI: ffffffff8299ccc6 | RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000000 | R10: ffff888108d14db8 R11: 0000000000000000 R12: 0000000000000001 | R13: ffffffffffffffff R14: ffffffffffffffff R15: ffff888108e05240 | FS: 0000000000000000(0000) GS:ffff88842fdc0000(0000) knlGS:0000000000000000 | CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 | CR2: 00000000000007af CR3: 0000000002c22002 CR4: 0000000000770ee0 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 | PKRU: 55555554 | Call Trace: | perf_event_pid_type kernel/events/core.c:1412 [inline] | perf_event_pid kernel/events/core.c:1421 [inline] | perf_event_read_event+0x78/0x1d0 kernel/events/core.c:7406 | sync_child_event kernel/events/core.c:12404 [inline] | perf_child_detach kernel/events/core.c:2223 [inline] | __perf_remove_from_context+0x14d/0x280 kernel/events/core.c:2359 | perf_remove_from_context+0x9f/0xf0 kernel/events/core.c:2395 | perf_event_exit_event kernel/events/core.c:12442 [inline] | perf_event_exit_task_context kernel/events/core.c:12523 [inline] | perf_event_exit_task+0x276/0x4c0 kernel/events/core.c:12556 | do_exit+0x4cd/0xed0 kernel/exit.c:834 | do_group_exit+0x4d/0xf0 kernel/exit.c:922 | get_signal+0x1d2/0xf30 kernel/signal.c:2777 | arch_do_signal_or_restart+0xf7/0x750 arch/x86/kernel/signal.c:789 | handle_signal_work kernel/entry/common.c:147 [inline] | exit_to_user_mode_loop kernel/entry/common.c:171 [inline] | exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:208 | irqentry_exit_to_user_mode+0x6/0x30 kernel/entry/common.c:314 | asm_exc_general_protection+0x1e/0x30 arch/x86/include/asm/idtentry.h:571 Attached is a C reproducer of the syzkaller program that crashes us. Thanks, -- Marco
On Thu, Mar 25, 2021 at 05:17PM +0100, Marco Elver wrote: [...] > > syzkaller found a crash with stack trace pointing at changes in this > > patch. Can't tell if this is an old issue or introduced in this series. > > Yay, I found a reproducer. v5.12-rc4 is good, and sadly with this patch only we > crash. :-/ > > Here's a stacktrace with just this patch applied: > > | BUG: kernel NULL pointer dereference, address: 00000000000007af [...] > | RIP: 0010:task_pid_ptr kernel/pid.c:324 [inline] > | RIP: 0010:__task_pid_nr_ns+0x112/0x240 kernel/pid.c:500 [...] > | Call Trace: > | perf_event_pid_type kernel/events/core.c:1412 [inline] > | perf_event_pid kernel/events/core.c:1421 [inline] > | perf_event_read_event+0x78/0x1d0 kernel/events/core.c:7406 > | sync_child_event kernel/events/core.c:12404 [inline] > | perf_child_detach kernel/events/core.c:2223 [inline] > | __perf_remove_from_context+0x14d/0x280 kernel/events/core.c:2359 > | perf_remove_from_context+0x9f/0xf0 kernel/events/core.c:2395 > | perf_event_exit_event kernel/events/core.c:12442 [inline] > | perf_event_exit_task_context kernel/events/core.c:12523 [inline] > | perf_event_exit_task+0x276/0x4c0 kernel/events/core.c:12556 > | do_exit+0x4cd/0xed0 kernel/exit.c:834 > | do_group_exit+0x4d/0xf0 kernel/exit.c:922 > | get_signal+0x1d2/0xf30 kernel/signal.c:2777 > | arch_do_signal_or_restart+0xf7/0x750 arch/x86/kernel/signal.c:789 > | handle_signal_work kernel/entry/common.c:147 [inline] > | exit_to_user_mode_loop kernel/entry/common.c:171 [inline] > | exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:208 > | irqentry_exit_to_user_mode+0x6/0x30 kernel/entry/common.c:314 > | asm_exc_general_protection+0x1e/0x30 arch/x86/include/asm/idtentry.h:571 I spun up gdb, and it showed me this: | #0 perf_event_read_event (event=event@entry=0xffff888107cd5000, task=task@entry=0xffffffffffffffff) | at kernel/events/core.c:7397 ^^^ TASK_TOMBSTONE | #1 0xffffffff811fc9cd in sync_child_event (child_event=0xffff888107cd5000) at kernel/events/core.c:12404 | #2 perf_child_detach (event=0xffff888107cd5000) at kernel/events/core.c:2223 | #3 __perf_remove_from_context (event=event@entry=0xffff888107cd5000, cpuctx=cpuctx@entry=0xffff88842fdf0c00, | ctx=ctx@entry=0xffff8881073cb800, info=info@entry=0x3 <fixed_percpu_data+3>) at kernel/events/core.c:2359 | #4 0xffffffff811fcb9f in perf_remove_from_context (event=event@entry=0xffff888107cd5000, flags=flags@entry=3) | at kernel/events/core.c:2395 | #5 0xffffffff81204526 in perf_event_exit_event (ctx=0xffff8881073cb800, event=0xffff888107cd5000) | at kernel/events/core.c:12442 | #6 perf_event_exit_task_context (ctxn=0, child=0xffff88810531a200) at kernel/events/core.c:12523 | #7 perf_event_exit_task (child=0xffff88810531a200) at kernel/events/core.c:12556 | #8 0xffffffff8108838d in do_exit (code=code@entry=11) at kernel/exit.c:834 | #9 0xffffffff81088e4d in do_group_exit (exit_code=11) at kernel/exit.c:922 and therefore synthesized this fix on top: diff --git a/kernel/events/core.c b/kernel/events/core.c index 57de8d436efd..e77294c7e654 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -12400,7 +12400,7 @@ static void sync_child_event(struct perf_event *child_event) if (child_event->attr.inherit_stat) { struct task_struct *task = child_event->ctx->task; - if (task) + if (task && task != TASK_TOMBSTONE) perf_event_read_event(child_event, task); } which fixes the problem. My guess is that the parent and child are both racing to exit? Does that make any sense? Thanks, -- Marco
On Thu, Mar 25, 2021 at 08:10:51PM +0100, Marco Elver wrote: > and therefore synthesized this fix on top: > > diff --git a/kernel/events/core.c b/kernel/events/core.c > index 57de8d436efd..e77294c7e654 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -12400,7 +12400,7 @@ static void sync_child_event(struct perf_event *child_event) > if (child_event->attr.inherit_stat) { > struct task_struct *task = child_event->ctx->task; > > - if (task) > + if (task && task != TASK_TOMBSTONE) > perf_event_read_event(child_event, task); > } > > which fixes the problem. My guess is that the parent and child are both > racing to exit? > > Does that make any sense? Yes, I think it does. ACK
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index 3f7f89ea5e51..3d478abf411c 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -607,6 +607,7 @@ struct swevent_hlist { #define PERF_ATTACH_TASK_DATA 0x08 #define PERF_ATTACH_ITRACE 0x10 #define PERF_ATTACH_SCHED_CB 0x20 +#define PERF_ATTACH_CHILD 0x40 struct perf_cgroup; struct perf_buffer; diff --git a/kernel/events/core.c b/kernel/events/core.c index 03db40f6cba9..57de8d436efd 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -2204,6 +2204,26 @@ static void perf_group_detach(struct perf_event *event) perf_event__header_size(leader); } +static void sync_child_event(struct perf_event *child_event); + +static void perf_child_detach(struct perf_event *event) +{ + struct perf_event *parent_event = event->parent; + + if (!(event->attach_state & PERF_ATTACH_CHILD)) + return; + + event->attach_state &= ~PERF_ATTACH_CHILD; + + if (WARN_ON_ONCE(!parent_event)) + return; + + lockdep_assert_held(&parent_event->child_mutex); + + sync_child_event(event); + list_del_init(&event->child_list); +} + static bool is_orphaned_event(struct perf_event *event) { return event->state == PERF_EVENT_STATE_DEAD; @@ -2311,6 +2331,7 @@ group_sched_out(struct perf_event *group_event, } #define DETACH_GROUP 0x01UL +#define DETACH_CHILD 0x02UL /* * Cross CPU call to remove a performance event @@ -2334,6 +2355,8 @@ __perf_remove_from_context(struct perf_event *event, event_sched_out(event, cpuctx, ctx); if (flags & DETACH_GROUP) perf_group_detach(event); + if (flags & DETACH_CHILD) + perf_child_detach(event); list_del_event(event, ctx); if (!ctx->nr_events && ctx->is_active) { @@ -2362,25 +2385,21 @@ static void perf_remove_from_context(struct perf_event *event, unsigned long fla lockdep_assert_held(&ctx->mutex); - event_function_call(event, __perf_remove_from_context, (void *)flags); - /* - * The above event_function_call() can NO-OP when it hits - * TASK_TOMBSTONE. In that case we must already have been detached - * from the context (by perf_event_exit_event()) but the grouping - * might still be in-tact. + * Because of perf_event_exit_task(), perf_remove_from_context() ought + * to work in the face of TASK_TOMBSTONE, unlike every other + * event_function_call() user. */ - WARN_ON_ONCE(event->attach_state & PERF_ATTACH_CONTEXT); - if ((flags & DETACH_GROUP) && - (event->attach_state & PERF_ATTACH_GROUP)) { - /* - * Since in that case we cannot possibly be scheduled, simply - * detach now. - */ - raw_spin_lock_irq(&ctx->lock); - perf_group_detach(event); + raw_spin_lock_irq(&ctx->lock); + if (!ctx->is_active) { + __perf_remove_from_context(event, __get_cpu_context(ctx), + ctx, (void *)flags); raw_spin_unlock_irq(&ctx->lock); + return; } + raw_spin_unlock_irq(&ctx->lock); + + event_function_call(event, __perf_remove_from_context, (void *)flags); } /* @@ -12373,14 +12392,17 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu) } EXPORT_SYMBOL_GPL(perf_pmu_migrate_context); -static void sync_child_event(struct perf_event *child_event, - struct task_struct *child) +static void sync_child_event(struct perf_event *child_event) { struct perf_event *parent_event = child_event->parent; u64 child_val; - if (child_event->attr.inherit_stat) - perf_event_read_event(child_event, child); + if (child_event->attr.inherit_stat) { + struct task_struct *task = child_event->ctx->task; + + if (task) + perf_event_read_event(child_event, task); + } child_val = perf_event_count(child_event); @@ -12395,60 +12417,53 @@ static void sync_child_event(struct perf_event *child_event, } static void -perf_event_exit_event(struct perf_event *child_event, - struct perf_event_context *child_ctx, - struct task_struct *child) +perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx) { - struct perf_event *parent_event = child_event->parent; + struct perf_event *parent_event = event->parent; + unsigned long detach_flags = 0; - /* - * Do not destroy the 'original' grouping; because of the context - * switch optimization the original events could've ended up in a - * random child task. - * - * If we were to destroy the original group, all group related - * operations would cease to function properly after this random - * child dies. - * - * Do destroy all inherited groups, we don't care about those - * and being thorough is better. - */ - raw_spin_lock_irq(&child_ctx->lock); - WARN_ON_ONCE(child_ctx->is_active); + if (parent_event) { + /* + * Do not destroy the 'original' grouping; because of the + * context switch optimization the original events could've + * ended up in a random child task. + * + * If we were to destroy the original group, all group related + * operations would cease to function properly after this + * random child dies. + * + * Do destroy all inherited groups, we don't care about those + * and being thorough is better. + */ + detach_flags = DETACH_GROUP | DETACH_CHILD; + mutex_lock(&parent_event->child_mutex); + } - if (parent_event) - perf_group_detach(child_event); - list_del_event(child_event, child_ctx); - perf_event_set_state(child_event, PERF_EVENT_STATE_EXIT); /* is_event_hup() */ - raw_spin_unlock_irq(&child_ctx->lock); + perf_remove_from_context(event, detach_flags); + + raw_spin_lock_irq(&ctx->lock); + if (event->state > PERF_EVENT_STATE_EXIT) + perf_event_set_state(event, PERF_EVENT_STATE_EXIT); + raw_spin_unlock_irq(&ctx->lock); /* - * Parent events are governed by their filedesc, retain them. + * Child events can be freed. */ - if (!parent_event) { - perf_event_wakeup(child_event); + if (parent_event) { + mutex_unlock(&parent_event->child_mutex); + /* + * Kick perf_poll() for is_event_hup(); + */ + perf_event_wakeup(parent_event); + free_event(event); + put_event(parent_event); return; } - /* - * Child events can be cleaned up. - */ - - sync_child_event(child_event, child); /* - * Remove this event from the parent's list - */ - WARN_ON_ONCE(parent_event->ctx->parent_ctx); - mutex_lock(&parent_event->child_mutex); - list_del_init(&child_event->child_list); - mutex_unlock(&parent_event->child_mutex); - - /* - * Kick perf_poll() for is_event_hup(). + * Parent events are governed by their filedesc, retain them. */ - perf_event_wakeup(parent_event); - free_event(child_event); - put_event(parent_event); + perf_event_wakeup(event); } static void perf_event_exit_task_context(struct task_struct *child, int ctxn) @@ -12505,7 +12520,7 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn) perf_event_task(child, child_ctx, 0); list_for_each_entry_safe(child_event, next, &child_ctx->event_list, event_entry) - perf_event_exit_event(child_event, child_ctx, child); + perf_event_exit_event(child_event, child_ctx); mutex_unlock(&child_ctx->mutex); @@ -12765,6 +12780,7 @@ inherit_event(struct perf_event *parent_event, */ raw_spin_lock_irqsave(&child_ctx->lock, flags); add_event_to_ctx(child_event, child_ctx); + child_event->attach_state |= PERF_ATTACH_CHILD; raw_spin_unlock_irqrestore(&child_ctx->lock, flags); /*