Message ID: 20240731214256.3588718-3-andrii@kernel.org (mailing list archive)
State: Superseded
Series: uprobes: RCU-protected hot path optimizations
On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote: SNIP > static void put_uprobe(struct uprobe *uprobe) > { > - if (refcount_dec_and_test(&uprobe->ref)) { > - /* > - * If application munmap(exec_vma) before uprobe_unregister() > - * gets called, we don't get a chance to remove uprobe from > - * delayed_uprobe_list from remove_breakpoint(). Do it here. > - */ > - mutex_lock(&delayed_uprobe_lock); > - delayed_uprobe_remove(uprobe, NULL); > - mutex_unlock(&delayed_uprobe_lock); > - kfree(uprobe); > - } > + if (!refcount_dec_and_test(&uprobe->ref)) > + return; > + > + write_lock(&uprobes_treelock); > + > + if (uprobe_is_active(uprobe)) > + rb_erase(&uprobe->rb_node, &uprobes_tree); > + > + write_unlock(&uprobes_treelock); > + > + /* > + * If application munmap(exec_vma) before uprobe_unregister() > + * gets called, we don't get a chance to remove uprobe from > + * delayed_uprobe_list from remove_breakpoint(). Do it here. > + */ > + mutex_lock(&delayed_uprobe_lock); > + delayed_uprobe_remove(uprobe, NULL); > + mutex_unlock(&delayed_uprobe_lock); we should do kfree(uprobe) in here, right? I think this is fixed later on when uprobe_free_rcu is introduced SNIP > @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode, > if (!IS_ALIGNED(ref_ctr_offset, sizeof(short))) > return ERR_PTR(-EINVAL); > > - retry: > uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); > if (IS_ERR(uprobe)) > return uprobe; > > - /* > - * We can race with uprobe_unregister()->delete_uprobe(). > - * Check uprobe_is_active() and retry if it is false. > - */ > down_write(&uprobe->register_rwsem); > - ret = -EAGAIN; > - if (likely(uprobe_is_active(uprobe))) { > - consumer_add(uprobe, uc); > - ret = register_for_each_vma(uprobe, uc); > - } > + consumer_add(uprobe, uc); > + ret = register_for_each_vma(uprobe, uc); > up_write(&uprobe->register_rwsem); > - put_uprobe(uprobe); > > if (ret) { > - if (unlikely(ret == -EAGAIN)) > - goto retry; nice, I like getting rid of this.. so far lgtm ;-) jirka > uprobe_unregister(uprobe, uc); > return ERR_PTR(ret); > } > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode, > u = rb_entry(t, struct uprobe, rb_node); > if (u->inode != inode || u->offset < min) > break; > + u = try_get_uprobe(u); > + if (!u) /* uprobe already went away, safe to ignore */ > + continue; > list_add(&u->pending_list, head); > - get_uprobe(u); > } > for (t = n; (t = rb_next(t)); ) { > u = rb_entry(t, struct uprobe, rb_node); > if (u->inode != inode || u->offset > max) > break; > + u = try_get_uprobe(u); > + if (!u) /* uprobe already went away, safe to ignore */ > + continue; > list_add(&u->pending_list, head); > - get_uprobe(u); > } > } > read_unlock(&uprobes_treelock); > @@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask) > return -ENOMEM; > > *n = *o; > + /* > + * uprobe's refcnt has to be positive at this point, kept by > + * utask->return_instances items; return_instances can't be > + * removed right now, as task is blocked due to duping; so > + * get_uprobe() is safe to use here. 
> + */ > get_uprobe(n->uprobe); > n->next = NULL; > > @@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs) > } > orig_ret_vaddr = utask->return_instances->orig_ret_vaddr; > } > - > + /* > + * uprobe's refcnt is positive, held by caller, so it's safe to > + * unconditionally bump it one more time here > + */ > ri->uprobe = get_uprobe(uprobe); > ri->func = instruction_pointer(regs); > ri->stack = user_stack_pointer(regs); > -- > 2.43.0 >
On Thu, Aug 1, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote: > > SNIP > > > static void put_uprobe(struct uprobe *uprobe) > > { > > - if (refcount_dec_and_test(&uprobe->ref)) { > > - /* > > - * If application munmap(exec_vma) before uprobe_unregister() > > - * gets called, we don't get a chance to remove uprobe from > > - * delayed_uprobe_list from remove_breakpoint(). Do it here. > > - */ > > - mutex_lock(&delayed_uprobe_lock); > > - delayed_uprobe_remove(uprobe, NULL); > > - mutex_unlock(&delayed_uprobe_lock); > > - kfree(uprobe); > > - } > > + if (!refcount_dec_and_test(&uprobe->ref)) > > + return; > > + > > + write_lock(&uprobes_treelock); > > + > > + if (uprobe_is_active(uprobe)) > > + rb_erase(&uprobe->rb_node, &uprobes_tree); > > + > > + write_unlock(&uprobes_treelock); > > + > > + /* > > + * If application munmap(exec_vma) before uprobe_unregister() > > + * gets called, we don't get a chance to remove uprobe from > > + * delayed_uprobe_list from remove_breakpoint(). Do it here. > > + */ > > + mutex_lock(&delayed_uprobe_lock); > > + delayed_uprobe_remove(uprobe, NULL); > > + mutex_unlock(&delayed_uprobe_lock); > > we should do kfree(uprobe) in here, right? heh, yep, seems like I lost it while rebasing or something, good catch! fixed. > > I think this is fixed later on when uprobe_free_rcu is introduced > > SNIP > > > @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode, > > if (!IS_ALIGNED(ref_ctr_offset, sizeof(short))) > > return ERR_PTR(-EINVAL); > > > > - retry: > > uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); > > if (IS_ERR(uprobe)) > > return uprobe; > > > > - /* > > - * We can race with uprobe_unregister()->delete_uprobe(). > > - * Check uprobe_is_active() and retry if it is false. > > - */ > > down_write(&uprobe->register_rwsem); > > - ret = -EAGAIN; > > - if (likely(uprobe_is_active(uprobe))) { > > - consumer_add(uprobe, uc); > > - ret = register_for_each_vma(uprobe, uc); > > - } > > + consumer_add(uprobe, uc); > > + ret = register_for_each_vma(uprobe, uc); > > up_write(&uprobe->register_rwsem); > > - put_uprobe(uprobe); > > > > if (ret) { > > - if (unlikely(ret == -EAGAIN)) > > - goto retry; > > nice, I like getting rid of this.. so far lgtm ;-) > > jirka > > > > uprobe_unregister(uprobe, uc); > > return ERR_PTR(ret); > > } > > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode, > > u = rb_entry(t, struct uprobe, rb_node); > > if (u->inode != inode || u->offset < min) > > break; > > + u = try_get_uprobe(u); > > + if (!u) /* uprobe already went away, safe to ignore */ > > + continue; > > list_add(&u->pending_list, head); > > - get_uprobe(u); > > } > > for (t = n; (t = rb_next(t)); ) { > > u = rb_entry(t, struct uprobe, rb_node); > > if (u->inode != inode || u->offset > max) > > break; > > + u = try_get_uprobe(u); > > + if (!u) /* uprobe already went away, safe to ignore */ > > + continue; > > list_add(&u->pending_list, head); > > - get_uprobe(u); > > } > > } > > read_unlock(&uprobes_treelock); > > @@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask) > > return -ENOMEM; > > > > *n = *o; > > + /* > > + * uprobe's refcnt has to be positive at this point, kept by > > + * utask->return_instances items; return_instances can't be > > + * removed right now, as task is blocked due to duping; so > > + * get_uprobe() is safe to use here. 
> > + */ > > get_uprobe(n->uprobe); > > n->next = NULL; > > > > @@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs) > > } > > orig_ret_vaddr = utask->return_instances->orig_ret_vaddr; > > } > > - > > + /* > > + * uprobe's refcnt is positive, held by caller, so it's safe to > > + * unconditionally bump it one more time here > > + */ > > ri->uprobe = get_uprobe(uprobe); > > ri->func = instruction_pointer(regs); > > ri->stack = user_stack_pointer(regs); > > -- > > 2.43.0 > >
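For reference, a minimal sketch of put_uprobe() as posted in this patch, with the kfree() that the review above found missing put back in. This is reconstructed from the hunks quoted in the thread, not the final code of the next revision (which may instead free via the RCU callback mentioned above):

/* Sketch only: put_uprobe() from this patch with the lost kfree() restored. */
static void put_uprobe(struct uprobe *uprobe)
{
	if (!refcount_dec_and_test(&uprobe->ref))
		return;

	write_lock(&uprobes_treelock);

	if (uprobe_is_active(uprobe))
		rb_erase(&uprobe->rb_node, &uprobes_tree);

	write_unlock(&uprobes_treelock);

	/*
	 * If application munmap(exec_vma) before uprobe_unregister()
	 * gets called, we don't get a chance to remove uprobe from
	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
	 */
	mutex_lock(&delayed_uprobe_lock);
	delayed_uprobe_remove(uprobe, NULL);
	mutex_unlock(&delayed_uprobe_lock);

	kfree(uprobe);	/* the line the review found missing */
}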
On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@kernel.org> wrote: > > Revamp how struct uprobe is refcounted, and thus how its lifetime is > managed. > > Right now, there are a few possible "owners" of uprobe refcount: > - uprobes_tree RB tree assumes one refcount when uprobe is registered > and added to the lookup tree; > - while uprobe is triggered and kernel is handling it in the breakpoint > handler code, temporary refcount bump is done to keep uprobe from > being freed; > - if we have uretprobe requested on a given struct uprobe instance, we > take another refcount to keep uprobe alive until user space code > returns from the function and triggers return handler. > > The uprobe_tree's extra refcount of 1 is confusing and problematic. No > matter how many actual consumers are attached, they all share the same > refcount, and we have an extra logic to drop the "last" (which might not > really be last) refcount once uprobe's consumer list becomes empty. > > This is unconventional and has to be kept in mind as a special case all > the time. Further, because of this design we have the situations where > find_uprobe() will find uprobe, bump refcount, return it to the caller, > but that uprobe will still need uprobe_is_active() check, after which > the caller is required to drop refcount and try again. This is just too > many details leaking to the higher level logic. > > This patch changes refcounting scheme in such a way as to not have > uprobes_tree keeping extra refcount for struct uprobe. Instead, each > uprobe_consumer is assuming its own refcount, which will be dropped > when consumer is unregistered. Other than that, all the active users of > uprobe (entry and return uprobe handling code) keeps exactly the same > refcounting approach. > > With the above setup, once uprobe's refcount drops to zero, we need to > make sure that uprobe's "destructor" removes uprobe from uprobes_tree, > of course. This, though, races with uprobe entry handling code in > handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup, > can race with uprobe being destroyed after refcount drops to zero (e.g., > due to uprobe_consumer unregistering). So we add try_get_uprobe(), which > will attempt to bump refcount, unless it already is zero. Caller needs > to guarantee that uprobe instance won't be freed in parallel, which is > the case while we keep uprobes_treelock (for read or write, doesn't > matter). > > Note also, we now don't leak the race between registration and > unregistration, so we remove the retry logic completely. If > find_uprobe() returns valid uprobe, it's guaranteed to remain in > uprobes_tree with properly incremented refcount. The race is handled > inside __insert_uprobe() and put_uprobe() working together: > __insert_uprobe() will remove uprobe from RB-tree, if it can't bump > refcount and will retry to insert the new uprobe instance. put_uprobe() > won't attempt to remove uprobe from RB-tree, if it's already not there. > All that is protected by uprobes_treelock, which keeps things simple. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > kernel/events/uprobes.c | 163 +++++++++++++++++++++++----------------- > 1 file changed, 93 insertions(+), 70 deletions(-) > [...] 
> @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) > int err; > > down_write(&uprobe->register_rwsem); > - if (WARN_ON(!consumer_del(uprobe, uc))) > + if (WARN_ON(!consumer_del(uprobe, uc))) { > err = -ENOENT; > - else > + } else { > err = register_for_each_vma(uprobe, NULL); > - > - /* TODO : cant unregister? schedule a worker thread */ > - if (!err) { > - if (!uprobe->consumers) > - delete_uprobe(uprobe); > - else > - err = -EBUSY; > + /* TODO : cant unregister? schedule a worker thread */ > + WARN(err, "leaking uprobe due to failed unregistration"); Ok, so now that I added this very loud warning if register_for_each_vma(uprobe, NULL) returns error, it turns out it's not that unusual for this unregistration to fail. If I run my uprobe-stress for just a little bit, and then terminate it with ^C, I get this splat: [ 1980.854229] leaking uprobe due to failed unregistration [ 1980.854244] WARNING: CPU: 3 PID: 23013 at kernel/events/uprobes.c:1123 uprobe_unregister_nosync+0x68/0x80 [ 1980.855356] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) floppy(E) i2c_piix4(E) i2c_] [ 1980.856746] CPU: 3 UID: 0 PID: 23013 Comm: exe Tainted: G W OE 6.11.0-rc1-00032-g308d1f294b79 #129 [ 1980.857407] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE [ 1980.857788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04 [ 1980.858489] RIP: 0010:uprobe_unregister_nosync+0x68/0x80 [ 1980.858826] Code: 6e fb ff ff 4c 89 e7 89 c3 e8 24 e8 e3 ff 85 db 75 0c 5b 48 89 ef 5d 41 5c e9 84 e5 ff ff 48 c7 c7 d0 [ 1980.860052] RSP: 0018:ffffc90002fb7e58 EFLAGS: 00010296 [ 1980.860428] RAX: 000000000000002b RBX: 00000000fffffffc RCX: 0000000000000000 [ 1980.860913] RDX: 0000000000000002 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1980.861379] RBP: ffff88811159ac00 R08: 00000000fffeffff R09: 0000000000000001 [ 1980.861871] R10: 0000000000000000 R11: ffffffff83299920 R12: ffff88811159ac20 [ 1980.862340] R13: ffff88810153c7a0 R14: ffff88810c3fe000 R15: 0000000000000000 [ 1980.862830] FS: 0000000000000000(0000) GS:ffff88881ca00000(0000) knlGS:0000000000000000 [ 1980.863370] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1980.863758] CR2: 00007fa08aea8276 CR3: 000000010f59c005 CR4: 0000000000370ef0 [ 1980.864239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1980.864708] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1980.865202] Call Trace: [ 1980.865356] <TASK> [ 1980.865524] ? __warn+0x80/0x180 [ 1980.865745] ? uprobe_unregister_nosync+0x68/0x80 [ 1980.866074] ? report_bug+0x18d/0x1c0 [ 1980.866326] ? handle_bug+0x3a/0x70 [ 1980.866568] ? exc_invalid_op+0x13/0x60 [ 1980.866836] ? asm_exc_invalid_op+0x16/0x20 [ 1980.867098] ? uprobe_unregister_nosync+0x68/0x80 [ 1980.867390] ? uprobe_unregister_nosync+0x68/0x80 [ 1980.867726] bpf_uprobe_multi_link_release+0x31/0xd0 [ 1980.868044] bpf_link_free+0x54/0xd0 [ 1980.868267] bpf_link_release+0x17/0x20 [ 1980.868542] __fput+0x102/0x2e0 [ 1980.868760] task_work_run+0x55/0xa0 [ 1980.869027] syscall_exit_to_user_mode+0x1dd/0x1f0 [ 1980.869344] do_syscall_64+0x70/0x140 [ 1980.869603] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 1980.869923] RIP: 0033:0x7fa08aea82a0 [ 1980.870171] Code: Unable to access opcode bytes at 0x7fa08aea8276. 
[ 1980.870587] RSP: 002b:00007ffe838cd030 EFLAGS: 00000202 ORIG_RAX: 000000000000003b [ 1980.871098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 1980.871563] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1980.872055] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 1980.872526] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 1980.873044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 1980.873568] </TASK> I traced it a little bit with retsnoop to figure out where this is coming from, and here we go: 14:53:18.897165 -> 14:53:18.897171 TID/PID 23013/23013 (exe/exe): FUNCTION CALLS RESULT DURATION --------------------------- -------- -------- → uprobe_write_opcode → get_user_pages_remote ↔ __get_user_pages [-EINTR] 1.382us ← get_user_pages_remote [-EINTR] 4.889us ← uprobe_write_opcode [-EINTR] 6.908us entry_SYSCALL_64_after_hwframe+0x76 (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130) do_syscall_64+0x70 (arch/x86/entry/common.c:102) syscall_exit_to_user_mode+0x1dd (kernel/entry/common.c:218) . __syscall_exit_to_user_mode_work (kernel/entry/common.c:207) . exit_to_user_mode_prepare (include/linux/entry-common.h:328) . exit_to_user_mode_loop (kernel/entry/common.c:114) . resume_user_mode_work (include/linux/resume_user_mode.h:50) task_work_run+0x55 (kernel/task_work.c:222) __fput+0x102 (fs/file_table.c:422) bpf_link_release+0x17 (kernel/bpf/syscall.c:3116) bpf_link_free+0x54 (kernel/bpf/syscall.c:3067) bpf_uprobe_multi_link_release+0x31 (kernel/trace/bpf_trace.c:3198) . bpf_uprobe_unregister (kernel/trace/bpf_trace.c:3186) uprobe_unregister_nosync+0x42 (kernel/events/uprobes.c:1120) register_for_each_vma+0x427 (kernel/events/uprobes.c:1092) 6us [-EINTR] uprobe_write_opcode+0x79 (kernel/events/uprobes.c:478) . get_user_page_vma_remote (include/linux/mm.h:2489) 4us [-EINTR] get_user_pages_remote+0x109 (mm/gup.c:2627) . __get_user_pages_locked (mm/gup.c:1762) ! 1us [-EINTR] __get_user_pages So, we do uprobe_unregister -> register_for_each_vma -> remove_breakpoint -> set_orig_insn -> uprobe_write_opcode -> get_user_page_vma_remote -> get_user_pages_remote -> __get_user_pages_locked -> __get_user_pages and I think we then hit `if (fatal_signal_pending(current)) return -EINTR;` check. So, is there something smarter we can do in this case besides leaking an uprobe (and note, my changes don't change this behavior)? I can of course just drop the WARN given it's sort of expected now, but if we can handle that more gracefully it would be good. I don't think that should block optimization work, but just something to keep in mind and maybe fix as a follow up. > } > up_write(&uprobe->register_rwsem); > [...]
On 08/01, Andrii Nakryiko wrote:
>
> > + /* TODO : cant unregister? schedule a worker thread */
> > + WARN(err, "leaking uprobe due to failed unregistration");
> Ok, so now that I added this very loud warning if
> register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> not that unusual for this unregistration to fail.

...

> So, is there something smarter we can do in this case besides leaking
> an uprobe (and note, my changes don't change this behavior)?

Something like schedule_work() which retries register_for_each_vma()...

> I can of course just drop the WARN given it's sort of expected now,

Or least replace it with pr_warn() or uprobe_warn(), WARN() certainly
makes no sense imo...

> I don't
> think that should block optimization work, but just something to keep
> in mind and maybe fix as a follow up.

Agreed, lets do this separately.

Oleg.
On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:

SNIP

> -/*
> - * There could be threads that have already hit the breakpoint. They
> - * will recheck the current insn and restart if find_uprobe() fails.
> - * See find_active_uprobe().
> - */
> -static void delete_uprobe(struct uprobe *uprobe)
> -{
> -	if (WARN_ON(!uprobe_is_active(uprobe)))
> -		return;
> -
> -	write_lock(&uprobes_treelock);
> -	rb_erase(&uprobe->rb_node, &uprobes_tree);
> -	write_unlock(&uprobes_treelock);
> -	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> -}
> -
>  struct map_info {
>  	struct map_info *next;
>  	struct mm_struct *mm;
> @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>  	int err;
>
>  	down_write(&uprobe->register_rwsem);
> -	if (WARN_ON(!consumer_del(uprobe, uc)))
> +	if (WARN_ON(!consumer_del(uprobe, uc))) {
>  		err = -ENOENT;
> -	else
> +	} else {
>  		err = register_for_each_vma(uprobe, NULL);
> -
> -	/* TODO : cant unregister? schedule a worker thread */
> -	if (!err) {
> -		if (!uprobe->consumers)
> -			delete_uprobe(uprobe);

ok, so removing this call is why the consumer test is failing, right?

IIUC the previous behaviour was to remove uprobe from the tree
even when there's active uprobe ref for installed uretprobe

so following scenario will now behaves differently:

  install uretprobe/consumer-1 on foo
  foo {
    remove uretprobe/consumer-1            (A)
    install uretprobe/consumer-2 on foo    (B)
  }

before the removal of consumer-1 (A) would remove the uprobe object
from the tree, so the installation of consumer-2 (b) would create
new uprobe object which would not be triggered at foo return because
it got installed too late (after foo uprobe was triggered)

the behaviour with this patch is that removal of consumer-1 (A) will
not remove the uprobe object (that happens only when we run out of
refs), and the following install of consumer-2 will use the existing
uprobe object so the consumer-2 will be triggered on foo return

uff ;-) but I think it's better, because we get more hits

jirka

> -		else
> -			err = -EBUSY;
> +		/* TODO : cant unregister? schedule a worker thread */
> +		WARN(err, "leaking uprobe due to failed unregistration");
>  	}
>  	up_write(&uprobe->register_rwsem);
>
> @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
>  	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
>  		return ERR_PTR(-EINVAL);
>
> - retry:
>  	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
>  	if (IS_ERR(uprobe))
>  		return uprobe;
>
> -	/*
> -	 * We can race with uprobe_unregister()->delete_uprobe().
> -	 * Check uprobe_is_active() and retry if it is false.
> -	 */
>  	down_write(&uprobe->register_rwsem);
> -	ret = -EAGAIN;
> -	if (likely(uprobe_is_active(uprobe))) {
> -		consumer_add(uprobe, uc);
> -		ret = register_for_each_vma(uprobe, uc);
> -	}
> +	consumer_add(uprobe, uc);
> +	ret = register_for_each_vma(uprobe, uc);
>  	up_write(&uprobe->register_rwsem);
> -	put_uprobe(uprobe);
>
>  	if (ret) {
> -		if (unlikely(ret == -EAGAIN))
> -			goto retry;
>  		uprobe_unregister(uprobe, uc);
>  		return ERR_PTR(ret);

SNIP
On Fri, Aug 2, 2024 at 1:50 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/01, Andrii Nakryiko wrote:
> >
> > > + /* TODO : cant unregister? schedule a worker thread */
> > > + WARN(err, "leaking uprobe due to failed unregistration");
>
> > Ok, so now that I added this very loud warning if
> > register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> > not that unusual for this unregistration to fail.
>
> ...
>
> > So, is there something smarter we can do in this case besides leaking
> > an uprobe (and note, my changes don't change this behavior)?
>
> Something like schedule_work() which retries register_for_each_vma()...

And if that fails again, what do we do? Because I don't think we even
need schedule_work(), we can just keep some list of "pending to be
retried" items and check them after each
uprobe_register()/uprobe_unregister() call. I'm just not clear how we
should handle stubborn cases (but honestly I haven't even tried to
understand all the details about this just yet).

> > I can of course just drop the WARN given it's sort of expected now,
>
> Or least replace it with pr_warn() or uprobe_warn(), WARN() certainly
> makes no sense imo...

ok, makes sense, will change to uprobe_warn()

> > I don't
> > think that should block optimization work, but just something to keep
> > in mind and maybe fix as a follow up.
>
> Agreed, lets do this separately.

yep

> Oleg.
>
On Fri, Aug 2, 2024 at 4:11 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote: > > SNIP > > > -/* > > - * There could be threads that have already hit the breakpoint. They > > - * will recheck the current insn and restart if find_uprobe() fails. > > - * See find_active_uprobe(). > > - */ > > -static void delete_uprobe(struct uprobe *uprobe) > > -{ > > - if (WARN_ON(!uprobe_is_active(uprobe))) > > - return; > > - > > - write_lock(&uprobes_treelock); > > - rb_erase(&uprobe->rb_node, &uprobes_tree); > > - write_unlock(&uprobes_treelock); > > - RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > -} > > - > > struct map_info { > > struct map_info *next; > > struct mm_struct *mm; > > @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) > > int err; > > > > down_write(&uprobe->register_rwsem); > > - if (WARN_ON(!consumer_del(uprobe, uc))) > > + if (WARN_ON(!consumer_del(uprobe, uc))) { > > err = -ENOENT; > > - else > > + } else { > > err = register_for_each_vma(uprobe, NULL); > > - > > - /* TODO : cant unregister? schedule a worker thread */ > > - if (!err) { > > - if (!uprobe->consumers) > > - delete_uprobe(uprobe); > > ok, so removing this call is why the consumer test is failing, right? > > IIUC the previous behaviour was to remove uprobe from the tree > even when there's active uprobe ref for installed uretprobe > > so following scenario will now behaves differently: > > install uretprobe/consumer-1 on foo > foo { > remove uretprobe/consumer-1 (A) > install uretprobe/consumer-2 on foo (B) > } > > before the removal of consumer-1 (A) would remove the uprobe object > from the tree, so the installation of consumer-2 (b) would create > new uprobe object which would not be triggered at foo return because > it got installed too late (after foo uprobe was triggered) > > the behaviour with this patch is that removal of consumer-1 (A) will > not remove the uprobe object (that happens only when we run out of > refs), and the following install of consumer-2 will use the existing > uprobe object so the consumer-2 will be triggered on foo return > > uff ;-) yep, something like that > > but I think it's better, because we get more hits note, with the next patch set that makes uretprobes SRCU protected (but with timeout) the behavior becomes a bit time-sensitive :) so I think we'll have to change your selftest to first attach all the new uretprobes, then detach all the uretprobes that had to be detached, and then do a bit more relaxed logic of the sort "if there were some uretprobes before and after, then we *might* get uretprobe triggered (but we might not as well, unless the same uretprobe stayed attached at all times)". Anyways, something to take care of in the bpf-next tree separately. All this is very much an implementation detail, so I think we can change these aspects freely. > > jirka > > > - else > > - err = -EBUSY; > > + /* TODO : cant unregister? schedule a worker thread */ > > + WARN(err, "leaking uprobe due to failed unregistration"); > > } > > up_write(&uprobe->register_rwsem); > > > > @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode, > > if (!IS_ALIGNED(ref_ctr_offset, sizeof(short))) > > return ERR_PTR(-EINVAL); > > > > - retry: > > uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); > > if (IS_ERR(uprobe)) > > return uprobe; > > > > - /* > > - * We can race with uprobe_unregister()->delete_uprobe(). 
> > - * Check uprobe_is_active() and retry if it is false. > > - */ > > down_write(&uprobe->register_rwsem); > > - ret = -EAGAIN; > > - if (likely(uprobe_is_active(uprobe))) { > > - consumer_add(uprobe, uc); > > - ret = register_for_each_vma(uprobe, uc); > > - } > > + consumer_add(uprobe, uc); > > + ret = register_for_each_vma(uprobe, uc); > > up_write(&uprobe->register_rwsem); > > - put_uprobe(uprobe); > > > > if (ret) { > > - if (unlikely(ret == -EAGAIN)) > > - goto retry; > > uprobe_unregister(uprobe, uc); > > return ERR_PTR(ret); > > SNIP
On 08/02, Andrii Nakryiko wrote:
>
> On Fri, Aug 2, 2024 at 1:50 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 08/01, Andrii Nakryiko wrote:
> > >
> > > > + /* TODO : cant unregister? schedule a worker thread */
> > > > + WARN(err, "leaking uprobe due to failed unregistration");
> >
> > > Ok, so now that I added this very loud warning if
> > > register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> > > not that unusual for this unregistration to fail.
> >
> > ...
> >
> > > So, is there something smarter we can do in this case besides leaking
> > > an uprobe (and note, my changes don't change this behavior)?
> >
> > Something like schedule_work() which retries register_for_each_vma()...
>
> And if that fails again, what do we do?

try again after some timeout ;)

> Because I don't think we even
> need schedule_work(), we can just keep some list of "pending to be
> retried" items and check them after each
> uprobe_register()/uprobe_unregister() call.

Agreed, we need a list of "pending to be retried", but rightly or not
I think this should be done from work_struct->func.

Lets discuss this later? We seem to agree this is a known problem, and
this has nothing to do with your optimizations.

> I'm just not clear how we
> should handle stubborn cases (but honestly I haven't even tried to
> understand all the details about this just yet).

Same here.

Oleg.
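To make the deferred-retry idea being discussed above concrete, here is a rough sketch of what a "pending to be retried" list driven from a work item could look like, written as if it lived in kernel/events/uprobes.c. This is not code from this series: the uprobe_retry_* names are hypothetical, reusing uprobe->pending_list for queueing is an assumption, and only register_for_each_vma(), put_uprobe() and register_rwsem are taken from the patch; the idea is that a failed uprobe_unregister() would hand its reference off to this machinery instead of leaking it.

/* Hypothetical sketch of retrying failed unregistration from a worker. */
static LIST_HEAD(uprobe_retry_list);
static DEFINE_SPINLOCK(uprobe_retry_lock);
static void uprobe_retry_func(struct work_struct *work);
static DECLARE_DELAYED_WORK(uprobe_retry_work, uprobe_retry_func);

static void uprobe_defer_unregister(struct uprobe *uprobe)
{
	/* caller transfers its reference to the retry list */
	spin_lock(&uprobe_retry_lock);
	list_add_tail(&uprobe->pending_list, &uprobe_retry_list);
	spin_unlock(&uprobe_retry_lock);
	schedule_delayed_work(&uprobe_retry_work, HZ);
}

static void uprobe_retry_func(struct work_struct *work)
{
	struct uprobe *uprobe, *tmp;
	LIST_HEAD(retry);

	spin_lock(&uprobe_retry_lock);
	list_splice_init(&uprobe_retry_list, &retry);
	spin_unlock(&uprobe_retry_lock);

	list_for_each_entry_safe(uprobe, tmp, &retry, pending_list) {
		int err;

		list_del_init(&uprobe->pending_list);

		down_write(&uprobe->register_rwsem);
		err = register_for_each_vma(uprobe, NULL);
		up_write(&uprobe->register_rwsem);

		if (err)
			uprobe_defer_unregister(uprobe);	/* still failing, try again later */
		else
			put_uprobe(uprobe);			/* done, drop the handed-off ref */
	}
}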
On 07/31, Andrii Nakryiko wrote:
>
> @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
>  	uprobe->ref_ctr_offset = ref_ctr_offset;
>  	init_rwsem(&uprobe->register_rwsem);
>  	init_rwsem(&uprobe->consumer_rwsem);
> +	RB_CLEAR_NODE(&uprobe->rb_node);

I guess RB_CLEAR_NODE() is not necessary?

> @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
>  		u = rb_entry(t, struct uprobe, rb_node);
>  		if (u->inode != inode || u->offset < min)
>  			break;
> +		u = try_get_uprobe(u);
> +		if (!u) /* uprobe already went away, safe to ignore */
> +			continue;
>  		list_add(&u->pending_list, head);

cosmetic nit, feel to ignore, but to me

	if (try_get_uprobe(u))
		list_add(&u->pending_list, head);

looks more readable.

Other than the lack of kfree() in put_uprobe() and WARN() in _unregister()
the patch looks good to me.

Oleg.
On Mon, Aug 5, 2024 at 6:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> >  	uprobe->ref_ctr_offset = ref_ctr_offset;
> >  	init_rwsem(&uprobe->register_rwsem);
> >  	init_rwsem(&uprobe->consumer_rwsem);
> > +	RB_CLEAR_NODE(&uprobe->rb_node);
>
> I guess RB_CLEAR_NODE() is not necessary?

I definitely needed that with my batch API changes, but it might be
that I don't need it anymore. But I'm a bit hesitant to remove it,
because if we ever get put_uprobe() on an uprobe that hasn't been
inserted into RB-tree yet, this will cause a hard to understand crash.

RB_CLEAR_NODE() in __insert_uprobe() is critical to have, this one is
kind of optional (but still feels right to initialize the field
properly). Let me know if you feel strongly about this, though.

> > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> >  		u = rb_entry(t, struct uprobe, rb_node);
> >  		if (u->inode != inode || u->offset < min)
> >  			break;
> > +		u = try_get_uprobe(u);
> > +		if (!u) /* uprobe already went away, safe to ignore */
> > +			continue;
> >  		list_add(&u->pending_list, head);
>
> cosmetic nit, feel to ignore, but to me
>
> 	if (try_get_uprobe(u))
> 		list_add(&u->pending_list, head);
>
> looks more readable.

It's not my code base to enforce my preferences, but I'll at least
explain why I disagree. To me, something like `if (some condition)
<break/continue>;` is a very clear indication that this item (or even
the rest of items in case of break) won't be processed anymore.

While

  if (some inverted condition)
      <do some something useful>

  <might be some more code>
  ...

is a pattern that requires double-checking that we really are not
going to use that item later on.

So I'll invert this just to not be PITA, but I disagree :)

> Other than the lack of kfree() in put_uprobe() and WARN() in _unregister()
> the patch looks good to me.

yep, fixed that locally already. Thanks for the review!

> Oleg.
>
On 08/05, Andrii Nakryiko wrote:
>
> On Mon, Aug 5, 2024 at 6:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 07/31, Andrii Nakryiko wrote:
> > >
> > > @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > >  	uprobe->ref_ctr_offset = ref_ctr_offset;
> > >  	init_rwsem(&uprobe->register_rwsem);
> > >  	init_rwsem(&uprobe->consumer_rwsem);
> > > +	RB_CLEAR_NODE(&uprobe->rb_node);
> >
> > I guess RB_CLEAR_NODE() is not necessary?
>
> I definitely needed that with my batch API changes, but it might be
> that I don't need it anymore. But I'm a bit hesitant to remove it,

OK, lets keep it, it doesn't hurt. Just it wasn't clear to me why did
you add this initialization in this patch.

> > > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> > >  		u = rb_entry(t, struct uprobe, rb_node);
> > >  		if (u->inode != inode || u->offset < min)
> > >  			break;
> > > +		u = try_get_uprobe(u);
> > > +		if (!u) /* uprobe already went away, safe to ignore */
> > > +			continue;
> > >  		list_add(&u->pending_list, head);
> >
> > cosmetic nit, feel to ignore, but to me
> >
> > 	if (try_get_uprobe(u))
> > 		list_add(&u->pending_list, head);
> >
> > looks more readable.
>
> It's not my code base to enforce my preferences, but I'll at least
> explain why I disagree. To me, something like `if (some condition)
> <break/continue>;` is a very clear indication that this item (or even
> the rest of items in case of break) won't be processed anymore.
>
> While
>
>   if (some inverted condition)
>       <do some something useful>
>
>   <might be some more code>

OK, I won't insist. To me the most confusing part is

	u = try_get_uprobe(u);
	if (!u)
		...

If you read this code for the 1st time (or you are trying to recall it
after 10 years ;) it looks as if try_get_uprobe() can return another
uprobe.

> So I'll invert this just to not be PITA, but I disagree :)

If you disagree, then don't change it ;)

Oleg.
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index f88b7ff20587..23dde3ec5b09 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -587,25 +587,51 @@ set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v *(uprobe_opcode_t *)&auprobe->insn); } +/* uprobe should have guaranteed positive refcount */ static struct uprobe *get_uprobe(struct uprobe *uprobe) { refcount_inc(&uprobe->ref); return uprobe; } +/* + * uprobe should have guaranteed lifetime, which can be either of: + * - caller already has refcount taken (and wants an extra one); + * - uprobe is RCU protected and won't be freed until after grace period; + * - we are holding uprobes_treelock (for read or write, doesn't matter). + */ +static struct uprobe *try_get_uprobe(struct uprobe *uprobe) +{ + if (refcount_inc_not_zero(&uprobe->ref)) + return uprobe; + return NULL; +} + +static inline bool uprobe_is_active(struct uprobe *uprobe) +{ + return !RB_EMPTY_NODE(&uprobe->rb_node); +} + static void put_uprobe(struct uprobe *uprobe) { - if (refcount_dec_and_test(&uprobe->ref)) { - /* - * If application munmap(exec_vma) before uprobe_unregister() - * gets called, we don't get a chance to remove uprobe from - * delayed_uprobe_list from remove_breakpoint(). Do it here. - */ - mutex_lock(&delayed_uprobe_lock); - delayed_uprobe_remove(uprobe, NULL); - mutex_unlock(&delayed_uprobe_lock); - kfree(uprobe); - } + if (!refcount_dec_and_test(&uprobe->ref)) + return; + + write_lock(&uprobes_treelock); + + if (uprobe_is_active(uprobe)) + rb_erase(&uprobe->rb_node, &uprobes_tree); + + write_unlock(&uprobes_treelock); + + /* + * If application munmap(exec_vma) before uprobe_unregister() + * gets called, we don't get a chance to remove uprobe from + * delayed_uprobe_list from remove_breakpoint(). Do it here. + */ + mutex_lock(&delayed_uprobe_lock); + delayed_uprobe_remove(uprobe, NULL); + mutex_unlock(&delayed_uprobe_lock); } static __always_inline @@ -656,7 +682,7 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset) struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key); if (node) - return get_uprobe(__node_2_uprobe(node)); + return try_get_uprobe(__node_2_uprobe(node)); return NULL; } @@ -676,26 +702,44 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) return uprobe; } +/* + * Attempt to insert a new uprobe into uprobes_tree. + * + * If uprobe already exists (for given inode+offset), we just increment + * refcount of previously existing uprobe. + * + * If not, a provided new instance of uprobe is inserted into the tree (with + * assumed initial refcount == 1). + * + * In any case, we return a uprobe instance that ends up being in uprobes_tree. + * Caller has to clean up new uprobe instance, if it ended up not being + * inserted into the tree. + * + * We assume that uprobes_treelock is held for writing. + */ static struct uprobe *__insert_uprobe(struct uprobe *uprobe) { struct rb_node *node; - +again: node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp); - if (node) - return get_uprobe(__node_2_uprobe(node)); + if (node) { + struct uprobe *u = __node_2_uprobe(node); - /* get access + creation ref */ - refcount_set(&uprobe->ref, 2); - return NULL; + if (!try_get_uprobe(u)) { + rb_erase(node, &uprobes_tree); + RB_CLEAR_NODE(&u->rb_node); + goto again; + } + + return u; + } + + return uprobe; } /* - * Acquire uprobes_treelock. 
- * Matching uprobe already exists in rbtree; - * increment (access refcount) and return the matching uprobe. - * - * No matching uprobe; insert the uprobe in rb_tree; - * get a double refcount (access + creation) and return NULL. + * Acquire uprobes_treelock and insert uprobe into uprobes_tree + * (or reuse existing one, see __insert_uprobe() comments above). */ static struct uprobe *insert_uprobe(struct uprobe *uprobe) { @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset, uprobe->ref_ctr_offset = ref_ctr_offset; init_rwsem(&uprobe->register_rwsem); init_rwsem(&uprobe->consumer_rwsem); + RB_CLEAR_NODE(&uprobe->rb_node); + refcount_set(&uprobe->ref, 1); /* add to uprobes_tree, sorted on inode:offset */ cur_uprobe = insert_uprobe(uprobe); /* a uprobe exists for this inode:offset combination */ - if (cur_uprobe) { + if (cur_uprobe != uprobe) { if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) { ref_ctr_mismatch_warn(cur_uprobe, uprobe); put_uprobe(cur_uprobe); @@ -921,26 +967,6 @@ remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad return set_orig_insn(&uprobe->arch, mm, vaddr); } -static inline bool uprobe_is_active(struct uprobe *uprobe) -{ - return !RB_EMPTY_NODE(&uprobe->rb_node); -} -/* - * There could be threads that have already hit the breakpoint. They - * will recheck the current insn and restart if find_uprobe() fails. - * See find_active_uprobe(). - */ -static void delete_uprobe(struct uprobe *uprobe) -{ - if (WARN_ON(!uprobe_is_active(uprobe))) - return; - - write_lock(&uprobes_treelock); - rb_erase(&uprobe->rb_node, &uprobes_tree); - write_unlock(&uprobes_treelock); - RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ -} - struct map_info { struct map_info *next; struct mm_struct *mm; @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc) int err; down_write(&uprobe->register_rwsem); - if (WARN_ON(!consumer_del(uprobe, uc))) + if (WARN_ON(!consumer_del(uprobe, uc))) { err = -ENOENT; - else + } else { err = register_for_each_vma(uprobe, NULL); - - /* TODO : cant unregister? schedule a worker thread */ - if (!err) { - if (!uprobe->consumers) - delete_uprobe(uprobe); - else - err = -EBUSY; + /* TODO : cant unregister? schedule a worker thread */ + WARN(err, "leaking uprobe due to failed unregistration"); } up_write(&uprobe->register_rwsem); @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode, if (!IS_ALIGNED(ref_ctr_offset, sizeof(short))) return ERR_PTR(-EINVAL); - retry: uprobe = alloc_uprobe(inode, offset, ref_ctr_offset); if (IS_ERR(uprobe)) return uprobe; - /* - * We can race with uprobe_unregister()->delete_uprobe(). - * Check uprobe_is_active() and retry if it is false. 
- */ down_write(&uprobe->register_rwsem); - ret = -EAGAIN; - if (likely(uprobe_is_active(uprobe))) { - consumer_add(uprobe, uc); - ret = register_for_each_vma(uprobe, uc); - } + consumer_add(uprobe, uc); + ret = register_for_each_vma(uprobe, uc); up_write(&uprobe->register_rwsem); - put_uprobe(uprobe); if (ret) { - if (unlikely(ret == -EAGAIN)) - goto retry; uprobe_unregister(uprobe, uc); return ERR_PTR(ret); } @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode, u = rb_entry(t, struct uprobe, rb_node); if (u->inode != inode || u->offset < min) break; + u = try_get_uprobe(u); + if (!u) /* uprobe already went away, safe to ignore */ + continue; list_add(&u->pending_list, head); - get_uprobe(u); } for (t = n; (t = rb_next(t)); ) { u = rb_entry(t, struct uprobe, rb_node); if (u->inode != inode || u->offset > max) break; + u = try_get_uprobe(u); + if (!u) /* uprobe already went away, safe to ignore */ + continue; list_add(&u->pending_list, head); - get_uprobe(u); } } read_unlock(&uprobes_treelock); @@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask) return -ENOMEM; *n = *o; + /* + * uprobe's refcnt has to be positive at this point, kept by + * utask->return_instances items; return_instances can't be + * removed right now, as task is blocked due to duping; so + * get_uprobe() is safe to use here. + */ get_uprobe(n->uprobe); n->next = NULL; @@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs) } orig_ret_vaddr = utask->return_instances->orig_ret_vaddr; } - + /* + * uprobe's refcnt is positive, held by caller, so it's safe to + * unconditionally bump it one more time here + */ ri->uprobe = get_uprobe(uprobe); ri->func = instruction_pointer(regs); ri->stack = user_stack_pointer(regs);
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:

- uprobes_tree RB tree assumes one refcount when uprobe is registered
  and added to the lookup tree;
- while uprobe is triggered and kernel is handling it in the breakpoint
  handler code, temporary refcount bump is done to keep uprobe from
  being freed;
- if we have uretprobe requested on a given struct uprobe instance, we
  take another refcount to keep uprobe alive until user space code
  returns from the function and triggers return handler.

The uprobe_tree's extra refcount of 1 is confusing and problematic. No
matter how many actual consumers are attached, they all share the same
refcount, and we have an extra logic to drop the "last" (which might not
really be last) refcount once uprobe's consumer list becomes empty.

This is unconventional and has to be kept in mind as a special case all
the time. Further, because of this design we have the situations where
find_uprobe() will find uprobe, bump refcount, return it to the caller,
but that uprobe will still need uprobe_is_active() check, after which
the caller is required to drop refcount and try again. This is just too
many details leaking to the higher level logic.

This patch changes refcounting scheme in such a way as to not have
uprobes_tree keeping extra refcount for struct uprobe. Instead, each
uprobe_consumer is assuming its own refcount, which will be dropped
when consumer is unregistered. Other than that, all the active users of
uprobe (entry and return uprobe handling code) keeps exactly the same
refcounting approach.

With the above setup, once uprobe's refcount drops to zero, we need to
make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
of course. This, though, races with uprobe entry handling code in
handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup,
can race with uprobe being destroyed after refcount drops to zero (e.g.,
due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
will attempt to bump refcount, unless it already is zero. Caller needs
to guarantee that uprobe instance won't be freed in parallel, which is
the case while we keep uprobes_treelock (for read or write, doesn't
matter).

Note also, we now don't leak the race between registration and
unregistration, so we remove the retry logic completely. If
find_uprobe() returns valid uprobe, it's guaranteed to remain in
uprobes_tree with properly incremented refcount. The race is handled
inside __insert_uprobe() and put_uprobe() working together:
__insert_uprobe() will remove uprobe from RB-tree, if it can't bump
refcount and will retry to insert the new uprobe instance. put_uprobe()
won't attempt to remove uprobe from RB-tree, if it's already not there.
All that is protected by uprobes_treelock, which keeps things simple.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 163 +++++++++++++++++++++++-----------------
 1 file changed, 93 insertions(+), 70 deletions(-)
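To illustrate the consumer-owned refcount model described in the commit message, a hypothetical in-kernel user would pair register/unregister per consumer roughly as below. The uprobe_register()/uprobe_unregister() signatures are assumed from this series (register returns the refcounted uprobe and takes the consumer), and my_handler/my_consumer/my_attach/my_detach are placeholder names, not anything from the patch:

/* Hypothetical consumer wiring under the new scheme: each consumer owns
 * one reference to the uprobe, taken by uprobe_register() and dropped by
 * uprobe_unregister(); there is no shared "tree" refcount to reason about. */
static int my_handler(struct uprobe_consumer *uc, struct pt_regs *regs)
{
	return 0;	/* 0 == keep the probe installed */
}

static struct uprobe_consumer my_consumer = {
	.handler = my_handler,
};

static struct uprobe *my_uprobe;

static int my_attach(struct inode *inode, loff_t offset)
{
	/* signature assumed from this series: returns the refcounted uprobe */
	my_uprobe = uprobe_register(inode, offset, 0 /* no ref_ctr */, &my_consumer);
	if (IS_ERR(my_uprobe))
		return PTR_ERR(my_uprobe);
	return 0;
}

static void my_detach(void)
{
	/* drops this consumer's reference; the uprobe goes away once the
	 * last consumer and any in-flight handlers are done with it */
	uprobe_unregister(my_uprobe, &my_consumer);
}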