
[2/8] uprobes: revamp uprobe refcounting and lifetime management

Message ID 20240731214256.3588718-3-andrii@kernel.org (mailing list archive)
State Superseded
Delegated to: Masami Hiramatsu
Series: uprobes: RCU-protected hot path optimizations

Commit Message

Andrii Nakryiko July 31, 2024, 9:42 p.m. UTC
Revamp how struct uprobe is refcounted, and thus how its lifetime is
managed.

Right now, there are a few possible "owners" of uprobe refcount:
  - uprobes_tree RB tree assumes one refcount when uprobe is registered
    and added to the lookup tree;
  - while uprobe is triggered and kernel is handling it in the breakpoint
    handler code, temporary refcount bump is done to keep uprobe from
    being freed;
  - if we have uretprobe requested on a given struct uprobe instance, we
    take another refcount to keep uprobe alive until user space code
    returns from the function and triggers return handler.

The uprobes_tree's extra refcount of 1 is confusing and problematic. No
matter how many actual consumers are attached, they all share the same
refcount, and we need extra logic to drop the "last" (which might not
really be the last) refcount once the uprobe's consumer list becomes empty.

This is unconventional and has to be kept in mind as a special case all
the time. Further, because of this design we have situations where
find_uprobe() will find the uprobe, bump its refcount, and return it to
the caller, but that uprobe still needs an uprobe_is_active() check,
after which the caller is required to drop the refcount and try again.
This is just too many details leaking into the higher-level logic.

This patch changes the refcounting scheme so that uprobes_tree no longer
keeps an extra refcount for struct uprobe. Instead, each uprobe_consumer
holds its own refcount, which is dropped when the consumer is
unregistered. Other than that, all the active users of uprobe (entry and
return uprobe handling code) keep exactly the same refcounting approach.

With the above setup, once a uprobe's refcount drops to zero, its
"destructor" needs to remove the uprobe from uprobes_tree, of course.
This, though, races with the uprobe entry handling code in handle_swbp(),
whose find_active_uprobe()->find_uprobe() lookup can run concurrently
with the uprobe being destroyed after its refcount drops to zero (e.g.,
due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
attempts to bump the refcount unless it is already zero. The caller needs
to guarantee that the uprobe instance won't be freed in parallel, which
is the case while we hold uprobes_treelock (for read or write, doesn't
matter).
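
In code, the lookup side described above boils down to roughly the
following (a sketch based on find_uprobe()/__find_uprobe() from the diff
below; unrelated details omitted):

	read_lock(&uprobes_treelock);
	uprobe = __find_uprobe(inode, offset);	/* rb_find() + try_get_uprobe() */
	read_unlock(&uprobes_treelock);
	/*
	 * NULL means the uprobe's refcount already hit zero and it is being
	 * destroyed; the caller simply treats it as "not found".
	 */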

Note also that we no longer leak the race between registration and
unregistration to callers, so the retry logic is removed completely. If
find_uprobe() returns a valid uprobe, it's guaranteed to remain in
uprobes_tree with a properly incremented refcount. The race is handled by
__insert_uprobe() and put_uprobe() working together: __insert_uprobe()
removes the stale uprobe from the RB-tree if it can't bump its refcount,
and retries inserting the new uprobe instance; put_uprobe() won't attempt
to remove the uprobe from the RB-tree if it's already not there. All of
that is protected by uprobes_treelock, which keeps things simple.
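
Condensed from the diff below, the two cooperating sides look roughly
like this (a sketch keeping only the race-relevant lines):

	/* __insert_uprobe(), called under write_lock(&uprobes_treelock) */
again:
	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
	if (node) {
		struct uprobe *u = __node_2_uprobe(node);

		if (!try_get_uprobe(u)) {
			/* refcount hit zero, put_uprobe() just hasn't erased it yet */
			rb_erase(node, &uprobes_tree);
			RB_CLEAR_NODE(&u->rb_node);
			goto again;
		}
		return u;		/* reuse the existing uprobe */
	}
	return uprobe;			/* new instance is in the tree with refcount 1 */

	/* put_uprobe(), once the refcount drops to zero */
	write_lock(&uprobes_treelock);
	if (uprobe_is_active(uprobe))	/* may already be gone, see above */
		rb_erase(&uprobe->rb_node, &uprobes_tree);
	write_unlock(&uprobes_treelock);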

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/events/uprobes.c | 163 +++++++++++++++++++++++-----------------
 1 file changed, 93 insertions(+), 70 deletions(-)

Comments

Jiri Olsa Aug. 1, 2024, 11:09 a.m. UTC | #1
On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:

SNIP

>  static void put_uprobe(struct uprobe *uprobe)
>  {
> -	if (refcount_dec_and_test(&uprobe->ref)) {
> -		/*
> -		 * If application munmap(exec_vma) before uprobe_unregister()
> -		 * gets called, we don't get a chance to remove uprobe from
> -		 * delayed_uprobe_list from remove_breakpoint(). Do it here.
> -		 */
> -		mutex_lock(&delayed_uprobe_lock);
> -		delayed_uprobe_remove(uprobe, NULL);
> -		mutex_unlock(&delayed_uprobe_lock);
> -		kfree(uprobe);
> -	}
> +	if (!refcount_dec_and_test(&uprobe->ref))
> +		return;
> +
> +	write_lock(&uprobes_treelock);
> +
> +	if (uprobe_is_active(uprobe))
> +		rb_erase(&uprobe->rb_node, &uprobes_tree);
> +
> +	write_unlock(&uprobes_treelock);
> +
> +	/*
> +	 * If application munmap(exec_vma) before uprobe_unregister()
> +	 * gets called, we don't get a chance to remove uprobe from
> +	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
> +	 */
> +	mutex_lock(&delayed_uprobe_lock);
> +	delayed_uprobe_remove(uprobe, NULL);
> +	mutex_unlock(&delayed_uprobe_lock);

we should do kfree(uprobe) in here, right?

I think this is fixed later on when uprobe_free_rcu is introduced
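
(For reference, the fixed-up tail of put_uprobe() would presumably look
like the below, with kfree() restored; a later patch in the series then
turns it into an RCU-deferred free:)

	mutex_lock(&delayed_uprobe_lock);
	delayed_uprobe_remove(uprobe, NULL);
	mutex_unlock(&delayed_uprobe_lock);

	kfree(uprobe);	/* the call that went missing in this revision */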

SNIP

> @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
>  	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
>  		return ERR_PTR(-EINVAL);
>  
> - retry:
>  	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
>  	if (IS_ERR(uprobe))
>  		return uprobe;
>  
> -	/*
> -	 * We can race with uprobe_unregister()->delete_uprobe().
> -	 * Check uprobe_is_active() and retry if it is false.
> -	 */
>  	down_write(&uprobe->register_rwsem);
> -	ret = -EAGAIN;
> -	if (likely(uprobe_is_active(uprobe))) {
> -		consumer_add(uprobe, uc);
> -		ret = register_for_each_vma(uprobe, uc);
> -	}
> +	consumer_add(uprobe, uc);
> +	ret = register_for_each_vma(uprobe, uc);
>  	up_write(&uprobe->register_rwsem);
> -	put_uprobe(uprobe);
>  
>  	if (ret) {
> -		if (unlikely(ret == -EAGAIN))
> -			goto retry;

nice, I like getting rid of this.. so far lgtm ;-)

jirka


>  		uprobe_unregister(uprobe, uc);
>  		return ERR_PTR(ret);
>  	}
> @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
>  			u = rb_entry(t, struct uprobe, rb_node);
>  			if (u->inode != inode || u->offset < min)
>  				break;
> +			u = try_get_uprobe(u);
> +			if (!u) /* uprobe already went away, safe to ignore */
> +				continue;
>  			list_add(&u->pending_list, head);
> -			get_uprobe(u);
>  		}
>  		for (t = n; (t = rb_next(t)); ) {
>  			u = rb_entry(t, struct uprobe, rb_node);
>  			if (u->inode != inode || u->offset > max)
>  				break;
> +			u = try_get_uprobe(u);
> +			if (!u) /* uprobe already went away, safe to ignore */
> +				continue;
>  			list_add(&u->pending_list, head);
> -			get_uprobe(u);
>  		}
>  	}
>  	read_unlock(&uprobes_treelock);
> @@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
>  			return -ENOMEM;
>  
>  		*n = *o;
> +		/*
> +		 * uprobe's refcnt has to be positive at this point, kept by
> +		 * utask->return_instances items; return_instances can't be
> +		 * removed right now, as task is blocked due to duping; so
> +		 * get_uprobe() is safe to use here.
> +		 */
>  		get_uprobe(n->uprobe);
>  		n->next = NULL;
>  
> @@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
>  		}
>  		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
>  	}
> -
> +	 /*
> +	  * uprobe's refcnt is positive, held by caller, so it's safe to
> +	  * unconditionally bump it one more time here
> +	  */
>  	ri->uprobe = get_uprobe(uprobe);
>  	ri->func = instruction_pointer(regs);
>  	ri->stack = user_stack_pointer(regs);
> -- 
> 2.43.0
>
Andrii Nakryiko Aug. 1, 2024, 4:49 p.m. UTC | #2
On Thu, Aug 1, 2024 at 4:09 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> > -     if (refcount_dec_and_test(&uprobe->ref)) {
> > -             /*
> > -              * If application munmap(exec_vma) before uprobe_unregister()
> > -              * gets called, we don't get a chance to remove uprobe from
> > -              * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > -              */
> > -             mutex_lock(&delayed_uprobe_lock);
> > -             delayed_uprobe_remove(uprobe, NULL);
> > -             mutex_unlock(&delayed_uprobe_lock);
> > -             kfree(uprobe);
> > -     }
> > +     if (!refcount_dec_and_test(&uprobe->ref))
> > +             return;
> > +
> > +     write_lock(&uprobes_treelock);
> > +
> > +     if (uprobe_is_active(uprobe))
> > +             rb_erase(&uprobe->rb_node, &uprobes_tree);
> > +
> > +     write_unlock(&uprobes_treelock);
> > +
> > +     /*
> > +      * If application munmap(exec_vma) before uprobe_unregister()
> > +      * gets called, we don't get a chance to remove uprobe from
> > +      * delayed_uprobe_list from remove_breakpoint(). Do it here.
> > +      */
> > +     mutex_lock(&delayed_uprobe_lock);
> > +     delayed_uprobe_remove(uprobe, NULL);
> > +     mutex_unlock(&delayed_uprobe_lock);
>
> we should do kfree(uprobe) in here, right?

heh, yep, seems like I lost it while rebasing or something, good catch! fixed.


>
> I think this is fixed later on when uprobe_free_rcu is introduced
>
> SNIP
>
> > @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
> >       if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> >               return ERR_PTR(-EINVAL);
> >
> > - retry:
> >       uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> >       if (IS_ERR(uprobe))
> >               return uprobe;
> >
> > -     /*
> > -      * We can race with uprobe_unregister()->delete_uprobe().
> > -      * Check uprobe_is_active() and retry if it is false.
> > -      */
> >       down_write(&uprobe->register_rwsem);
> > -     ret = -EAGAIN;
> > -     if (likely(uprobe_is_active(uprobe))) {
> > -             consumer_add(uprobe, uc);
> > -             ret = register_for_each_vma(uprobe, uc);
> > -     }
> > +     consumer_add(uprobe, uc);
> > +     ret = register_for_each_vma(uprobe, uc);
> >       up_write(&uprobe->register_rwsem);
> > -     put_uprobe(uprobe);
> >
> >       if (ret) {
> > -             if (unlikely(ret == -EAGAIN))
> > -                     goto retry;
>
> nice, I like getting rid of this.. so far lgtm ;-)
>
> jirka
>
>
> >               uprobe_unregister(uprobe, uc);
> >               return ERR_PTR(ret);
> >       }
> > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> >                       u = rb_entry(t, struct uprobe, rb_node);
> >                       if (u->inode != inode || u->offset < min)
> >                               break;
> > +                     u = try_get_uprobe(u);
> > +                     if (!u) /* uprobe already went away, safe to ignore */
> > +                             continue;
> >                       list_add(&u->pending_list, head);
> > -                     get_uprobe(u);
> >               }
> >               for (t = n; (t = rb_next(t)); ) {
> >                       u = rb_entry(t, struct uprobe, rb_node);
> >                       if (u->inode != inode || u->offset > max)
> >                               break;
> > +                     u = try_get_uprobe(u);
> > +                     if (!u) /* uprobe already went away, safe to ignore */
> > +                             continue;
> >                       list_add(&u->pending_list, head);
> > -                     get_uprobe(u);
> >               }
> >       }
> >       read_unlock(&uprobes_treelock);
> > @@ -1752,6 +1766,12 @@ static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
> >                       return -ENOMEM;
> >
> >               *n = *o;
> > +             /*
> > +              * uprobe's refcnt has to be positive at this point, kept by
> > +              * utask->return_instances items; return_instances can't be
> > +              * removed right now, as task is blocked due to duping; so
> > +              * get_uprobe() is safe to use here.
> > +              */
> >               get_uprobe(n->uprobe);
> >               n->next = NULL;
> >
> > @@ -1894,7 +1914,10 @@ static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
> >               }
> >               orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
> >       }
> > -
> > +      /*
> > +       * uprobe's refcnt is positive, held by caller, so it's safe to
> > +       * unconditionally bump it one more time here
> > +       */
> >       ri->uprobe = get_uprobe(uprobe);
> >       ri->func = instruction_pointer(regs);
> >       ri->stack = user_stack_pointer(regs);
> > --
> > 2.43.0
> >
Andrii Nakryiko Aug. 1, 2024, 10:07 p.m. UTC | #3
On Wed, Jul 31, 2024 at 2:43 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Revamp how struct uprobe is refcounted, and thus how its lifetime is
> managed.
>
> Right now, there are a few possible "owners" of uprobe refcount:
>   - uprobes_tree RB tree assumes one refcount when uprobe is registered
>     and added to the lookup tree;
>   - while uprobe is triggered and kernel is handling it in the breakpoint
>     handler code, temporary refcount bump is done to keep uprobe from
>     being freed;
>   - if we have uretprobe requested on a given struct uprobe instance, we
>     take another refcount to keep uprobe alive until user space code
>     returns from the function and triggers return handler.
>
> The uprobe_tree's extra refcount of 1 is confusing and problematic. No
> matter how many actual consumers are attached, they all share the same
> refcount, and we have an extra logic to drop the "last" (which might not
> really be last) refcount once uprobe's consumer list becomes empty.
>
> This is unconventional and has to be kept in mind as a special case all
> the time. Further, because of this design we have the situations where
> find_uprobe() will find uprobe, bump refcount, return it to the caller,
> but that uprobe will still need uprobe_is_active() check, after which
> the caller is required to drop refcount and try again. This is just too
> many details leaking to the higher level logic.
>
> This patch changes refcounting scheme in such a way as to not have
> uprobes_tree keeping extra refcount for struct uprobe. Instead, each
> uprobe_consumer is assuming its own refcount, which will be dropped
> when consumer is unregistered. Other than that, all the active users of
> uprobe (entry and return uprobe handling code) keeps exactly the same
> refcounting approach.
>
> With the above setup, once uprobe's refcount drops to zero, we need to
> make sure that uprobe's "destructor" removes uprobe from uprobes_tree,
> of course. This, though, races with uprobe entry handling code in
> handle_swbp(), which, through find_active_uprobe()->find_uprobe() lookup,
> can race with uprobe being destroyed after refcount drops to zero (e.g.,
> due to uprobe_consumer unregistering). So we add try_get_uprobe(), which
> will attempt to bump refcount, unless it already is zero. Caller needs
> to guarantee that uprobe instance won't be freed in parallel, which is
> the case while we keep uprobes_treelock (for read or write, doesn't
> matter).
>
> Note also, we now don't leak the race between registration and
> unregistration, so we remove the retry logic completely. If
> find_uprobe() returns valid uprobe, it's guaranteed to remain in
> uprobes_tree with properly incremented refcount. The race is handled
> inside __insert_uprobe() and put_uprobe() working together:
> __insert_uprobe() will remove uprobe from RB-tree, if it can't bump
> refcount and will retry to insert the new uprobe instance. put_uprobe()
> won't attempt to remove uprobe from RB-tree, if it's already not there.
> All that is protected by uprobes_treelock, which keeps things simple.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/events/uprobes.c | 163 +++++++++++++++++++++++-----------------
>  1 file changed, 93 insertions(+), 70 deletions(-)
>

[...]

> @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>         int err;
>
>         down_write(&uprobe->register_rwsem);
> -       if (WARN_ON(!consumer_del(uprobe, uc)))
> +       if (WARN_ON(!consumer_del(uprobe, uc))) {
>                 err = -ENOENT;
> -       else
> +       } else {
>                 err = register_for_each_vma(uprobe, NULL);
> -
> -       /* TODO : cant unregister? schedule a worker thread */
> -       if (!err) {
> -               if (!uprobe->consumers)
> -                       delete_uprobe(uprobe);
> -               else
> -                       err = -EBUSY;
> +               /* TODO : cant unregister? schedule a worker thread */
> +               WARN(err, "leaking uprobe due to failed unregistration");

Ok, so now that I added this very loud warning if
register_for_each_vma(uprobe, NULL) returns error, it turns out it's
not that unusual for this unregistration to fail. If I run my
uprobe-stress for just a little bit, and then terminate it with ^C, I
get this splat:

[ 1980.854229] leaking uprobe due to failed unregistration
[ 1980.854244] WARNING: CPU: 3 PID: 23013 at kernel/events/uprobes.c:1123 uprobe_unregister_nosync+0x68/0x80
[ 1980.855356] Modules linked in: aesni_intel(E) crypto_simd(E) cryptd(E) kvm_intel(E) kvm(E) floppy(E) i2c_piix4(E) i2c_]
[ 1980.856746] CPU: 3 UID: 0 PID: 23013 Comm: exe Tainted: G        W OE      6.11.0-rc1-00032-g308d1f294b79 #129
[ 1980.857407] Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1980.857788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04
[ 1980.858489] RIP: 0010:uprobe_unregister_nosync+0x68/0x80
[ 1980.858826] Code: 6e fb ff ff 4c 89 e7 89 c3 e8 24 e8 e3 ff 85 db 75 0c 5b 48 89 ef 5d 41 5c e9 84 e5 ff ff 48 c7 c7 d0
[ 1980.860052] RSP: 0018:ffffc90002fb7e58 EFLAGS: 00010296
[ 1980.860428] RAX: 000000000000002b RBX: 00000000fffffffc RCX: 0000000000000000
[ 1980.860913] RDX: 0000000000000002 RSI: 0000000000000027 RDI: 00000000ffffffff
[ 1980.861379] RBP: ffff88811159ac00 R08: 00000000fffeffff R09: 0000000000000001
[ 1980.861871] R10: 0000000000000000 R11: ffffffff83299920 R12: ffff88811159ac20
[ 1980.862340] R13: ffff88810153c7a0 R14: ffff88810c3fe000 R15: 0000000000000000
[ 1980.862830] FS:  0000000000000000(0000) GS:ffff88881ca00000(0000) knlGS:0000000000000000
[ 1980.863370] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1980.863758] CR2: 00007fa08aea8276 CR3: 000000010f59c005 CR4: 0000000000370ef0
[ 1980.864239] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1980.864708] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1980.865202] Call Trace:
[ 1980.865356]  <TASK>
[ 1980.865524]  ? __warn+0x80/0x180
[ 1980.865745]  ? uprobe_unregister_nosync+0x68/0x80
[ 1980.866074]  ? report_bug+0x18d/0x1c0
[ 1980.866326]  ? handle_bug+0x3a/0x70
[ 1980.866568]  ? exc_invalid_op+0x13/0x60
[ 1980.866836]  ? asm_exc_invalid_op+0x16/0x20
[ 1980.867098]  ? uprobe_unregister_nosync+0x68/0x80
[ 1980.867390]  ? uprobe_unregister_nosync+0x68/0x80
[ 1980.867726]  bpf_uprobe_multi_link_release+0x31/0xd0
[ 1980.868044]  bpf_link_free+0x54/0xd0
[ 1980.868267]  bpf_link_release+0x17/0x20
[ 1980.868542]  __fput+0x102/0x2e0
[ 1980.868760]  task_work_run+0x55/0xa0
[ 1980.869027]  syscall_exit_to_user_mode+0x1dd/0x1f0
[ 1980.869344]  do_syscall_64+0x70/0x140
[ 1980.869603]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1980.869923] RIP: 0033:0x7fa08aea82a0
[ 1980.870171] Code: Unable to access opcode bytes at 0x7fa08aea8276.
[ 1980.870587] RSP: 002b:00007ffe838cd030 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[ 1980.871098] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 1980.871563] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1980.872055] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 1980.872526] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 1980.873044] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1980.873568]  </TASK>


I traced it a little bit with retsnoop to figure out where this is
coming from, and here we go:

14:53:18.897165 -> 14:53:18.897171 TID/PID 23013/23013 (exe/exe):

FUNCTION CALLS                RESULT    DURATION
---------------------------   --------  --------
→ uprobe_write_opcode
    → get_user_pages_remote
        ↔ __get_user_pages    [-EINTR]   1.382us
    ← get_user_pages_remote   [-EINTR]   4.889us
← uprobe_write_opcode         [-EINTR]   6.908us

                   entry_SYSCALL_64_after_hwframe+0x76  (entry_SYSCALL_64 @ arch/x86/entry/entry_64.S:130)
                   do_syscall_64+0x70                   (arch/x86/entry/common.c:102)
                   syscall_exit_to_user_mode+0x1dd      (kernel/entry/common.c:218)
                   . __syscall_exit_to_user_mode_work   (kernel/entry/common.c:207)
                   . exit_to_user_mode_prepare          (include/linux/entry-common.h:328)
                   . exit_to_user_mode_loop             (kernel/entry/common.c:114)
                   . resume_user_mode_work              (include/linux/resume_user_mode.h:50)
                   task_work_run+0x55                   (kernel/task_work.c:222)
                   __fput+0x102                         (fs/file_table.c:422)
                   bpf_link_release+0x17                (kernel/bpf/syscall.c:3116)
                   bpf_link_free+0x54                   (kernel/bpf/syscall.c:3067)
                   bpf_uprobe_multi_link_release+0x31   (kernel/trace/bpf_trace.c:3198)
                   . bpf_uprobe_unregister              (kernel/trace/bpf_trace.c:3186)
                   uprobe_unregister_nosync+0x42        (kernel/events/uprobes.c:1120)
                   register_for_each_vma+0x427          (kernel/events/uprobes.c:1092)
     6us [-EINTR]  uprobe_write_opcode+0x79             (kernel/events/uprobes.c:478)
                   . get_user_page_vma_remote           (include/linux/mm.h:2489)
     4us [-EINTR]  get_user_pages_remote+0x109          (mm/gup.c:2627)
                   . __get_user_pages_locked            (mm/gup.c:1762)
!    1us [-EINTR]  __get_user_pages


So, we do uprobe_unregister -> register_for_each_vma ->
remove_breakpoint -> set_orig_insn -> uprobe_write_opcode ->
get_user_page_vma_remote -> get_user_pages_remote ->
__get_user_pages_locked -> __get_user_pages, and I think we then hit
the `if (fatal_signal_pending(current)) return -EINTR;` check.


So, is there something smarter we can do in this case besides leaking
an uprobe (and note, my changes don't change this behavior)?

I can of course just drop the WARN given it's sort of expected now,
but if we can handle that more gracefully it would be good. I don't
think that should block optimization work, but just something to keep
in mind and maybe fix as a follow up.


>         }
>         up_write(&uprobe->register_rwsem);
>

[...]
Oleg Nesterov Aug. 2, 2024, 8:50 a.m. UTC | #4
On 08/01, Andrii Nakryiko wrote:
>
> > +               /* TODO : cant unregister? schedule a worker thread */
> > +               WARN(err, "leaking uprobe due to failed unregistration");

> Ok, so now that I added this very loud warning if
> register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> not that unusual for this unregistration to fail.

...

> So, is there something smarter we can do in this case besides leaking
> an uprobe (and note, my changes don't change this behavior)?

Something like schedule_work() which retries register_for_each_vma()...

> I can of course just drop the WARN given it's sort of expected now,

Or least replace it with pr_warn() or uprobe_warn(), WARN() certainly
makes no sense imo...

> I don't
> think that should block optimization work, but just something to keep
> in mind and maybe fix as a follow up.

Agreed, lets do this separately.

Oleg.
Jiri Olsa Aug. 2, 2024, 11:11 a.m. UTC | #5
On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:

SNIP

> -/*
> - * There could be threads that have already hit the breakpoint. They
> - * will recheck the current insn and restart if find_uprobe() fails.
> - * See find_active_uprobe().
> - */
> -static void delete_uprobe(struct uprobe *uprobe)
> -{
> -	if (WARN_ON(!uprobe_is_active(uprobe)))
> -		return;
> -
> -	write_lock(&uprobes_treelock);
> -	rb_erase(&uprobe->rb_node, &uprobes_tree);
> -	write_unlock(&uprobes_treelock);
> -	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> -}
> -
>  struct map_info {
>  	struct map_info *next;
>  	struct mm_struct *mm;
> @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
>  	int err;
>  
>  	down_write(&uprobe->register_rwsem);
> -	if (WARN_ON(!consumer_del(uprobe, uc)))
> +	if (WARN_ON(!consumer_del(uprobe, uc))) {
>  		err = -ENOENT;
> -	else
> +	} else {
>  		err = register_for_each_vma(uprobe, NULL);
> -
> -	/* TODO : cant unregister? schedule a worker thread */
> -	if (!err) {
> -		if (!uprobe->consumers)
> -			delete_uprobe(uprobe);

ok, so removing this call is why the consumer test is failing, right?

IIUC the previous behaviour was to remove uprobe from the tree
even when there's active uprobe ref for installed uretprobe

so following scenario will now behaves differently:

  install uretprobe/consumer-1 on foo
  foo {
    remove uretprobe/consumer-1                (A)
    install uretprobe/consumer-2 on foo        (B)
  }

before, the removal of consumer-1 (A) would remove the uprobe object
from the tree, so the installation of consumer-2 (B) would create a
new uprobe object which would not be triggered at foo's return because
it got installed too late (after the foo uprobe was triggered)

the behaviour with this patch is that removal of consumer-1 (A) will
not remove the uprobe object (that happens only when we run out of
refs), and the following install of consumer-2 will use the existing
uprobe object, so consumer-2 will be triggered on foo's return

uff ;-)

but I think it's better, because we get more hits

jirka

> -		else
> -			err = -EBUSY;
> +		/* TODO : cant unregister? schedule a worker thread */
> +		WARN(err, "leaking uprobe due to failed unregistration");
>  	}
>  	up_write(&uprobe->register_rwsem);
>  
> @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
>  	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
>  		return ERR_PTR(-EINVAL);
>  
> - retry:
>  	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
>  	if (IS_ERR(uprobe))
>  		return uprobe;
>  
> -	/*
> -	 * We can race with uprobe_unregister()->delete_uprobe().
> -	 * Check uprobe_is_active() and retry if it is false.
> -	 */
>  	down_write(&uprobe->register_rwsem);
> -	ret = -EAGAIN;
> -	if (likely(uprobe_is_active(uprobe))) {
> -		consumer_add(uprobe, uc);
> -		ret = register_for_each_vma(uprobe, uc);
> -	}
> +	consumer_add(uprobe, uc);
> +	ret = register_for_each_vma(uprobe, uc);
>  	up_write(&uprobe->register_rwsem);
> -	put_uprobe(uprobe);
>  
>  	if (ret) {
> -		if (unlikely(ret == -EAGAIN))
> -			goto retry;
>  		uprobe_unregister(uprobe, uc);
>  		return ERR_PTR(ret);

SNIP
Andrii Nakryiko Aug. 2, 2024, 2:58 p.m. UTC | #6
On Fri, Aug 2, 2024 at 1:50 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 08/01, Andrii Nakryiko wrote:
> >
> > > +               /* TODO : cant unregister? schedule a worker thread */
> > > +               WARN(err, "leaking uprobe due to failed unregistration");
>
> > Ok, so now that I added this very loud warning if
> > register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> > not that unusual for this unregistration to fail.
>
> ...
>
> > So, is there something smarter we can do in this case besides leaking
> > an uprobe (and note, my changes don't change this behavior)?
>
> Something like schedule_work() which retries register_for_each_vma()...

And if that fails again, what do we do? Because I don't think we even
need schedule_work(), we can just keep some list of "pending to be
retried" items and check them after each
uprobe_register()/uprobe_unregister() call. I'm just not clear how we
should handle stubborn cases (but honestly I haven't even tried to
understand all the details about this just yet).

>
> > I can of course just drop the WARN given it's sort of expected now,
>
> Or least replace it with pr_warn() or uprobe_warn(), WARN() certainly
> makes no sense imo...
>

ok, makes sense, will change to uprobe_warn()

> > I don't
> > think that should block optimization work, but just something to keep
> > in mind and maybe fix as a follow up.
>
> Agreed, lets do this separately.
>

yep

> Oleg.
>
Andrii Nakryiko Aug. 2, 2024, 3:03 p.m. UTC | #7
On Fri, Aug 2, 2024 at 4:11 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 02:42:50PM -0700, Andrii Nakryiko wrote:
>
> SNIP
>
> > -/*
> > - * There could be threads that have already hit the breakpoint. They
> > - * will recheck the current insn and restart if find_uprobe() fails.
> > - * See find_active_uprobe().
> > - */
> > -static void delete_uprobe(struct uprobe *uprobe)
> > -{
> > -     if (WARN_ON(!uprobe_is_active(uprobe)))
> > -             return;
> > -
> > -     write_lock(&uprobes_treelock);
> > -     rb_erase(&uprobe->rb_node, &uprobes_tree);
> > -     write_unlock(&uprobes_treelock);
> > -     RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > -}
> > -
> >  struct map_info {
> >       struct map_info *next;
> >       struct mm_struct *mm;
> > @@ -1094,17 +1120,12 @@ void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
> >       int err;
> >
> >       down_write(&uprobe->register_rwsem);
> > -     if (WARN_ON(!consumer_del(uprobe, uc)))
> > +     if (WARN_ON(!consumer_del(uprobe, uc))) {
> >               err = -ENOENT;
> > -     else
> > +     } else {
> >               err = register_for_each_vma(uprobe, NULL);
> > -
> > -     /* TODO : cant unregister? schedule a worker thread */
> > -     if (!err) {
> > -             if (!uprobe->consumers)
> > -                     delete_uprobe(uprobe);
>
> ok, so removing this call is why the consumer test is failing, right?
>
> IIUC the previous behaviour was to remove uprobe from the tree
> even when there's active uprobe ref for installed uretprobe
>
> so following scenario will now behaves differently:
>
>   install uretprobe/consumer-1 on foo
>   foo {
>     remove uretprobe/consumer-1                (A)
>     install uretprobe/consumer-2 on foo        (B)
>   }
>
> before the removal of consumer-1 (A) would remove the uprobe object
> from the tree, so the installation of consumer-2 (b) would create
> new uprobe object which would not be triggered at foo return because
> it got installed too late (after foo uprobe was triggered)
>
> the behaviour with this patch is that removal of consumer-1 (A) will
> not remove the uprobe object (that happens only when we run out of
> refs), and the following install of consumer-2 will use the existing
> uprobe object so the consumer-2 will be triggered on foo return
>
> uff ;-)

yep, something like that

>
> but I think it's better, because we get more hits

note, with the next patch set that makes uretprobes SRCU protected
(but with timeout) the behavior becomes a bit time-sensitive :) so I
think we'll have to change your selftest to first attach all the new
uretprobes, then detach all the uretprobes that had to be detached,
and then do a bit more relaxed logic of the sort "if there were some
uretprobes before and after, then we *might* get uretprobe triggered
(but we might not as well, unless the same uretprobe stayed attached
at all times)".

Anyways, something to take care of in the bpf-next tree separately.

All this is very much an implementation detail, so I think we can
change these aspects freely.

>
> jirka
>
> > -             else
> > -                     err = -EBUSY;
> > +             /* TODO : cant unregister? schedule a worker thread */
> > +             WARN(err, "leaking uprobe due to failed unregistration");
> >       }
> >       up_write(&uprobe->register_rwsem);
> >
> > @@ -1159,27 +1180,16 @@ struct uprobe *uprobe_register(struct inode *inode,
> >       if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
> >               return ERR_PTR(-EINVAL);
> >
> > - retry:
> >       uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
> >       if (IS_ERR(uprobe))
> >               return uprobe;
> >
> > -     /*
> > -      * We can race with uprobe_unregister()->delete_uprobe().
> > -      * Check uprobe_is_active() and retry if it is false.
> > -      */
> >       down_write(&uprobe->register_rwsem);
> > -     ret = -EAGAIN;
> > -     if (likely(uprobe_is_active(uprobe))) {
> > -             consumer_add(uprobe, uc);
> > -             ret = register_for_each_vma(uprobe, uc);
> > -     }
> > +     consumer_add(uprobe, uc);
> > +     ret = register_for_each_vma(uprobe, uc);
> >       up_write(&uprobe->register_rwsem);
> > -     put_uprobe(uprobe);
> >
> >       if (ret) {
> > -             if (unlikely(ret == -EAGAIN))
> > -                     goto retry;
> >               uprobe_unregister(uprobe, uc);
> >               return ERR_PTR(ret);
>
> SNIP
Oleg Nesterov Aug. 2, 2024, 10:19 p.m. UTC | #8
On 08/02, Andrii Nakryiko wrote:
>
> On Fri, Aug 2, 2024 at 1:50 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 08/01, Andrii Nakryiko wrote:
> > >
> > > > +               /* TODO : cant unregister? schedule a worker thread */
> > > > +               WARN(err, "leaking uprobe due to failed unregistration");
> >
> > > Ok, so now that I added this very loud warning if
> > > register_for_each_vma(uprobe, NULL) returns error, it turns out it's
> > > not that unusual for this unregistration to fail.
> >
> > ...
> >
> > > So, is there something smarter we can do in this case besides leaking
> > > an uprobe (and note, my changes don't change this behavior)?
> >
> > Something like schedule_work() which retries register_for_each_vma()...
>
> And if that fails again, what do we do?

try again after some timeout ;)

> Because I don't think we even
> need schedule_work(), we can just keep some list of "pending to be
> retried" items and check them after each
> uprobe_register()/uprobe_unregister() call.

Agreed, we need a list of "pending to be retried", but rightly or not
I think this should be done from work_struct->func.

Lets discuss this later? We seem to agree this is a known problem, and
this has nothing to do with your optimizations.
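
(For illustration only, a rough sketch of the deferred retry being
discussed; the struct uprobe retry_work field and
uprobe_retry_unregister_work() below are made up for this sketch and do
not exist in the series:)

	/* hypothetical: re-run the failed register_for_each_vma(uprobe, NULL) later */
	static void uprobe_retry_unregister_work(struct work_struct *work)
	{
		struct uprobe *uprobe = container_of(to_delayed_work(work),
						     struct uprobe, retry_work);
		int err;

		down_write(&uprobe->register_rwsem);
		err = register_for_each_vma(uprobe, NULL);
		up_write(&uprobe->register_rwsem);

		if (err) {
			/* still failing (e.g. another fatal-signal -EINTR), retry after a timeout */
			schedule_delayed_work(&uprobe->retry_work, msecs_to_jiffies(100));
			return;
		}
		put_uprobe(uprobe);	/* drop the reference taken when the retry was queued */
	}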

> I'm just not clear how we
> should handle stubborn cases (but honestly I haven't even tried to
> understand all the details about this just yet).

Same here.

Oleg.
Oleg Nesterov Aug. 5, 2024, 1:44 p.m. UTC | #9
On 07/31, Andrii Nakryiko wrote:
>
> @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
>  	uprobe->ref_ctr_offset = ref_ctr_offset;
>  	init_rwsem(&uprobe->register_rwsem);
>  	init_rwsem(&uprobe->consumer_rwsem);
> +	RB_CLEAR_NODE(&uprobe->rb_node);

I guess RB_CLEAR_NODE() is not necessary?

> @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
>  			u = rb_entry(t, struct uprobe, rb_node);
>  			if (u->inode != inode || u->offset < min)
>  				break;
> +			u = try_get_uprobe(u);
> +			if (!u) /* uprobe already went away, safe to ignore */
> +				continue;
>  			list_add(&u->pending_list, head);

cosmetic nit, feel to ignore, but to me

			if (try_get_uprobe(u))
				list_add(&u->pending_list, head);

looks more readable.

Other than the lack of kfree() in put_uprobe() and WARN() in _unregister()
the patch looks good to me.

Oleg.
Andrii Nakryiko Aug. 5, 2024, 5:29 p.m. UTC | #10
On Mon, Aug 5, 2024 at 6:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 07/31, Andrii Nakryiko wrote:
> >
> > @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> >       uprobe->ref_ctr_offset = ref_ctr_offset;
> >       init_rwsem(&uprobe->register_rwsem);
> >       init_rwsem(&uprobe->consumer_rwsem);
> > +     RB_CLEAR_NODE(&uprobe->rb_node);
>
> I guess RB_CLEAR_NODE() is not necessary?

I definitely needed that with my batch API changes, but it might be
that I don't need it anymore. But I'm a bit hesitant to remove it,
because if we ever get put_uprobe() on an uprobe that hasn't been
inserted into RB-tree yet, this will cause a hard to understand crash.
RB_CLEAR_NODE() in __insert_uprobe() is critical to have, this one is
kind of optional (but still feels right to initialize the field
properly).

Let me know if you feel strongly about this, though.

>
> > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> >                       u = rb_entry(t, struct uprobe, rb_node);
> >                       if (u->inode != inode || u->offset < min)
> >                               break;
> > +                     u = try_get_uprobe(u);
> > +                     if (!u) /* uprobe already went away, safe to ignore */
> > +                             continue;
> >                       list_add(&u->pending_list, head);
>
> cosmetic nit, feel to ignore, but to me
>
>                         if (try_get_uprobe(u))
>                                 list_add(&u->pending_list, head);
>
> looks more readable.

It's not my code base to enforce my preferences, but I'll at least
explain why I disagree. To me, something like `if (some condition)
<break/continue>;` is a very clear indication that this item (or even
the rest of items in case of break) won't be processed anymore.

While

if (some inverted condition)
   <do some something useful>
<might be some more code>

... is a pattern that requires double-checking that we really are not
going to use that item later on.

So I'll invert this just to not be PITA, but I disagree :)

>
> Other than the lack of kfree() in put_uprobe() and WARN() in _unregister()
> the patch looks good to me.

yep, fixed that locally already. Thanks for the review!

>
> Oleg.
>
Oleg Nesterov Aug. 6, 2024, 10:45 a.m. UTC | #11
On 08/05, Andrii Nakryiko wrote:
>
> On Mon, Aug 5, 2024 at 6:44 AM Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > On 07/31, Andrii Nakryiko wrote:
> > >
> > > @@ -732,11 +776,13 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> > >       uprobe->ref_ctr_offset = ref_ctr_offset;
> > >       init_rwsem(&uprobe->register_rwsem);
> > >       init_rwsem(&uprobe->consumer_rwsem);
> > > +     RB_CLEAR_NODE(&uprobe->rb_node);
> >
> > I guess RB_CLEAR_NODE() is not necessary?
>
> I definitely needed that with my batch API changes, but it might be
> that I don't need it anymore. But I'm a bit hesitant to remove it,

OK, lets keep it, it doesn't hurt. Just it wasn't clear to me why did
you add this initialization in this patch.

> > > @@ -1286,15 +1296,19 @@ static void build_probe_list(struct inode *inode,
> > >                       u = rb_entry(t, struct uprobe, rb_node);
> > >                       if (u->inode != inode || u->offset < min)
> > >                               break;
> > > +                     u = try_get_uprobe(u);
> > > +                     if (!u) /* uprobe already went away, safe to ignore */
> > > +                             continue;
> > >                       list_add(&u->pending_list, head);
> >
> > cosmetic nit, feel to ignore, but to me
> >
> >                         if (try_get_uprobe(u))
> >                                 list_add(&u->pending_list, head);
> >
> > looks more readable.
>
> It's not my code base to enforce my preferences, but I'll at least
> explain why I disagree. To me, something like `if (some condition)
> <break/continue>;` is a very clear indication that this item (or even
> the rest of items in case of break) won't be processed anymore.
>
> While
>
> if (some inverted condition)
>    <do some something useful>
> <might be some more code>

OK, I won't insist. To me the most confusing part is

	u = try_get_uprobe(u);
	if (!u)
	...

If you read this code for the 1st time (or you are trying to recall it
after 10 years ;) it looks as if try_get_uprobe() can return another uprobe.

> So I'll invert this just to not be PITA, but I disagree :)

If you disagree, then don't change it ;)

Oleg.

Patch

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index f88b7ff20587..23dde3ec5b09 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -587,25 +587,51 @@  set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long v
 			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
+/* uprobe should have guaranteed positive refcount */
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
 {
 	refcount_inc(&uprobe->ref);
 	return uprobe;
 }
 
+/*
+ * uprobe should have guaranteed lifetime, which can be either of:
+ *   - caller already has refcount taken (and wants an extra one);
+ *   - uprobe is RCU protected and won't be freed until after grace period;
+ *   - we are holding uprobes_treelock (for read or write, doesn't matter).
+ */
+static struct uprobe *try_get_uprobe(struct uprobe *uprobe)
+{
+	if (refcount_inc_not_zero(&uprobe->ref))
+		return uprobe;
+	return NULL;
+}
+
+static inline bool uprobe_is_active(struct uprobe *uprobe)
+{
+	return !RB_EMPTY_NODE(&uprobe->rb_node);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
-	if (refcount_dec_and_test(&uprobe->ref)) {
-		/*
-		 * If application munmap(exec_vma) before uprobe_unregister()
-		 * gets called, we don't get a chance to remove uprobe from
-		 * delayed_uprobe_list from remove_breakpoint(). Do it here.
-		 */
-		mutex_lock(&delayed_uprobe_lock);
-		delayed_uprobe_remove(uprobe, NULL);
-		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
-	}
+	if (!refcount_dec_and_test(&uprobe->ref))
+		return;
+
+	write_lock(&uprobes_treelock);
+
+	if (uprobe_is_active(uprobe))
+		rb_erase(&uprobe->rb_node, &uprobes_tree);
+
+	write_unlock(&uprobes_treelock);
+
+	/*
+	 * If application munmap(exec_vma) before uprobe_unregister()
+	 * gets called, we don't get a chance to remove uprobe from
+	 * delayed_uprobe_list from remove_breakpoint(). Do it here.
+	 */
+	mutex_lock(&delayed_uprobe_lock);
+	delayed_uprobe_remove(uprobe, NULL);
+	mutex_unlock(&delayed_uprobe_lock);
 }
 
 static __always_inline
@@ -656,7 +682,7 @@  static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 	struct rb_node *node = rb_find(&key, &uprobes_tree, __uprobe_cmp_key);
 
 	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+		return try_get_uprobe(__node_2_uprobe(node));
 
 	return NULL;
 }
@@ -676,26 +702,44 @@  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/*
+ * Attempt to insert a new uprobe into uprobes_tree.
+ *
+ * If uprobe already exists (for given inode+offset), we just increment
+ * refcount of previously existing uprobe.
+ *
+ * If not, a provided new instance of uprobe is inserted into the tree (with
+ * assumed initial refcount == 1).
+ *
+ * In any case, we return a uprobe instance that ends up being in uprobes_tree.
+ * Caller has to clean up new uprobe instance, if it ended up not being
+ * inserted into the tree.
+ *
+ * We assume that uprobes_treelock is held for writing.
+ */
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 {
 	struct rb_node *node;
-
+again:
 	node = rb_find_add(&uprobe->rb_node, &uprobes_tree, __uprobe_cmp);
-	if (node)
-		return get_uprobe(__node_2_uprobe(node));
+	if (node) {
+		struct uprobe *u = __node_2_uprobe(node);
 
-	/* get access + creation ref */
-	refcount_set(&uprobe->ref, 2);
-	return NULL;
+		if (!try_get_uprobe(u)) {
+			rb_erase(node, &uprobes_tree);
+			RB_CLEAR_NODE(&u->rb_node);
+			goto again;
+		}
+
+		return u;
+	}
+
+	return uprobe;
 }
 
 /*
- * Acquire uprobes_treelock.
- * Matching uprobe already exists in rbtree;
- *	increment (access refcount) and return the matching uprobe.
- *
- * No matching uprobe; insert the uprobe in rb_tree;
- *	get a double refcount (access + creation) and return NULL.
+ * Acquire uprobes_treelock and insert uprobe into uprobes_tree
+ * (or reuse existing one, see __insert_uprobe() comments above).
  */
 static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
@@ -732,11 +776,13 @@  static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
+	RB_CLEAR_NODE(&uprobe->rb_node);
+	refcount_set(&uprobe->ref, 1);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
-	if (cur_uprobe) {
+	if (cur_uprobe != uprobe) {
 		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset) {
 			ref_ctr_mismatch_warn(cur_uprobe, uprobe);
 			put_uprobe(cur_uprobe);
@@ -921,26 +967,6 @@  remove_breakpoint(struct uprobe *uprobe, struct mm_struct *mm, unsigned long vad
 	return set_orig_insn(&uprobe->arch, mm, vaddr);
 }
 
-static inline bool uprobe_is_active(struct uprobe *uprobe)
-{
-	return !RB_EMPTY_NODE(&uprobe->rb_node);
-}
-/*
- * There could be threads that have already hit the breakpoint. They
- * will recheck the current insn and restart if find_uprobe() fails.
- * See find_active_uprobe().
- */
-static void delete_uprobe(struct uprobe *uprobe)
-{
-	if (WARN_ON(!uprobe_is_active(uprobe)))
-		return;
-
-	write_lock(&uprobes_treelock);
-	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	write_unlock(&uprobes_treelock);
-	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
-}
-
 struct map_info {
 	struct map_info *next;
 	struct mm_struct *mm;
@@ -1094,17 +1120,12 @@  void uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 	int err;
 
 	down_write(&uprobe->register_rwsem);
-	if (WARN_ON(!consumer_del(uprobe, uc)))
+	if (WARN_ON(!consumer_del(uprobe, uc))) {
 		err = -ENOENT;
-	else
+	} else {
 		err = register_for_each_vma(uprobe, NULL);
-
-	/* TODO : cant unregister? schedule a worker thread */
-	if (!err) {
-		if (!uprobe->consumers)
-			delete_uprobe(uprobe);
-		else
-			err = -EBUSY;
+		/* TODO : cant unregister? schedule a worker thread */
+		WARN(err, "leaking uprobe due to failed unregistration");
 	}
 	up_write(&uprobe->register_rwsem);
 
@@ -1159,27 +1180,16 @@  struct uprobe *uprobe_register(struct inode *inode,
 	if (!IS_ALIGNED(ref_ctr_offset, sizeof(short)))
 		return ERR_PTR(-EINVAL);
 
- retry:
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (IS_ERR(uprobe))
 		return uprobe;
 
-	/*
-	 * We can race with uprobe_unregister()->delete_uprobe().
-	 * Check uprobe_is_active() and retry if it is false.
-	 */
 	down_write(&uprobe->register_rwsem);
-	ret = -EAGAIN;
-	if (likely(uprobe_is_active(uprobe))) {
-		consumer_add(uprobe, uc);
-		ret = register_for_each_vma(uprobe, uc);
-	}
+	consumer_add(uprobe, uc);
+	ret = register_for_each_vma(uprobe, uc);
 	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
 
 	if (ret) {
-		if (unlikely(ret == -EAGAIN))
-			goto retry;
 		uprobe_unregister(uprobe, uc);
 		return ERR_PTR(ret);
 	}
@@ -1286,15 +1296,19 @@  static void build_probe_list(struct inode *inode,
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset < min)
 				break;
+			u = try_get_uprobe(u);
+			if (!u) /* uprobe already went away, safe to ignore */
+				continue;
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 		for (t = n; (t = rb_next(t)); ) {
 			u = rb_entry(t, struct uprobe, rb_node);
 			if (u->inode != inode || u->offset > max)
 				break;
+			u = try_get_uprobe(u);
+			if (!u) /* uprobe already went away, safe to ignore */
+				continue;
 			list_add(&u->pending_list, head);
-			get_uprobe(u);
 		}
 	}
 	read_unlock(&uprobes_treelock);
@@ -1752,6 +1766,12 @@  static int dup_utask(struct task_struct *t, struct uprobe_task *o_utask)
 			return -ENOMEM;
 
 		*n = *o;
+		/*
+		 * uprobe's refcnt has to be positive at this point, kept by
+		 * utask->return_instances items; return_instances can't be
+		 * removed right now, as task is blocked due to duping; so
+		 * get_uprobe() is safe to use here.
+		 */
 		get_uprobe(n->uprobe);
 		n->next = NULL;
 
@@ -1894,7 +1914,10 @@  static void prepare_uretprobe(struct uprobe *uprobe, struct pt_regs *regs)
 		}
 		orig_ret_vaddr = utask->return_instances->orig_ret_vaddr;
 	}
-
+	 /*
+	  * uprobe's refcnt is positive, held by caller, so it's safe to
+	  * unconditionally bump it one more time here
+	  */
 	ri->uprobe = get_uprobe(uprobe);
 	ri->func = instruction_pointer(regs);
 	ri->stack = user_stack_pointer(regs);