mbox series

[v2,00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

Message ID 20240701223935.3783951-1-andrii@kernel.org (mailing list archive)
Headers show
Series uprobes: add batched register/unregister APIs and per-CPU RW semaphore | expand

Message

Andrii Nakryiko July 1, 2024, 10:39 p.m. UTC
This patch set, ultimately, switches global uprobes_treelock from RW spinlock
to per-CPU RW semaphore, which has better performance and scales better under
contention and multiple parallel threads triggering lots of uprobes.

To make this work well with attaching multiple uprobes (through BPF
multi-uprobe), we need to add batched versions of uprobe register/unregister
APIs. This is what most of the patch set is actually doing. The actual switch
to per-CPU RW semaphore is trivial after that and is done in the very last
patch #12. See commit message with some comparison numbers.

Patch #4 is probably the most important patch in the series, revamping uprobe
lifetime management and refcounting. See patch description and added code
comments for all the details.

With changes in patch #4, we open up the way to refactor uprobe_register() and
uprobe_unregister() implementations in such a way that we can avoid taking
uprobes_treelock many times during a single batched attachment/detachment.
This allows to accommodate a much higher latency of taking per-CPU RW
semaphore for write. The end result of this patch set is that attaching 50
thousand uprobes with BPF multi-uprobes doesn't regress and takes about 200ms
both before and after the changes in this patch set.

Patch #5 updates existing uprobe consumers to put all the relevant necessary
pieces into struct uprobe_consumer, without having to pass around
offset/ref_ctr_offset. Existing consumers already keep this data around, we
just formalize the interface.

Patches #6 through #10 add batched versions of register/unregister APIs and
gradually factor them in such a way as to allow taking single (batched)
uprobes_treelock, splitting the logic into multiple independent phases.

Patch #11 switched BPF multi-uprobes to batched uprobe APIs.

As mentioned, a very straightforward patch #12 takes advantage of all the prep
work and just switches uprobes_treelock to per-CPU RW semaphore.

v1->v2:
  - added RCU-delayed uprobe freeing to put_uprobe() (Masami);
  - fixed clean up handling in uprobe_register_batch (Jiri);
  - adjusted UPROBE_REFCNT_* constants to be more meaningful (Oleg);
  - dropped the "fix" to switch to write-protected mmap_sem, adjusted invalid
    comment instead (Oleg).

Andrii Nakryiko (12):
  uprobes: update outdated comment
  uprobes: correct mmap_sem locking assumptions in uprobe_write_opcode()
  uprobes: simplify error handling for alloc_uprobe()
  uprobes: revamp uprobe refcounting and lifetime management
  uprobes: move offset and ref_ctr_offset into uprobe_consumer
  uprobes: add batch uprobe register/unregister APIs
  uprobes: inline alloc_uprobe() logic into __uprobe_register()
  uprobes: split uprobe allocation and uprobes_tree insertion steps
  uprobes: batch uprobes_treelock during registration
  uprobes: improve lock batching for uprobe_unregister_batch
  uprobes,bpf: switch to batch uprobe APIs for BPF multi-uprobes
  uprobes: switch uprobes_treelock to per-CPU RW semaphore

 include/linux/uprobes.h                       |  29 +-
 kernel/events/uprobes.c                       | 550 ++++++++++++------
 kernel/trace/bpf_trace.c                      |  40 +-
 kernel/trace/trace_uprobe.c                   |  53 +-
 .../selftests/bpf/bpf_testmod/bpf_testmod.c   |  22 +-
 5 files changed, 447 insertions(+), 247 deletions(-)

Comments

Peter Zijlstra July 2, 2024, 10:23 a.m. UTC | #1
On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:
> This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> to per-CPU RW semaphore, which has better performance and scales better under
> contention and multiple parallel threads triggering lots of uprobes.

Why not RCU + normal lock thing?
Peter Zijlstra July 2, 2024, 11:54 a.m. UTC | #2
+LKML

On Tue, Jul 02, 2024 at 12:23:53PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:
> > This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> > to per-CPU RW semaphore, which has better performance and scales better under
> > contention and multiple parallel threads triggering lots of uprobes.
> 
> Why not RCU + normal lock thing?

Something like the *completely* untested below.

---
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 2c83ba776fc7..03b38f3f7be3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
 static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
+static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	refcount_t		ref;
+	struct rcu_head		rcu;
 	struct rw_semaphore	register_rwsem;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
@@ -67,7 +69,7 @@ struct uprobe {
 	 * The generic code assumes that it has two members of unknown type
 	 * owned by the arch-specific code:
 	 *
-	 * 	insn -	copy_insn() saves the original instruction here for
+	 *	insn -	copy_insn() saves the original instruction here for
 	 *		arch_uprobe_analyze_insn().
 	 *
 	 *	ixol -	potentially modified instruction to execute out of
@@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
 	return uprobe;
 }
 
+static void uprobe_free_rcu(struct rcu_head *rcu)
+{
+	struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
+	kfree(uprobe);
+}
+
 static void put_uprobe(struct uprobe *uprobe)
 {
 	if (refcount_dec_and_test(&uprobe->ref)) {
@@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
 		mutex_lock(&delayed_uprobe_lock);
 		delayed_uprobe_remove(uprobe, NULL);
 		mutex_unlock(&delayed_uprobe_lock);
-		kfree(uprobe);
+
+		call_rcu(&uprobe->rcu, uprobe_free_rcu);
 	}
 }
 
@@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
+	unsigned seq;
 
-	read_lock(&uprobes_treelock);
-	uprobe = __find_uprobe(inode, offset);
-	read_unlock(&uprobes_treelock);
+	guard(rcu)();
 
-	return uprobe;
+	do {
+		seq = read_seqcount_begin(&uprobes_seqcount);
+		uprobes = __find_uprobe(inode, offset);
+		if (uprobes) {
+			/*
+			 * Lockless RB-tree lookups are prone to false-negatives.
+			 * If they find something, it's good. If they do not find,
+			 * it needs to be validated.
+			 */
+			return uprobes;
+		}
+	} while (read_seqcount_retry(&uprobes_seqcount, seq));
+
+	/* Really didn't find anything. */
+	return NULL;
 }
 
 static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
@@ -702,7 +724,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	struct uprobe *u;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	u = __insert_uprobe(uprobe);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 
 	return u;
@@ -936,7 +960,9 @@ static void delete_uprobe(struct uprobe *uprobe)
 		return;
 
 	write_lock(&uprobes_treelock);
+	write_seqcount_begin(&uprobes_seqcount);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	write_seqcount_end(&uprobes_seqcount);
 	write_unlock(&uprobes_treelock);
 	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
 	put_uprobe(uprobe);
Peter Zijlstra July 2, 2024, 12:01 p.m. UTC | #3
On Tue, Jul 02, 2024 at 01:54:47PM +0200, Peter Zijlstra wrote:

> @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
>  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
>  {
>  	struct uprobe *uprobe;
> +	unsigned seq;
>  
> +	guard(rcu)();
>  
> +	do {
> +		seq = read_seqcount_begin(&uprobes_seqcount);
> +		uprobes = __find_uprobe(inode, offset);
> +		if (uprobes) {
> +			/*
> +			 * Lockless RB-tree lookups are prone to false-negatives.
> +			 * If they find something, it's good. If they do not find,
> +			 * it needs to be validated.
> +			 */
> +			return uprobes;
> +		}
> +	} while (read_seqcount_retry(&uprobes_seqcount, seq));
> +
> +	/* Really didn't find anything. */
> +	return NULL;
>  }
>  
>  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> @@ -702,7 +724,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>  	struct uprobe *u;
>  
>  	write_lock(&uprobes_treelock);
> +	write_seqcount_begin(&uprobes_seqcount);
>  	u = __insert_uprobe(uprobe);
> +	write_seqcount_end(&uprobes_seqcount);
>  	write_unlock(&uprobes_treelock);
>  
>  	return u;

Strictly speaking I suppose we should add rb_find_rcu() and
rc_find_add_rcu() that sprinkle some rcu_dereference_raw() and
rb_link_node_rcu() around. See the examples in __lt_find() and
__lt_insert().
Andrii Nakryiko July 2, 2024, 5:54 p.m. UTC | #4
On Tue, Jul 2, 2024 at 4:54 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
>
> +LKML
>
> On Tue, Jul 02, 2024 at 12:23:53PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 01, 2024 at 03:39:23PM -0700, Andrii Nakryiko wrote:
> > > This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> > > to per-CPU RW semaphore, which has better performance and scales better under
> > > contention and multiple parallel threads triggering lots of uprobes.
> >
> > Why not RCU + normal lock thing?
>
> Something like the *completely* untested below.
>
> ---
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 2c83ba776fc7..03b38f3f7be3 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -40,6 +40,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
>  #define no_uprobe_events()     RB_EMPTY_ROOT(&uprobes_tree)
>
>  static DEFINE_RWLOCK(uprobes_treelock);        /* serialize rbtree access */
> +static seqcount_rwlock_t uprobes_seqcount = SEQCNT_RWLOCK_ZERO(uprobes_seqcount, &uprobes_treelock);
>
>  #define UPROBES_HASH_SZ        13
>  /* serialize uprobe->pending_list */
> @@ -54,6 +55,7 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
>  struct uprobe {
>         struct rb_node          rb_node;        /* node in the rb tree */
>         refcount_t              ref;
> +       struct rcu_head         rcu;
>         struct rw_semaphore     register_rwsem;
>         struct rw_semaphore     consumer_rwsem;
>         struct list_head        pending_list;
> @@ -67,7 +69,7 @@ struct uprobe {
>          * The generic code assumes that it has two members of unknown type
>          * owned by the arch-specific code:
>          *
> -        *      insn -  copy_insn() saves the original instruction here for
> +        *      insn -  copy_insn() saves the original instruction here for
>          *              arch_uprobe_analyze_insn().
>          *
>          *      ixol -  potentially modified instruction to execute out of
> @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
>         return uprobe;
>  }
>
> +static void uprobe_free_rcu(struct rcu_head *rcu)
> +{
> +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> +       kfree(uprobe);
> +}
> +
>  static void put_uprobe(struct uprobe *uprobe)
>  {
>         if (refcount_dec_and_test(&uprobe->ref)) {
> @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)

right above this we have roughly this:

percpu_down_write(&uprobes_treelock);

/* refcount check */
rb_erase(&uprobe->rb_node, &uprobes_tree);

percpu_up_write(&uprobes_treelock);


This writer lock is necessary for modification of the RB tree. And I
was under impression that I shouldn't be doing
percpu_(down|up)_write() inside the normal
rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
clarify.

But actually what's wrong with RCU Tasks Trace flavor? I will
ultimately use it anyway to avoid uprobe taking unnecessary refcount
and to protect uprobe->consumers iteration and uc->handler() calls,
which could be sleepable, so would need rcu_read_lock_trace().

>                 mutex_lock(&delayed_uprobe_lock);
>                 delayed_uprobe_remove(uprobe, NULL);
>                 mutex_unlock(&delayed_uprobe_lock);
> -               kfree(uprobe);
> +
> +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
>         }
>  }
>
> @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
>  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
>  {
>         struct uprobe *uprobe;
> +       unsigned seq;
>
> -       read_lock(&uprobes_treelock);
> -       uprobe = __find_uprobe(inode, offset);
> -       read_unlock(&uprobes_treelock);
> +       guard(rcu)();
>
> -       return uprobe;
> +       do {
> +               seq = read_seqcount_begin(&uprobes_seqcount);
> +               uprobes = __find_uprobe(inode, offset);
> +               if (uprobes) {
> +                       /*
> +                        * Lockless RB-tree lookups are prone to false-negatives.
> +                        * If they find something, it's good. If they do not find,
> +                        * it needs to be validated.
> +                        */
> +                       return uprobes;
> +               }
> +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> +
> +       /* Really didn't find anything. */
> +       return NULL;
>  }

Honest question here, as I don't understand the tradeoffs well enough.
Is there a lot of benefit to switching to seqcount lock vs using
percpu RW semaphore (previously recommended by Ingo). The latter is a
nice drop-in replacement and seems to be very fast and scale well.
Right now we are bottlenecked on uprobe->register_rwsem (not
uprobes_treelock anymore), which is currently limiting the scalability
of uprobes and I'm going to work on that next once I'm done with this
series.

>
>  static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> @@ -702,7 +724,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>         struct uprobe *u;
>
>         write_lock(&uprobes_treelock);
> +       write_seqcount_begin(&uprobes_seqcount);
>         u = __insert_uprobe(uprobe);
> +       write_seqcount_end(&uprobes_seqcount);
>         write_unlock(&uprobes_treelock);
>
>         return u;
> @@ -936,7 +960,9 @@ static void delete_uprobe(struct uprobe *uprobe)
>                 return;
>
>         write_lock(&uprobes_treelock);
> +       write_seqcount_begin(&uprobes_seqcount);
>         rb_erase(&uprobe->rb_node, &uprobes_tree);
> +       write_seqcount_end(&uprobes_seqcount);
>         write_unlock(&uprobes_treelock);
>         RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
>         put_uprobe(uprobe);
Peter Zijlstra July 2, 2024, 7:18 p.m. UTC | #5
On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:

> > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> >         return uprobe;
> >  }
> >
> > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > +{
> > +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > +       kfree(uprobe);
> > +}
> > +
> >  static void put_uprobe(struct uprobe *uprobe)
> >  {
> >         if (refcount_dec_and_test(&uprobe->ref)) {
> > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
> 
> right above this we have roughly this:
> 
> percpu_down_write(&uprobes_treelock);
> 
> /* refcount check */
> rb_erase(&uprobe->rb_node, &uprobes_tree);
> 
> percpu_up_write(&uprobes_treelock);
> 
> 
> This writer lock is necessary for modification of the RB tree. And I
> was under impression that I shouldn't be doing
> percpu_(down|up)_write() inside the normal
> rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
> might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
> clarify.

preemptible RCU or SRCU would work.

> 
> But actually what's wrong with RCU Tasks Trace flavor? 

Paul, isn't this the RCU flavour you created to deal with
!rcu_is_watching()? The flavour that never should have been created in
favour of just cleaning up the mess instead of making more.

> I will
> ultimately use it anyway to avoid uprobe taking unnecessary refcount
> and to protect uprobe->consumers iteration and uc->handler() calls,
> which could be sleepable, so would need rcu_read_lock_trace().

I don't think you need trace-rcu for that. SRCU would do nicely I think.

> >                 mutex_lock(&delayed_uprobe_lock);
> >                 delayed_uprobe_remove(uprobe, NULL);
> >                 mutex_unlock(&delayed_uprobe_lock);
> > -               kfree(uprobe);
> > +
> > +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
> >         }
> >  }
> >
> > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> >  {
> >         struct uprobe *uprobe;
> > +       unsigned seq;
> >
> > -       read_lock(&uprobes_treelock);
> > -       uprobe = __find_uprobe(inode, offset);
> > -       read_unlock(&uprobes_treelock);
> > +       guard(rcu)();
> >
> > -       return uprobe;
> > +       do {
> > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > +               uprobes = __find_uprobe(inode, offset);
> > +               if (uprobes) {
> > +                       /*
> > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > +                        * If they find something, it's good. If they do not find,
> > +                        * it needs to be validated.
> > +                        */
> > +                       return uprobes;
> > +               }
> > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > +
> > +       /* Really didn't find anything. */
> > +       return NULL;
> >  }
> 
> Honest question here, as I don't understand the tradeoffs well enough.
> Is there a lot of benefit to switching to seqcount lock vs using
> percpu RW semaphore (previously recommended by Ingo). The latter is a
> nice drop-in replacement and seems to be very fast and scale well.

As you noted, that percpu-rwsem write side is quite insane. And you're
creating this batch complexity to mitigate that.

The patches you propose are quite complex, this alternative not so much.

> Right now we are bottlenecked on uprobe->register_rwsem (not
> uprobes_treelock anymore), which is currently limiting the scalability
> of uprobes and I'm going to work on that next once I'm done with this
> series.

Right, but it looks fairly simple to replace that rwsem with a mutex and
srcu.
Paul E. McKenney July 2, 2024, 11:56 p.m. UTC | #6
On Tue, Jul 02, 2024 at 09:18:57PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
> 
> > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > >         return uprobe;
> > >  }
> > >
> > > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > > +{
> > > +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > > +       kfree(uprobe);
> > > +}
> > > +
> > >  static void put_uprobe(struct uprobe *uprobe)
> > >  {
> > >         if (refcount_dec_and_test(&uprobe->ref)) {
> > > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
> > 
> > right above this we have roughly this:
> > 
> > percpu_down_write(&uprobes_treelock);
> > 
> > /* refcount check */
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > 
> > percpu_up_write(&uprobes_treelock);
> > 
> > 
> > This writer lock is necessary for modification of the RB tree. And I
> > was under impression that I shouldn't be doing
> > percpu_(down|up)_write() inside the normal
> > rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
> > might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
> > clarify.
> 
> preemptible RCU or SRCU would work.

I agree that SRCU would work from a functional viewpoint.  No so for
preemptible RCU, which permits preemption (and on -rt, blocking for
spinlocks), it does not permit full-up blocking, and for good reason.

> > But actually what's wrong with RCU Tasks Trace flavor? 
> 
> Paul, isn't this the RCU flavour you created to deal with
> !rcu_is_watching()? The flavour that never should have been created in
> favour of just cleaning up the mess instead of making more.

My guess is that you are instead thinking of RCU Tasks Rude, which can
be eliminated once all architectures get their entry/exit/deep-idle
functions either inlined or marked noinstr.

> > I will
> > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > and to protect uprobe->consumers iteration and uc->handler() calls,
> > which could be sleepable, so would need rcu_read_lock_trace().
> 
> I don't think you need trace-rcu for that. SRCU would do nicely I think.

From a functional viewpoint, agreed.

However, in the past, the memory-barrier and array-indexing overhead
of SRCU has made it a no-go for lightweight probes into fastpath code.
And these cases were what motivated RCU Tasks Trace (as opposed to RCU
Tasks Rude).

The other rule for RCU Tasks Trace is that although readers are permitted
to block, this blocking can be for no longer than a major page fault.
If you need longer-term blocking, then you should instead use SRCU.

							Thanx, Paul

> > >                 mutex_lock(&delayed_uprobe_lock);
> > >                 delayed_uprobe_remove(uprobe, NULL);
> > >                 mutex_unlock(&delayed_uprobe_lock);
> > > -               kfree(uprobe);
> > > +
> > > +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
> > >         }
> > >  }
> > >
> > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >         struct uprobe *uprobe;
> > > +       unsigned seq;
> > >
> > > -       read_lock(&uprobes_treelock);
> > > -       uprobe = __find_uprobe(inode, offset);
> > > -       read_unlock(&uprobes_treelock);
> > > +       guard(rcu)();
> > >
> > > -       return uprobe;
> > > +       do {
> > > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > > +               uprobes = __find_uprobe(inode, offset);
> > > +               if (uprobes) {
> > > +                       /*
> > > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > > +                        * If they find something, it's good. If they do not find,
> > > +                        * it needs to be validated.
> > > +                        */
> > > +                       return uprobes;
> > > +               }
> > > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > +
> > > +       /* Really didn't find anything. */
> > > +       return NULL;
> > >  }
> > 
> > Honest question here, as I don't understand the tradeoffs well enough.
> > Is there a lot of benefit to switching to seqcount lock vs using
> > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > nice drop-in replacement and seems to be very fast and scale well.
> 
> As you noted, that percpu-rwsem write side is quite insane. And you're
> creating this batch complexity to mitigate that.
> 
> The patches you propose are quite complex, this alternative not so much.
> 
> > Right now we are bottlenecked on uprobe->register_rwsem (not
> > uprobes_treelock anymore), which is currently limiting the scalability
> > of uprobes and I'm going to work on that next once I'm done with this
> > series.
> 
> Right, but it looks fairly simple to replace that rwsem with a mutex and
> srcu.
Andrii Nakryiko July 3, 2024, 4:47 a.m. UTC | #7
On Tue, Jul 2, 2024 at 12:19 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
>
> > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > >         return uprobe;
> > >  }
> > >

[...]

> > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > >  {
> > >         struct uprobe *uprobe;
> > > +       unsigned seq;
> > >
> > > -       read_lock(&uprobes_treelock);
> > > -       uprobe = __find_uprobe(inode, offset);
> > > -       read_unlock(&uprobes_treelock);
> > > +       guard(rcu)();
> > >
> > > -       return uprobe;
> > > +       do {
> > > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > > +               uprobes = __find_uprobe(inode, offset);
> > > +               if (uprobes) {
> > > +                       /*
> > > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > > +                        * If they find something, it's good. If they do not find,
> > > +                        * it needs to be validated.
> > > +                        */
> > > +                       return uprobes;
> > > +               }
> > > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > +
> > > +       /* Really didn't find anything. */
> > > +       return NULL;
> > >  }
> >
> > Honest question here, as I don't understand the tradeoffs well enough.
> > Is there a lot of benefit to switching to seqcount lock vs using
> > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > nice drop-in replacement and seems to be very fast and scale well.
>
> As you noted, that percpu-rwsem write side is quite insane. And you're
> creating this batch complexity to mitigate that.


Note that batch API is needed regardless of percpu RW semaphore or
not. As I mentioned, once uprobes_treelock is mitigated one way or the
other, the next one is uprobe->register_rwsem. For scalability, we
need to get rid of it and preferably not add any locking at all. So
tentatively I'd like to have lockless RCU-protected iteration over
uprobe->consumers list and call consumer->handler(). This means that
on uprobes_unregister we'd need synchronize_rcu (for whatever RCU
flavor we end up using), to ensure that we don't free uprobe_consumer
memory from under handle_swbp() while it is actually triggering
consumers.

So, without batched unregistration we'll be back to the same problem
I'm solving here: doing synchronize_rcu() for each attached uprobe one
by one is prohibitively slow. We went through this exercise with
ftrace/kprobes already and fixed it with batched APIs. Doing that for
uprobes seems unavoidable as well.

>
> The patches you propose are quite complex, this alternative not so much.

I agree that this custom refcounting is not trivial, but at least it's
pretty well contained within two low-level helpers which are all used
within this single .c file.

On the other hand, it actually gives us a) speed and better
scalability (I showed comparisons with refcount_inc_not_zero approach
earlier, I believe) and b) it actually simplifies logic during
registration (which is even more important aspect with batched API),
where we don't need to handle uprobe suddenly going away after we
already looked it up.

I believe overall it's an improvement worth doing.

>
> > Right now we are bottlenecked on uprobe->register_rwsem (not
> > uprobes_treelock anymore), which is currently limiting the scalability
> > of uprobes and I'm going to work on that next once I'm done with this
> > series.
>
> Right, but it looks fairly simple to replace that rwsem with a mutex and
> srcu.

srcu vs RCU Tasks Trace aside (which Paul addressed), see above about
the need for batched API and synchronize_rcu().
Andrii Nakryiko July 3, 2024, 4:54 a.m. UTC | #8
On Tue, Jul 2, 2024 at 4:56 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Tue, Jul 02, 2024 at 09:18:57PM +0200, Peter Zijlstra wrote:
> > On Tue, Jul 02, 2024 at 10:54:51AM -0700, Andrii Nakryiko wrote:
> >
> > > > @@ -593,6 +595,12 @@ static struct uprobe *get_uprobe(struct uprobe *uprobe)
> > > >         return uprobe;
> > > >  }
> > > >
> > > > +static void uprobe_free_rcu(struct rcu_head *rcu)
> > > > +{
> > > > +       struct uprobe *uprobe = container_of(rcu, struct uprobe, rcu);
> > > > +       kfree(uprobe);
> > > > +}
> > > > +
> > > >  static void put_uprobe(struct uprobe *uprobe)
> > > >  {
> > > >         if (refcount_dec_and_test(&uprobe->ref)) {
> > > > @@ -604,7 +612,8 @@ static void put_uprobe(struct uprobe *uprobe)
> > >
> > > right above this we have roughly this:
> > >
> > > percpu_down_write(&uprobes_treelock);
> > >
> > > /* refcount check */
> > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > >
> > > percpu_up_write(&uprobes_treelock);
> > >
> > >
> > > This writer lock is necessary for modification of the RB tree. And I
> > > was under impression that I shouldn't be doing
> > > percpu_(down|up)_write() inside the normal
> > > rcu_read_lock()/rcu_read_unlock() region (percpu_down_write has
> > > might_sleep() in it). But maybe I'm wrong, hopefully Paul can help to
> > > clarify.
> >
> > preemptible RCU or SRCU would work.
>
> I agree that SRCU would work from a functional viewpoint.  No so for
> preemptible RCU, which permits preemption (and on -rt, blocking for
> spinlocks), it does not permit full-up blocking, and for good reason.
>
> > > But actually what's wrong with RCU Tasks Trace flavor?
> >
> > Paul, isn't this the RCU flavour you created to deal with
> > !rcu_is_watching()? The flavour that never should have been created in
> > favour of just cleaning up the mess instead of making more.
>
> My guess is that you are instead thinking of RCU Tasks Rude, which can
> be eliminated once all architectures get their entry/exit/deep-idle
> functions either inlined or marked noinstr.
>
> > > I will
> > > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > > and to protect uprobe->consumers iteration and uc->handler() calls,
> > > which could be sleepable, so would need rcu_read_lock_trace().
> >
> > I don't think you need trace-rcu for that. SRCU would do nicely I think.
>
> From a functional viewpoint, agreed.
>
> However, in the past, the memory-barrier and array-indexing overhead
> of SRCU has made it a no-go for lightweight probes into fastpath code.
> And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> Tasks Rude).

Yep, and this is a similar case here. I've actually implemented
SRCU-based protection and benchmarked it (all other things being the
same). I see 5% slowdown for the fastest uprobe kind (entry uprobe on
nop) for the single-threaded use case. We go down from 3.15 millions/s
triggerings to slightly below 3 millions/s. With more threads the
difference increases a bit, though numbers vary a bit from run to run,
so I don't want to put out the exact number. But I see that for
SRCU-based implementation total aggregated peak achievable throughput
is about 3.5-3.6 mln/s vs this implementation reaching 4-4.1 mln/s.
Again, some of that could be variability, but I did run multiple
rounds and that's the trend I'm seeing.

>
> The other rule for RCU Tasks Trace is that although readers are permitted
> to block, this blocking can be for no longer than a major page fault.
> If you need longer-term blocking, then you should instead use SRCU.
>

And this is the case here. Right now rcu_read_lock_trace() is
protecting uprobes_treelock, which is only taken for the duration of
RB tree lookup/insert/delete. In my subsequent changes to eliminate
register_rwsem we might be executing uprobe_consumer under this RCU
lock, but those also should be only sleeping for page faults.

On the other hand, hot path (reader side) is quite hot with
millions/second executions and should add as little overhead as
possible (which is why I'm seeing SRCU-based implementation being
slower, as I mentioned above).

>                                                         Thanx, Paul
>
> > > >                 mutex_lock(&delayed_uprobe_lock);
> > > >                 delayed_uprobe_remove(uprobe, NULL);
> > > >                 mutex_unlock(&delayed_uprobe_lock);
> > > > -               kfree(uprobe);
> > > > +
> > > > +               call_rcu(&uprobe->rcu, uprobe_free_rcu);
> > > >         }
> > > >  }
> > > >
> > > > @@ -668,12 +677,25 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
> > > >  static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > >  {
> > > >         struct uprobe *uprobe;
> > > > +       unsigned seq;
> > > >
> > > > -       read_lock(&uprobes_treelock);
> > > > -       uprobe = __find_uprobe(inode, offset);
> > > > -       read_unlock(&uprobes_treelock);
> > > > +       guard(rcu)();
> > > >
> > > > -       return uprobe;
> > > > +       do {
> > > > +               seq = read_seqcount_begin(&uprobes_seqcount);
> > > > +               uprobes = __find_uprobe(inode, offset);
> > > > +               if (uprobes) {
> > > > +                       /*
> > > > +                        * Lockless RB-tree lookups are prone to false-negatives.
> > > > +                        * If they find something, it's good. If they do not find,
> > > > +                        * it needs to be validated.
> > > > +                        */
> > > > +                       return uprobes;
> > > > +               }
> > > > +       } while (read_seqcount_retry(&uprobes_seqcount, seq));
> > > > +
> > > > +       /* Really didn't find anything. */
> > > > +       return NULL;
> > > >  }
> > >
> > > Honest question here, as I don't understand the tradeoffs well enough.
> > > Is there a lot of benefit to switching to seqcount lock vs using
> > > percpu RW semaphore (previously recommended by Ingo). The latter is a
> > > nice drop-in replacement and seems to be very fast and scale well.
> >
> > As you noted, that percpu-rwsem write side is quite insane. And you're
> > creating this batch complexity to mitigate that.
> >
> > The patches you propose are quite complex, this alternative not so much.
> >
> > > Right now we are bottlenecked on uprobe->register_rwsem (not
> > > uprobes_treelock anymore), which is currently limiting the scalability
> > > of uprobes and I'm going to work on that next once I'm done with this
> > > series.
> >
> > Right, but it looks fairly simple to replace that rwsem with a mutex and
> > srcu.
Peter Zijlstra July 3, 2024, 7:50 a.m. UTC | #9
On Tue, Jul 02, 2024 at 04:56:53PM -0700, Paul E. McKenney wrote:

> > Paul, isn't this the RCU flavour you created to deal with
> > !rcu_is_watching()? The flavour that never should have been created in
> > favour of just cleaning up the mess instead of making more.
> 
> My guess is that you are instead thinking of RCU Tasks Rude, which can
> be eliminated once all architectures get their entry/exit/deep-idle
> functions either inlined or marked noinstr.

Would it make sense to disable it for those architectures that have
already done this work?

> > > I will
> > > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > > and to protect uprobe->consumers iteration and uc->handler() calls,
> > > which could be sleepable, so would need rcu_read_lock_trace().
> > 
> > I don't think you need trace-rcu for that. SRCU would do nicely I think.
> 
> From a functional viewpoint, agreed.
> 
> However, in the past, the memory-barrier and array-indexing overhead
> of SRCU has made it a no-go for lightweight probes into fastpath code.
> And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> Tasks Rude).

I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
have to go read up on rcu/tasks.* and see what's what.

> The other rule for RCU Tasks Trace is that although readers are permitted
> to block, this blocking can be for no longer than a major page fault.
> If you need longer-term blocking, then you should instead use SRCU.

I think this would render it unsuitable for uprobes. The whole point of
having a sleepable handler is to be able to take faults.
Peter Zijlstra July 3, 2024, 8:07 a.m. UTC | #10
On Tue, Jul 02, 2024 at 09:47:41PM -0700, Andrii Nakryiko wrote:

> > As you noted, that percpu-rwsem write side is quite insane. And you're
> > creating this batch complexity to mitigate that.
> 
> 
> Note that batch API is needed regardless of percpu RW semaphore or
> not. As I mentioned, once uprobes_treelock is mitigated one way or the
> other, the next one is uprobe->register_rwsem. For scalability, we
> need to get rid of it and preferably not add any locking at all. So
> tentatively I'd like to have lockless RCU-protected iteration over
> uprobe->consumers list and call consumer->handler(). This means that
> on uprobes_unregister we'd need synchronize_rcu (for whatever RCU
> flavor we end up using), to ensure that we don't free uprobe_consumer
> memory from under handle_swbp() while it is actually triggering
> consumers.
> 
> So, without batched unregistration we'll be back to the same problem
> I'm solving here: doing synchronize_rcu() for each attached uprobe one
> by one is prohibitively slow. We went through this exercise with
> ftrace/kprobes already and fixed it with batched APIs. Doing that for
> uprobes seems unavoidable as well.

I'm not immediately seeing how you need that terrible refcount stuff for
the batching though. If all you need is group a few unregisters together
in order to share a sync_rcu() that seems way overkill.

You seem to have muddled the order of things, which makes the actual
reason for doing things utterly unclear.
Paul E. McKenney July 3, 2024, 2:08 p.m. UTC | #11
On Wed, Jul 03, 2024 at 09:50:57AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 02, 2024 at 04:56:53PM -0700, Paul E. McKenney wrote:
> 
> > > Paul, isn't this the RCU flavour you created to deal with
> > > !rcu_is_watching()? The flavour that never should have been created in
> > > favour of just cleaning up the mess instead of making more.
> > 
> > My guess is that you are instead thinking of RCU Tasks Rude, which can
> > be eliminated once all architectures get their entry/exit/deep-idle
> > functions either inlined or marked noinstr.
> 
> Would it make sense to disable it for those architectures that have
> already done this work?

It might well.  Any architectures other than x86 at this point?

But this is still used in common code, so let's see...  In that case,
synchronize_rcu_tasks_rude() becomes a no-op, call_rcu_tasks_rude() can be
a wrapper around something like queue_work(), and rcu_barrier_tasks_rude()
can be a wrapper around something like flush_work().

Except that call_rcu_tasks_rude() and rcu_barrier_tasks_rude() are not
actually used outside of testing, so maybe they can be dropped globally.

Let me see what happens when I do this:

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7d18b90356fd..5c8492a054f5 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -936,8 +936,8 @@ static struct rcu_torture_ops tasks_rude_ops = {
 	.deferred_free	= rcu_tasks_rude_torture_deferred_free,
 	.sync		= synchronize_rcu_tasks_rude,
 	.exp_sync	= synchronize_rcu_tasks_rude,
-	.call		= call_rcu_tasks_rude,
-	.cb_barrier	= rcu_barrier_tasks_rude,
+	// .call		= call_rcu_tasks_rude,
+	// .cb_barrier	= rcu_barrier_tasks_rude,
 	.gp_kthread_dbg	= show_rcu_tasks_rude_gp_kthread,
 	.get_gp_data	= rcu_tasks_rude_get_gp_data,
 	.cbflood_max	= 50000,

It should be at least mildly amusing...

> > > > I will
> > > > ultimately use it anyway to avoid uprobe taking unnecessary refcount
> > > > and to protect uprobe->consumers iteration and uc->handler() calls,
> > > > which could be sleepable, so would need rcu_read_lock_trace().
> > > 
> > > I don't think you need trace-rcu for that. SRCU would do nicely I think.
> > 
> > From a functional viewpoint, agreed.
> > 
> > However, in the past, the memory-barrier and array-indexing overhead
> > of SRCU has made it a no-go for lightweight probes into fastpath code.
> > And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> > Tasks Rude).
> 
> I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
> have to go read up on rcu/tasks.* and see what's what.

Well, you are in luck.  I am well along with the task of putting together
the 2024 LWN RCU API article, which will include RCU Tasks Trace.  ;-)

And I do sympathize with discomfort with lots of RCU flavors.  After all,
hhad you told me 30 years ago that there would be more than one flavor,
I would have been quite surprised.  Of course, I would also have been
surprised by a great many other things (just how many flavors of locking
and reference counting???), so maybe having three flavors (four if we
cannot drop RCU Tasks RUDE) is not so bad.

Oh, and no one is yet using srcu_down_read() and srcu_up_read(), so
if they stay unused for another year or so...

> > The other rule for RCU Tasks Trace is that although readers are permitted
> > to block, this blocking can be for no longer than a major page fault.
> > If you need longer-term blocking, then you should instead use SRCU.
> 
> I think this would render it unsuitable for uprobes. The whole point of
> having a sleepable handler is to be able to take faults.

???

I said "longer than a major page fault", so it is OK to take a fault,
just not one that potentially blocks forever.  (And those faulting onto
things like NFS filesystems have enough other problems that this would
be in the noise for them, right?)

And yes, RCU Tasks Trace is specialized.  I would expect that unexpected
new uses would get serious scrutiny by those currently using it.

							Thanx, Paul
Andrii Nakryiko July 3, 2024, 8:55 p.m. UTC | #12
On Wed, Jul 3, 2024 at 1:07 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Jul 02, 2024 at 09:47:41PM -0700, Andrii Nakryiko wrote:
>
> > > As you noted, that percpu-rwsem write side is quite insane. And you're
> > > creating this batch complexity to mitigate that.
> >
> >
> > Note that batch API is needed regardless of percpu RW semaphore or
> > not. As I mentioned, once uprobes_treelock is mitigated one way or the
> > other, the next one is uprobe->register_rwsem. For scalability, we
> > need to get rid of it and preferably not add any locking at all. So
> > tentatively I'd like to have lockless RCU-protected iteration over
> > uprobe->consumers list and call consumer->handler(). This means that
> > on uprobes_unregister we'd need synchronize_rcu (for whatever RCU
> > flavor we end up using), to ensure that we don't free uprobe_consumer
> > memory from under handle_swbp() while it is actually triggering
> > consumers.
> >
> > So, without batched unregistration we'll be back to the same problem
> > I'm solving here: doing synchronize_rcu() for each attached uprobe one
> > by one is prohibitively slow. We went through this exercise with
> > ftrace/kprobes already and fixed it with batched APIs. Doing that for
> > uprobes seems unavoidable as well.
>
> I'm not immediately seeing how you need that terrible refcount stuff for

Which part is terrible, please be more specific. I can switch to
refcount_inc_not_zero() and leave performance improvement on the
table, but why is that a good idea?

> the batching though. If all you need is group a few unregisters together
> in order to share a sync_rcu() that seems way overkill.
>
> You seem to have muddled the order of things, which makes the actual
> reason for doing things utterly unclear.

See -EGAIN handling in uprobe_register() code in current upstream
kernel. We manage to allocate and insert (or update existing) uprobe
in uprobes_tree. And then when we try to register we can post factum
detect that uprobe was removed from RB tree from under us. And we have
to go on a retry, allocating/inserting/updating it again.

This is quite problematic for batched API, in which I split the whole
attachment into few independent phase:

  - preallocate uprobe instances (for all consumers/uprobes)
  - insert them or reuse pre-existing ones (again, for all consumers
in one batch, protected by single writer lock on uprobes_treelock);
  - then register/apply for each VMA (you get it, for all consumers in one go).

Having this retry for some of uprobes because of this race is hugely
problematic, so I wanted to make it cleaner and simpler: once you
manage to insert/reuse uprobe, it's not going away from under me.
Which is why the change to refcounting schema.

And I think it's a major improvement. We can argue about
refcount_inc_not_zero vs this custom refcounting schema, but I think
the change should be made.

Now, imagine I also did all the seqcount and RCU stuff across entire
uprobe functionality. Wouldn't that be mind bending a little bit to
wrap your head around this?
Andrii Nakryiko July 3, 2024, 9:33 p.m. UTC | #13
On Mon, Jul 1, 2024 at 3:39 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> This patch set, ultimately, switches global uprobes_treelock from RW spinlock
> to per-CPU RW semaphore, which has better performance and scales better under
> contention and multiple parallel threads triggering lots of uprobes.
>
> To make this work well with attaching multiple uprobes (through BPF
> multi-uprobe), we need to add batched versions of uprobe register/unregister
> APIs. This is what most of the patch set is actually doing. The actual switch
> to per-CPU RW semaphore is trivial after that and is done in the very last
> patch #12. See commit message with some comparison numbers.
>

Peter,

I think I've addressed all the questions so far, but I wanted to take
a moment and bring all the discussions into a single palace, summarize
what I think are the main points of contention and hopefully make some
progress, or at least get us to a bit more constructive discussion
where *both sides* provide arguments. Right now there is a lot of "you
are doing X, but why don't you just do Y" with no argument for a) why
X is bad/wrong/inferior and b) why Y is better (and not just
equivalent or, even worse, inferior).

I trust you have the best intentions in mind for this piece of kernel
infrastructure, so do I, so let's try to find a path forward.

1. Strategically, uprobes/uretprobes have to be improved. Customers do
complain more and more that "uprobes are slow", justifiably so. Both
single-threaded performance matters, but also, critically, uprobes
scalability. I.e., if the kernel can handle N uprobe per second on a
single uncontended CPU, then triggering uprobes across M CPUs should,
ideally and roughly, give us about N * M total throughput.

This doesn't seem controversial, but I wanted to make it clear that
this is the end goal of my work. And no, this patch set alone doesn't,
yet, get us there. But it's a necessary step, IMO. Jiri Olsa took
single-threaded performance and is improving it with sys_uretprobe and
soon sys_uprobe, I'm looking into scalability and other smaller
single-threaded wins, where possible.

2. More tactically, RCU protection seems like the best way forward. We
got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
clarified that RCU Tasks Trace has nothing to do with Tasks Rude
flavor (whatever that is, I have no idea).

Now, RCU Tasks Trace were specifically designed for least overhead
hotpath (reader side) performance, at the expense of slowing down much
rarer writers. My microbenchmarking does show at least 5% difference.
Both flavors can handle sleepable uprobes waiting for page faults.
Tasks Trace flavor is already used for tracing in the BPF realm,
including for sleepable uprobes and works well. It's not going away.

Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
haven't seen a single argument why. Please provide that, or let's
stick to RCU Tasks Trace, because uprobe's use case is an ideal case
of what Tasks Trace flavor was designed for.

3. Regardless of RCU flavor, due to RCU protection, we have to add
batched register/unregister APIs, so we can amortize sync_rcu cost
during deregistration. Can we please agree on that as well? This is
the main goal of this patch set and I'd like to land it before working
further on changing and improving the rest of the locking schema.

I won't be happy about it, but just to move things forward, I can drop
a) custom refcounting and/or b) percpu RW semaphore. Both are
beneficial but not essential for batched APIs work. But if you force
me to do that, please state clearly your reasons/arguments. No one had
yet pointed out why refcounting is broken and why percpu RW semaphore
is bad. On the contrary, Ingo Molnar did suggest percpu RW semaphore
in the first place (see [0]), but we postponed it due to the lack of
batched APIs, and promised to do this work. Here I am, doing the
promised work. Not purely because of percpu RW semaphore, but
benefiting from it just as well.

  [0] https://lore.kernel.org/linux-trace-kernel/Zf+d9twfyIDosINf@gmail.com/

4. Another tactical thing, but an important one. Refcounting schema
for uprobes. I've replied already, but I think refcounting is
unavoidable for uretprobes, and current refcounting schema is
problematic for batched APIs due to race between finding uprobe and
there still being a possibility we'd need to undo all that and retry
again.

I think the main thing is to agree to change refcounting to avoid this
race, allowing for simpler batched registration. Hopefully we can
agree on that.

But also, refcount_inc_not_zero() which is another limiting factor for
scalability (see above about the end goal of scalability) vs
atomic64_add()-based epoch+refcount approach I took, which is
noticeably better on x86-64, and I don't think hurts any other
architecture, to say the least. I think the latter could be
generalized as an alternative flavor of refcount_t, but I'd prefer to
land it in uprobes in current shape, and if we think it's a good idea
to generalize, we can always do that refactoring once things stabilize
a bit.

You seem to have problems with the refcounting implementation I did
(besides overflow detection, which I'll address in the next revision,
so not a problem). My arguments are a) performance and b) it's well
contained within get/put helpers and doesn't leak outside of them *at
all*, while providing a nice always successful get_uprobe() primitive.

Can I please hear the arguments for not doing it, besides "Everyone is
using refcount_inc_not_zero", which isn't much of a reason (we'd never
do anything novel in the kernel if that was a good enough reason to
not do something new).

Again, thanks for engagement, I do appreciate it. But let's try to
move this forward. Thanks!
Steven Rostedt July 3, 2024, 9:57 p.m. UTC | #14
On Wed, 3 Jul 2024 09:50:57 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > However, in the past, the memory-barrier and array-indexing overhead
> > of SRCU has made it a no-go for lightweight probes into fastpath code.
> > And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> > Tasks Rude).  
> 
> I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
> have to go read up on rcu/tasks.* and see what's what.

This RCU flavor is the one to handle trampolines. If the trampoline
never voluntarily schedules, then the quiescent state is a voluntary
schedule. The issue with trampolines is that if something was preempted
as it was jumping to a trampoline, there's no way to know when it is
safe to free that trampoline, as some preempted task's next instruction
is on that trampoline.

Any trampoline that does not voluntary schedule can use RCU task
synchronization. As it will wait till all tasks have voluntarily
scheduled or have entered user space (IIRC, Paul can correct me if I'm
wrong).

Now, if a trampoline does schedule, it would need to incorporate some
ref counting on the trampoline to handle the scheduling, but could
still use RCU task synchronization up to the point of the ref count.

And yes, the rude flavor was to handle the !rcu_is_watching case, and
can now be removed.

-- Steve
Paul E. McKenney July 3, 2024, 10:07 p.m. UTC | #15
On Wed, Jul 03, 2024 at 05:57:54PM -0400, Steven Rostedt wrote:
> On Wed, 3 Jul 2024 09:50:57 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > However, in the past, the memory-barrier and array-indexing overhead
> > > of SRCU has made it a no-go for lightweight probes into fastpath code.
> > > And these cases were what motivated RCU Tasks Trace (as opposed to RCU
> > > Tasks Rude).  
> > 
> > I'm thinking we're growing too many RCU flavours again :/ I suppose I'll
> > have to go read up on rcu/tasks.* and see what's what.
> 
> This RCU flavor is the one to handle trampolines. If the trampoline
> never voluntarily schedules, then the quiescent state is a voluntary
> schedule. The issue with trampolines is that if something was preempted
> as it was jumping to a trampoline, there's no way to know when it is
> safe to free that trampoline, as some preempted task's next instruction
> is on that trampoline.
> 
> Any trampoline that does not voluntary schedule can use RCU task
> synchronization. As it will wait till all tasks have voluntarily
> scheduled or have entered user space (IIRC, Paul can correct me if I'm
> wrong).

Agreed!

> Now, if a trampoline does schedule, it would need to incorporate some
> ref counting on the trampoline to handle the scheduling, but could
> still use RCU task synchronization up to the point of the ref count.

Or, if the schedule is due at most to a page fault, it can use RCU
Tasks Trace.

> And yes, the rude flavor was to handle the !rcu_is_watching case, and
> can now be removed.

From x86, agreed.

But have the other architectures done all the inlining and addition of
noistr required to permit this?  (Maybe they have, I honestly do not know.
But last I checked a few months ago, ARMv8 was not ready yet.)

							Thanx, Paul
Peter Zijlstra July 4, 2024, 8:39 a.m. UTC | #16
On Wed, Jul 03, 2024 at 07:08:21AM -0700, Paul E. McKenney wrote:
> On Wed, Jul 03, 2024 at 09:50:57AM +0200, Peter Zijlstra wrote:

> > Would it make sense to disable it for those architectures that have
> > already done this work?
> 
> It might well.  Any architectures other than x86 at this point?

Per 408b961146be ("tracing: WARN on rcuidle")
and git grep "select.*ARCH_WANTS_NO_INSTR"
arch/arm64/Kconfig:     select ARCH_WANTS_NO_INSTR
arch/loongarch/Kconfig: select ARCH_WANTS_NO_INSTR
arch/riscv/Kconfig:     select ARCH_WANTS_NO_INSTR
arch/s390/Kconfig:      select ARCH_WANTS_NO_INSTR
arch/x86/Kconfig:       select ARCH_WANTS_NO_INSTR

I'm thinking you can simply use that same condition here?
Peter Zijlstra July 4, 2024, 9:15 a.m. UTC | #17
On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:

> 2. More tactically, RCU protection seems like the best way forward. We
> got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> flavor (whatever that is, I have no idea).
> 
> Now, RCU Tasks Trace were specifically designed for least overhead
> hotpath (reader side) performance, at the expense of slowing down much
> rarer writers. My microbenchmarking does show at least 5% difference.
> Both flavors can handle sleepable uprobes waiting for page faults.
> Tasks Trace flavor is already used for tracing in the BPF realm,
> including for sleepable uprobes and works well. It's not going away.

I need to look into this new RCU flavour and why it exists -- for
example, why can't SRCU be improved to gain the same benefits. This is
what we've always done, improve SRCU.

> Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> haven't seen a single argument why. Please provide that, or let's
> stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> of what Tasks Trace flavor was designed for.

Because I actually know SRCU, and because it provides a local scope.
It isolates the unregister waiters from other random users. I'm not
going to use this funky new flavour until I truly understand it.

Also, we actually want two scopes here, there is no reason for the
consumer unreg to wait for the retprobe stuff.

> 3. Regardless of RCU flavor, due to RCU protection, we have to add
> batched register/unregister APIs, so we can amortize sync_rcu cost
> during deregistration. Can we please agree on that as well? This is
> the main goal of this patch set and I'd like to land it before working
> further on changing and improving the rest of the locking schema.

See my patch here:

  https://lkml.kernel.org/r/20240704084524.GC28838@noisy.programming.kicks-ass.net

I don't think it needs to be more complicated than that.

> I won't be happy about it, but just to move things forward, I can drop
> a) custom refcounting and/or b) percpu RW semaphore. Both are
> beneficial but not essential for batched APIs work. But if you force
> me to do that, please state clearly your reasons/arguments.

The reason I'm pushing RCU here is because AFAICT uprobes doesn't
actually need the stronger serialisation that rwlock (any flavour)
provide. It is a prime candidate for RCU, and I think you'll find plenty
papers / articles (by both Paul and others) that show that RCU scales
better.

As a bonus, you avoid that horrific write side cost that per-cpu rwsem
has.

The reason I'm not keen on that refcount thing was initially because I
did not understand the justification for it, but worse, once I did read
your justification, your very own numbers convinced me that the refcount
is fundamentally problematic, in any way shape or form.

> No one had yet pointed out why refcounting is broken 

Your very own numbers point out that refcounting is a problem here. 

> and why percpu RW semaphore is bad. 

Literature and history show us that RCU -- where possible -- is
always better than any reader-writer locking scheme.

> 4. Another tactical thing, but an important one. Refcounting schema
> for uprobes. I've replied already, but I think refcounting is
> unavoidable for uretprobes,

I think we can fix that, I replied here:

  https://lkml.kernel.org/r/20240704083152.GQ11386@noisy.programming.kicks-ass.net

> and current refcounting schema is
> problematic for batched APIs due to race between finding uprobe and
> there still being a possibility we'd need to undo all that and retry
> again.

Right, I've not looked too deeply at that, because I've not seen a
reason to actually change that. I can go think about it if you want, but
meh.
Steven Rostedt July 4, 2024, 1:56 p.m. UTC | #18
On Thu, 4 Jul 2024 11:15:59 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Now, RCU Tasks Trace were specifically designed for least overhead
> > hotpath (reader side) performance, at the expense of slowing down much
> > rarer writers. My microbenchmarking does show at least 5% difference.
> > Both flavors can handle sleepable uprobes waiting for page faults.
> > Tasks Trace flavor is already used for tracing in the BPF realm,
> > including for sleepable uprobes and works well. It's not going away.  
> 
> I need to look into this new RCU flavour and why it exists -- for
> example, why can't SRCU be improved to gain the same benefits. This is
> what we've always done, improve SRCU.

I don't know about this use case, but for the trampoline use case SRCU
doesn't work as it requires calling a srcu_read_lock() which isn't
possible when you need to take that lock from all function calls just
before it jumps to the ftrace trampoline. That is, it needs to be taken
before "call fentry".

I'm just stating this to provide the reason why we needed that flavor
of RCU.

-- Steve
Paul E. McKenney July 4, 2024, 3:13 p.m. UTC | #19
On Thu, Jul 04, 2024 at 10:39:35AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2024 at 07:08:21AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 03, 2024 at 09:50:57AM +0200, Peter Zijlstra wrote:
> 
> > > Would it make sense to disable it for those architectures that have
> > > already done this work?
> > 
> > It might well.  Any architectures other than x86 at this point?
> 
> Per 408b961146be ("tracing: WARN on rcuidle")
> and git grep "select.*ARCH_WANTS_NO_INSTR"
> arch/arm64/Kconfig:     select ARCH_WANTS_NO_INSTR
> arch/loongarch/Kconfig: select ARCH_WANTS_NO_INSTR
> arch/riscv/Kconfig:     select ARCH_WANTS_NO_INSTR
> arch/s390/Kconfig:      select ARCH_WANTS_NO_INSTR
> arch/x86/Kconfig:       select ARCH_WANTS_NO_INSTR
> 
> I'm thinking you can simply use that same condition here?

New one on me!  And it does look like that would work, and it also
looks like other code assumes that these architectures have all of their
deep-idle and entry/exit functions either inlined or noinstr-ed, so it
should be OK for Tasks Trace Rude to do likewise.  Thank you!!!

If you would like a sneak preview, please see the last few commits on
the "dev" branch of -rcu.  And this is easier than my original plan
immortalized (at least temporarily) on the "dev.2024.07.02a" branch.

Things left to do: (1) Rebase fixes into original commits. (2)
Make RCU Tasks stop ignoring idle tasks.  (3) Reorder the commits
for bisectability.  (4) Make rcutorture test RCU Tasks Rude even when
running on platforms that don't need it.  (5) Fix other bugs that I have
not yet spotted.

I expect to post an RFC patch early next week.  Unless there is some
emergency, I will slate these for the v6.12 merge window to give them
some soak time.

							Thanx, Paul
Paul E. McKenney July 4, 2024, 3:44 p.m. UTC | #20
On Thu, Jul 04, 2024 at 11:15:59AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:
> 
> > 2. More tactically, RCU protection seems like the best way forward. We
> > got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> > clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> > flavor (whatever that is, I have no idea).
> > 
> > Now, RCU Tasks Trace were specifically designed for least overhead
> > hotpath (reader side) performance, at the expense of slowing down much
> > rarer writers. My microbenchmarking does show at least 5% difference.
> > Both flavors can handle sleepable uprobes waiting for page faults.
> > Tasks Trace flavor is already used for tracing in the BPF realm,
> > including for sleepable uprobes and works well. It's not going away.
> 
> I need to look into this new RCU flavour and why it exists -- for
> example, why can't SRCU be improved to gain the same benefits. This is
> what we've always done, improve SRCU.

Well, it is all software.  And I certainly pushed SRCU hard.  If I recall
correctly, it took them a year to convince me that they needed something
more than SRCU could reasonably be convinced to do.

The big problem is that they need to be able to hook a simple BPF program
(for example, count the number of calls with given argument values) on
a fastpath function on a system running in production without causing
the automation to decide that this system is too slow, thus whacking it
over the head.  Any appreciable overhead is a no-go in this use case.
It is not just that the srcu_read_lock() function's smp_mb() call would
disqualify SRCU, its other added overhead would as well.  Plus this needs
RCU Tasks Trace CPU stall warnings to catch abuse, and SRCU doesn't
impose any limits on readers (how long to set the stall time?) and
doesn't track tasks.

> > Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> > haven't seen a single argument why. Please provide that, or let's
> > stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> > of what Tasks Trace flavor was designed for.
> 
> Because I actually know SRCU, and because it provides a local scope.
> It isolates the unregister waiters from other random users. I'm not
> going to use this funky new flavour until I truly understand it.

It is only a few hundred lines of code on top of the infrastructure
that also supports RCU Tasks and RCU Tasks Rude.  If you understand
SRCU and preemptible RCU, there will be nothing exotic there, and it is
simpler than Tree SRCU, to say nothing of preemptible RCU.  I would be
more than happy to take you through it if you would like, but not before
this coming Monday.

> Also, we actually want two scopes here, there is no reason for the
> consumer unreg to wait for the retprobe stuff.

I don't know that the performance requirements for userspace retprobes are
as severe as for function-call probes -- on that, I must defer to Andrii.
To your two-scopes point, it is quite possible that SRCU could be used
for userspace retprobes and RCU Tasks Trace for the others.  It certainly
seems to me that SRCU would be better than explicit reference counting,
but I could be missing something.  (Memory footprint, perhaps?  Though
maybe a single srcu_struct could be shared among all userspace retprobes.
Given the time-bounded reads, maybe stall warnings aren't needed,
give or take things like interrupts, preemption, and vCPU preemption.
Plus it is not like it would be hard to figure out which read-side code
region was at fault when the synchronize_srcu() took too long.)

							Thanx, Paul

> > 3. Regardless of RCU flavor, due to RCU protection, we have to add
> > batched register/unregister APIs, so we can amortize sync_rcu cost
> > during deregistration. Can we please agree on that as well? This is
> > the main goal of this patch set and I'd like to land it before working
> > further on changing and improving the rest of the locking schema.
> 
> See my patch here:
> 
>   https://lkml.kernel.org/r/20240704084524.GC28838@noisy.programming.kicks-ass.net
> 
> I don't think it needs to be more complicated than that.
> 
> > I won't be happy about it, but just to move things forward, I can drop
> > a) custom refcounting and/or b) percpu RW semaphore. Both are
> > beneficial but not essential for batched APIs work. But if you force
> > me to do that, please state clearly your reasons/arguments.
> 
> The reason I'm pushing RCU here is because AFAICT uprobes doesn't
> actually need the stronger serialisation that rwlock (any flavour)
> provide. It is a prime candidate for RCU, and I think you'll find plenty
> papers / articles (by both Paul and others) that show that RCU scales
> better.
> 
> As a bonus, you avoid that horrific write side cost that per-cpu rwsem
> has.
> 
> The reason I'm not keen on that refcount thing was initially because I
> did not understand the justification for it, but worse, once I did read
> your justification, your very own numbers convinced me that the refcount
> is fundamentally problematic, in any way shape or form.
> 
> > No one had yet pointed out why refcounting is broken 
> 
> Your very own numbers point out that refcounting is a problem here. 
> 
> > and why percpu RW semaphore is bad. 
> 
> Literature and history show us that RCU -- where possible -- is
> always better than any reader-writer locking scheme.
> 
> > 4. Another tactical thing, but an important one. Refcounting schema
> > for uprobes. I've replied already, but I think refcounting is
> > unavoidable for uretprobes,
> 
> I think we can fix that, I replied here:
> 
>   https://lkml.kernel.org/r/20240704083152.GQ11386@noisy.programming.kicks-ass.net
> 
> > and current refcounting schema is
> > problematic for batched APIs due to race between finding uprobe and
> > there still being a possibility we'd need to undo all that and retry
> > again.
> 
> Right, I've not looked too deeply at that, because I've not seen a
> reason to actually change that. I can go think about it if you want, but
> meh.