Message ID | 20240321145736.2373846-1-jonathan.haslam@gmail.com (mailing list archive) |
---|---|
State | Not Applicable |
Series | uprobes: reduce contention on uprobes_tree access |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Thu, Mar 21, 2024 at 7:57 AM Jonathan Haslam <jonathan.haslam@gmail.com> wrote: > > Active uprobes are stored in an RB tree and accesses to this tree are > dominated by read operations. Currently these accesses are serialized by > a spinlock but this leads to enormous contention when large numbers of > threads are executing active probes. > > This patch converts the spinlock used to serialize access to the > uprobes_tree RB tree into a reader-writer spinlock. This lock type > aligns naturally with the overwhelmingly read-only nature of the tree > usage here. Although the addition of reader-writer spinlocks are > discouraged [0], this fix is proposed as an interim solution while an > RCU based approach is implemented (that work is in a nascent form). This > fix also has the benefit of being trivial, self contained and therefore > simple to backport. Yep, makes sense, I think we'll want to backport this ASAP to some of the old kernels we have. Thanks! Acked-by: Andrii Nakryiko <andrii@kernel.org> > > This change has been tested against production workloads that exhibit > significant contention on the spinlock and an almost order of magnitude > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > [0] https://docs.kernel.org/locking/spinlocks.html > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > --- > kernel/events/uprobes.c | 22 +++++++++++----------- > 1 file changed, 11 insertions(+), 11 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 929e98c62965..42bf9b6e8bc0 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > */ > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > #define UPROBES_HASH_SZ 13 > /* serialize uprobe->pending_list */ > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > { > struct uprobe *uprobe; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > uprobe = __find_uprobe(inode, offset); > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > > return uprobe; > } > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > { > struct uprobe *u; > > - spin_lock(&uprobes_treelock); > + write_lock(&uprobes_treelock); > u = __insert_uprobe(uprobe); > - spin_unlock(&uprobes_treelock); > + write_unlock(&uprobes_treelock); > > return u; > } > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > if (WARN_ON(!uprobe_is_active(uprobe))) > return; > > - spin_lock(&uprobes_treelock); > + write_lock(&uprobes_treelock); > rb_erase(&uprobe->rb_node, &uprobes_tree); > - spin_unlock(&uprobes_treelock); > + write_unlock(&uprobes_treelock); > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > put_uprobe(uprobe); > } > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > min = vaddr_to_offset(vma, start); > max = min + (end - start) - 1; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > n = find_node_in_range(inode, min, max); > if (n) { > for (t = n; t; t = rb_prev(t)) { > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > get_uprobe(u); > } > } > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > } > > /* @vma contains reference counter, not the probed 
instruction. */ > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > min = vaddr_to_offset(vma, start); > max = min + (end - start) - 1; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > n = find_node_in_range(inode, min, max); > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > > return !!n; > } > -- > 2.43.0 >
* Jonathan Haslam <jonathan.haslam@gmail.com> wrote: > Active uprobes are stored in an RB tree and accesses to this tree are > dominated by read operations. Currently these accesses are serialized by > a spinlock but this leads to enormous contention when large numbers of > threads are executing active probes. > > This patch converts the spinlock used to serialize access to the > uprobes_tree RB tree into a reader-writer spinlock. This lock type > aligns naturally with the overwhelmingly read-only nature of the tree > usage here. Although the addition of reader-writer spinlocks are > discouraged [0], this fix is proposed as an interim solution while an > RCU based approach is implemented (that work is in a nascent form). This > fix also has the benefit of being trivial, self contained and therefore > simple to backport. > > This change has been tested against production workloads that exhibit > significant contention on the spinlock and an almost order of magnitude > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). Have you considered/measured per-CPU RW semaphores? Thanks, Ingo
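As a concrete illustration of Ingo's suggestion, here is a minimal sketch of what the two hottest helpers from the quoted patch could look like on top of a per-CPU RW semaphore instead of an rwlock_t. This is not code from the thread: the conversion of uprobes_treelock to a percpu_rw_semaphore, and the bodies below, are assumptions made purely to show the shape of the change.

```c
#include <linux/percpu-rwsem.h>

/* Hypothetical replacement for the DEFINE_RWLOCK() in the patch above. */
DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
	struct uprobe *uprobe;

	/*
	 * The read-side fast path only bumps a per-CPU counter, so probe
	 * hits on many cores stop bouncing a shared lock cache line.
	 */
	percpu_down_read(&uprobes_treelock);
	uprobe = __find_uprobe(inode, offset);
	percpu_up_read(&uprobes_treelock);

	return uprobe;
}

static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
	struct uprobe *u;

	/*
	 * The write side is heavyweight: it forces readers onto the slow
	 * path and can wait for an RCU grace period, which is exactly the
	 * registration cost debated later in this thread.
	 */
	percpu_down_write(&uprobes_treelock);
	u = __insert_uprobe(uprobe);
	percpu_up_write(&uprobes_treelock);

	return u;
}
```

The trade-off is visible in the comments: reads become nearly free, while writes (probe registration and removal) become much more expensive than with write_lock().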
On Thu, 21 Mar 2024 07:57:35 -0700 Jonathan Haslam <jonathan.haslam@gmail.com> wrote: > Active uprobes are stored in an RB tree and accesses to this tree are > dominated by read operations. Currently these accesses are serialized by > a spinlock but this leads to enormous contention when large numbers of > threads are executing active probes. > > This patch converts the spinlock used to serialize access to the > uprobes_tree RB tree into a reader-writer spinlock. This lock type > aligns naturally with the overwhelmingly read-only nature of the tree > usage here. Although the addition of reader-writer spinlocks are > discouraged [0], this fix is proposed as an interim solution while an > RCU based approach is implemented (that work is in a nascent form). This > fix also has the benefit of being trivial, self contained and therefore > simple to backport. > > This change has been tested against production workloads that exhibit > significant contention on the spinlock and an almost order of magnitude > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). Looks good to me. Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> BTW, how did you measure the overhead? I think spinlock overhead will depend on how much lock contention happens. Thank you, > > [0] https://docs.kernel.org/locking/spinlocks.html > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > --- > kernel/events/uprobes.c | 22 +++++++++++----------- > 1 file changed, 11 insertions(+), 11 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 929e98c62965..42bf9b6e8bc0 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > */ > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > #define UPROBES_HASH_SZ 13 > /* serialize uprobe->pending_list */ > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > { > struct uprobe *uprobe; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > uprobe = __find_uprobe(inode, offset); > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > > return uprobe; > } > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > { > struct uprobe *u; > > - spin_lock(&uprobes_treelock); > + write_lock(&uprobes_treelock); > u = __insert_uprobe(uprobe); > - spin_unlock(&uprobes_treelock); > + write_unlock(&uprobes_treelock); > > return u; > } > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > if (WARN_ON(!uprobe_is_active(uprobe))) > return; > > - spin_lock(&uprobes_treelock); > + write_lock(&uprobes_treelock); > rb_erase(&uprobe->rb_node, &uprobes_tree); > - spin_unlock(&uprobes_treelock); > + write_unlock(&uprobes_treelock); > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > put_uprobe(uprobe); > } > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > min = vaddr_to_offset(vma, start); > max = min + (end - start) - 1; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > n = find_node_in_range(inode, min, max); > if (n) { > for (t = n; t; t = rb_prev(t)) { > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > get_uprobe(u); > } > } > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > } > > 
/* @vma contains reference counter, not the probed instruction. */ > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > min = vaddr_to_offset(vma, start); > max = min + (end - start) - 1; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > n = find_node_in_range(inode, min, max); > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > > return !!n; > } > -- > 2.43.0 >
Hi Masami, > > This change has been tested against production workloads that exhibit > > significant contention on the spinlock and an almost order of magnitude > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> > > BTW, how did you measure the overhead? I think spinlock overhead > will depend on how much lock contention happens. Absolutely. I have the original production workload to test this with and a derived one that mimics this test case. The production case has ~24 threads running on a 192 core system which access 14 USDTs around 1.5 million times per second in total (across all USDTs). My test case is similar but can drive a higher rate of USDT access across more threads and therefore generate higher contention. All measurements are done using bpftrace scripts around relevant parts of code in uprobes.c and application code. Jon. > > Thank you, > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > --- > > kernel/events/uprobes.c | 22 +++++++++++----------- > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 929e98c62965..42bf9b6e8bc0 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > */ > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > #define UPROBES_HASH_SZ 13 > > /* serialize uprobe->pending_list */ > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > { > > struct uprobe *uprobe; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > uprobe = __find_uprobe(inode, offset); > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > > > return uprobe; > > } > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > { > > struct uprobe *u; > > > > - spin_lock(&uprobes_treelock); > > + write_lock(&uprobes_treelock); > > u = __insert_uprobe(uprobe); > > - spin_unlock(&uprobes_treelock); > > + write_unlock(&uprobes_treelock); > > > > return u; > > } > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > if (WARN_ON(!uprobe_is_active(uprobe))) > > return; > > > > - spin_lock(&uprobes_treelock); > > + write_lock(&uprobes_treelock); > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > - spin_unlock(&uprobes_treelock); > > + write_unlock(&uprobes_treelock); > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > put_uprobe(uprobe); > > } > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > min = vaddr_to_offset(vma, start); > > max = min + (end - start) - 1; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > n = find_node_in_range(inode, min, max); > > if (n) { > > for (t = n; t; t = rb_prev(t)) { > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > get_uprobe(u); > > } > > } > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > } > > > > /* @vma contains reference counter, not the probed instruction. 
*/ > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > min = vaddr_to_offset(vma, start); > > max = min + (end - start) - 1; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > n = find_node_in_range(inode, min, max); > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > > > return !!n; > > } > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
Hi Ingo, > > This change has been tested against production workloads that exhibit > > significant contention on the spinlock and an almost order of magnitude > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > Have you considered/measured per-CPU RW semaphores? No I hadn't but thanks hugely for suggesting it! In initial measurements it seems to be between 20-100% faster than the RW spinlocks! Apologies for all the exclamation marks but I'm very excited. I'll do some more testing tomorrow but so far it's looking very good. Thanks again for the input. Jon.
On Mon, Mar 25, 2024 at 12:12 PM Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > > Hi Ingo, > > > > This change has been tested against production workloads that exhibit > > > significant contention on the spinlock and an almost order of magnitude > > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > > > Have you considered/measured per-CPU RW semaphores? > > No I hadn't but thanks hugely for suggesting it! In initial measurements > it seems to be between 20-100% faster than the RW spinlocks! Apologies for > all the exclamation marks but I'm very excited. I'll do some more testing > tomorrow but so far it's looking very good. > Documentation ([0]) says that locking for writing calls synchronize_rcu(), is that right? If that's true, attaching multiple uprobes (including just attaching a single BPF multi-uprobe) will take a really long time. We need to confirm we are not significantly regressing this. And if we do, we need to take measures in the BPF multi-uprobe attachment code path to make sure that a single multi-uprobe attachment is still fast. If my worries above turn out to be true, it still feels like a first good step should be landing this patch as is (and get it backported to older kernels), and then have percpu rw-semaphore as a final (and a bit more invasive) solution (it's RCU-based, so feels like a good primitive to settle on), making sure to not regress multi-uprobes (we'll probably will need some batched API for multiple uprobes). Thoughts? [0] https://docs.kernel.org/locking/percpu-rw-semaphore.html > Thanks again for the input. > > Jon.
> > > Have you considered/measured per-CPU RW semaphores? > > > > No I hadn't but thanks hugely for suggesting it! In initial measurements > > it seems to be between 20-100% faster than the RW spinlocks! Apologies for > > all the exclamation marks but I'm very excited. I'll do some more testing > > tomorrow but so far it's looking very good. > > > > Documentation ([0]) says that locking for writing calls > synchronize_rcu(), is that right? If that's true, attaching multiple > uprobes (including just attaching a single BPF multi-uprobe) will take > a really long time. We need to confirm we are not significantly > regressing this. And if we do, we need to take measures in the BPF > multi-uprobe attachment code path to make sure that a single > multi-uprobe attachment is still fast. > > If my worries above turn out to be true, it still feels like a first > good step should be landing this patch as is (and get it backported to > older kernels), and then have percpu rw-semaphore as a final (and a > bit more invasive) solution (it's RCU-based, so feels like a good > primitive to settle on), making sure to not regress multi-uprobes > (we'll probably will need some batched API for multiple uprobes). > > Thoughts? Agreed. In the percpu_down_write() path we call rcu_sync_enter() which is what calls into synchronize_rcu(). I haven't done the measurements yet but I would imagine this has to regress probe attachment, at least in the uncontended case. Of course, reads are by far the dominant mode here but we probably shouldn't punish writes excessively. I will do some measurements to quantify the write penalty here. I agree that a batched interface for probe attachment is needed here. The usual mode of operation for us is that we have a number of USDTs (uprobes) in hand and we want to enable and disable them in one shot. Removing the need to do multiple locking operations is definitely an efficiency improvement that needs to be done. Tie that together with per-CPU RW semaphores and this should scale extremely well in both a read and write case. Jon.
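A rough sketch of the batched interface Jon is describing, assuming the per-CPU RW semaphore conversion sketched earlier in the thread. The helper name, its signature, and the omission of consumer registration, deduplication and refcounting are all simplifications for illustration, not a proposed kernel API.

```c
/*
 * Hypothetical batch insert: pay for the expensive write-side
 * acquisition (and any RCU grace period behind it) once for the whole
 * set of probes, instead of once per uprobe_register() call.
 */
static void uprobe_insert_batch(struct uprobe **uprobes, int cnt)
{
	int i;

	percpu_down_write(&uprobes_treelock);
	for (i = 0; i < cnt; i++)
		__insert_uprobe(uprobes[i]);	/* rb-tree insert only */
	percpu_up_write(&uprobes_treelock);
}
```

Enabling or disabling a set of USDTs would then cost one write-side round trip plus N cheap tree operations, which is what makes the heavier write side of a per-CPU semaphore tolerable.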
On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > On Thu, 21 Mar 2024 07:57:35 -0700 > Jonathan Haslam <jonathan.haslam@gmail.com> wrote: > > > Active uprobes are stored in an RB tree and accesses to this tree are > > dominated by read operations. Currently these accesses are serialized by > > a spinlock but this leads to enormous contention when large numbers of > > threads are executing active probes. > > > > This patch converts the spinlock used to serialize access to the > > uprobes_tree RB tree into a reader-writer spinlock. This lock type > > aligns naturally with the overwhelmingly read-only nature of the tree > > usage here. Although the addition of reader-writer spinlocks are > > discouraged [0], this fix is proposed as an interim solution while an > > RCU based approach is implemented (that work is in a nascent form). This > > fix also has the benefit of being trivial, self contained and therefore > > simple to backport. > > > > This change has been tested against production workloads that exhibit > > significant contention on the spinlock and an almost order of magnitude > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Masami, Given the discussion around per-cpu rw semaphore and need for (internal) batched attachment API for uprobes, do you think you can apply this patch as is for now? We can then gain initial improvements in scalability that are also easy to backport, and Jonathan will work on a more complete solution based on per-cpu RW semaphore, as suggested by Ingo. > > BTW, how did you measure the overhead? I think spinlock overhead > will depend on how much lock contention happens. > > Thank you, > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > --- > > kernel/events/uprobes.c | 22 +++++++++++----------- > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 929e98c62965..42bf9b6e8bc0 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > */ > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > #define UPROBES_HASH_SZ 13 > > /* serialize uprobe->pending_list */ > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > { > > struct uprobe *uprobe; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > uprobe = __find_uprobe(inode, offset); > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > > > return uprobe; > > } > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > { > > struct uprobe *u; > > > > - spin_lock(&uprobes_treelock); > > + write_lock(&uprobes_treelock); > > u = __insert_uprobe(uprobe); > > - spin_unlock(&uprobes_treelock); > > + write_unlock(&uprobes_treelock); > > > > return u; > > } > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > if (WARN_ON(!uprobe_is_active(uprobe))) > > return; > > > > - spin_lock(&uprobes_treelock); > > + write_lock(&uprobes_treelock); > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > - spin_unlock(&uprobes_treelock); > > + 
write_unlock(&uprobes_treelock); > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > put_uprobe(uprobe); > > } > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > min = vaddr_to_offset(vma, start); > > max = min + (end - start) - 1; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > n = find_node_in_range(inode, min, max); > > if (n) { > > for (t = n; t; t = rb_prev(t)) { > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > get_uprobe(u); > > } > > } > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > } > > > > /* @vma contains reference counter, not the probed instruction. */ > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > min = vaddr_to_offset(vma, start); > > max = min + (end - start) - 1; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > n = find_node_in_range(inode, min, max); > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > > > return !!n; > > } > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Tue, 26 Mar 2024 09:01:47 -0700 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > > > On Thu, 21 Mar 2024 07:57:35 -0700 > > Jonathan Haslam <jonathan.haslam@gmail.com> wrote: > > > > > Active uprobes are stored in an RB tree and accesses to this tree are > > > dominated by read operations. Currently these accesses are serialized by > > > a spinlock but this leads to enormous contention when large numbers of > > > threads are executing active probes. > > > > > > This patch converts the spinlock used to serialize access to the > > > uprobes_tree RB tree into a reader-writer spinlock. This lock type > > > aligns naturally with the overwhelmingly read-only nature of the tree > > > usage here. Although the addition of reader-writer spinlocks are > > > discouraged [0], this fix is proposed as an interim solution while an > > > RCU based approach is implemented (that work is in a nascent form). This > > > fix also has the benefit of being trivial, self contained and therefore > > > simple to backport. > > > > > > This change has been tested against production workloads that exhibit > > > significant contention on the spinlock and an almost order of magnitude > > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > > > Looks good to me. > > > > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> > > Masami, > > Given the discussion around per-cpu rw semaphore and need for > (internal) batched attachment API for uprobes, do you think you can > apply this patch as is for now? We can then gain initial improvements > in scalability that are also easy to backport, and Jonathan will work > on a more complete solution based on per-cpu RW semaphore, as > suggested by Ingo. Yeah, it is interesting to use per-cpu rw semaphore on uprobe. I would like to wait for the next version. Thank you, > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > will depend on how much lock contention happens. 
> > > > Thank you, > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > --- > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > --- a/kernel/events/uprobes.c > > > +++ b/kernel/events/uprobes.c > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > */ > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > #define UPROBES_HASH_SZ 13 > > > /* serialize uprobe->pending_list */ > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > { > > > struct uprobe *uprobe; > > > > > > - spin_lock(&uprobes_treelock); > > > + read_lock(&uprobes_treelock); > > > uprobe = __find_uprobe(inode, offset); > > > - spin_unlock(&uprobes_treelock); > > > + read_unlock(&uprobes_treelock); > > > > > > return uprobe; > > > } > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > { > > > struct uprobe *u; > > > > > > - spin_lock(&uprobes_treelock); > > > + write_lock(&uprobes_treelock); > > > u = __insert_uprobe(uprobe); > > > - spin_unlock(&uprobes_treelock); > > > + write_unlock(&uprobes_treelock); > > > > > > return u; > > > } > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > return; > > > > > > - spin_lock(&uprobes_treelock); > > > + write_lock(&uprobes_treelock); > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > - spin_unlock(&uprobes_treelock); > > > + write_unlock(&uprobes_treelock); > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > put_uprobe(uprobe); > > > } > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > min = vaddr_to_offset(vma, start); > > > max = min + (end - start) - 1; > > > > > > - spin_lock(&uprobes_treelock); > > > + read_lock(&uprobes_treelock); > > > n = find_node_in_range(inode, min, max); > > > if (n) { > > > for (t = n; t; t = rb_prev(t)) { > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > get_uprobe(u); > > > } > > > } > > > - spin_unlock(&uprobes_treelock); > > > + read_unlock(&uprobes_treelock); > > > } > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > min = vaddr_to_offset(vma, start); > > > max = min + (end - start) - 1; > > > > > > - spin_lock(&uprobes_treelock); > > > + read_lock(&uprobes_treelock); > > > n = find_node_in_range(inode, min, max); > > > - spin_unlock(&uprobes_treelock); > > > + read_unlock(&uprobes_treelock); > > > > > > return !!n; > > > } > > > -- > > > 2.43.0 > > > > > > > > > -- > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Mon, 25 Mar 2024 19:04:59 +0000 Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > Hi Masami, > > > > This change has been tested against production workloads that exhibit > > > significant contention on the spinlock and an almost order of magnitude > > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > > > Looks good to me. > > > > Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > BTW, how did you measure the overhead? I think spinlock overhead > > will depend on how much lock contention happens. > > Absolutely. I have the original production workload to test this with and > a derived one that mimics this test case. The production case has ~24 > threads running on a 192 core system which access 14 USDTs around 1.5 > million times per second in total (across all USDTs). My test case is > similar but can drive a higher rate of USDT access across more threads and > therefore generate higher contention. Thanks for the info. So this result is measured in enough large machine with high parallelism. So lock contention is matter. Can you also include this information with the number in next version? Thank you, > > All measurements are done using bpftrace scripts around relevant parts of > code in uprobes.c and application code. > > Jon. > > > > > Thank you, > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > --- > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > --- a/kernel/events/uprobes.c > > > +++ b/kernel/events/uprobes.c > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > */ > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > #define UPROBES_HASH_SZ 13 > > > /* serialize uprobe->pending_list */ > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > { > > > struct uprobe *uprobe; > > > > > > - spin_lock(&uprobes_treelock); > > > + read_lock(&uprobes_treelock); > > > uprobe = __find_uprobe(inode, offset); > > > - spin_unlock(&uprobes_treelock); > > > + read_unlock(&uprobes_treelock); > > > > > > return uprobe; > > > } > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > { > > > struct uprobe *u; > > > > > > - spin_lock(&uprobes_treelock); > > > + write_lock(&uprobes_treelock); > > > u = __insert_uprobe(uprobe); > > > - spin_unlock(&uprobes_treelock); > > > + write_unlock(&uprobes_treelock); > > > > > > return u; > > > } > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > return; > > > > > > - spin_lock(&uprobes_treelock); > > > + write_lock(&uprobes_treelock); > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > - spin_unlock(&uprobes_treelock); > > > + write_unlock(&uprobes_treelock); > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > put_uprobe(uprobe); > > > } > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > min = vaddr_to_offset(vma, start); > > > max = min + (end - start) - 1; > > > > > > - spin_lock(&uprobes_treelock); > > > + 
read_lock(&uprobes_treelock); > > > n = find_node_in_range(inode, min, max); > > > if (n) { > > > for (t = n; t; t = rb_prev(t)) { > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > get_uprobe(u); > > > } > > > } > > > - spin_unlock(&uprobes_treelock); > > > + read_unlock(&uprobes_treelock); > > > } > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > min = vaddr_to_offset(vma, start); > > > max = min + (end - start) - 1; > > > > > > - spin_lock(&uprobes_treelock); > > > + read_lock(&uprobes_treelock); > > > n = find_node_in_range(inode, min, max); > > > - spin_unlock(&uprobes_treelock); > > > + read_unlock(&uprobes_treelock); > > > > > > return !!n; > > > } > > > -- > > > 2.43.0 > > > > > > > > > -- > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > Masami, > > > > Given the discussion around per-cpu rw semaphore and need for > > (internal) batched attachment API for uprobes, do you think you can > > apply this patch as is for now? We can then gain initial improvements > > in scalability that are also easy to backport, and Jonathan will work > > on a more complete solution based on per-cpu RW semaphore, as > > suggested by Ingo. > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > I would like to wait for the next version. My initial tests show a nice improvement on the over RW spinlocks but significant regression in acquiring a write lock. I've got a few days vacation over Easter but I'll aim to get some more formalised results out to the thread toward the end of next week. Jon. > > Thank you, > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > will depend on how much lock contention happens. > > > > > > Thank you, > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > --- > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > --- a/kernel/events/uprobes.c > > > > +++ b/kernel/events/uprobes.c > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > */ > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > /* serialize uprobe->pending_list */ > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > { > > > > struct uprobe *uprobe; > > > > > > > > - spin_lock(&uprobes_treelock); > > > > + read_lock(&uprobes_treelock); > > > > uprobe = __find_uprobe(inode, offset); > > > > - spin_unlock(&uprobes_treelock); > > > > + read_unlock(&uprobes_treelock); > > > > > > > > return uprobe; > > > > } > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > { > > > > struct uprobe *u; > > > > > > > > - spin_lock(&uprobes_treelock); > > > > + write_lock(&uprobes_treelock); > > > > u = __insert_uprobe(uprobe); > > > > - spin_unlock(&uprobes_treelock); > > > > + write_unlock(&uprobes_treelock); > > > > > > > > return u; > > > > } > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > return; > > > > > > > > - spin_lock(&uprobes_treelock); > > > > + write_lock(&uprobes_treelock); > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > - spin_unlock(&uprobes_treelock); > > > > + write_unlock(&uprobes_treelock); > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > put_uprobe(uprobe); > > > > } > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > min = vaddr_to_offset(vma, start); > > > > max = min + (end - start) - 1; > > > > > > > > - spin_lock(&uprobes_treelock); > > > > + read_lock(&uprobes_treelock); > > > > n = find_node_in_range(inode, min, max); > > > > if (n) { > > > > for (t = n; t; t = rb_prev(t)) { > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > get_uprobe(u); > > > > } > > > > } > > > 
> - spin_unlock(&uprobes_treelock); > > > > + read_unlock(&uprobes_treelock); > > > > } > > > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > min = vaddr_to_offset(vma, start); > > > > max = min + (end - start) - 1; > > > > > > > > - spin_lock(&uprobes_treelock); > > > > + read_lock(&uprobes_treelock); > > > > n = find_node_in_range(inode, min, max); > > > > - spin_unlock(&uprobes_treelock); > > > > + read_unlock(&uprobes_treelock); > > > > > > > > return !!n; > > > > } > > > > -- > > > > 2.43.0 > > > > > > > > > > > > > -- > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Wed, 27 Mar 2024 17:06:01 +0000 Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > > > Masami, > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > (internal) batched attachment API for uprobes, do you think you can > > > apply this patch as is for now? We can then gain initial improvements > > > in scalability that are also easy to backport, and Jonathan will work > > > on a more complete solution based on per-cpu RW semaphore, as > > > suggested by Ingo. > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > I would like to wait for the next version. > > My initial tests show a nice improvement on the over RW spinlocks but > significant regression in acquiring a write lock. I've got a few days > vacation over Easter but I'll aim to get some more formalised results out > to the thread toward the end of next week. As far as the write lock is only on the cold path, I think you can choose per-cpu RW semaphore. Since it does not do busy wait, the total system performance impact will be small. I look forward to your formalized results :) Thank you, > > Jon. > > > > > Thank you, > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > will depend on how much lock contention happens. > > > > > > > > Thank you, > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > --- > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > --- a/kernel/events/uprobes.c > > > > > +++ b/kernel/events/uprobes.c > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > */ > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > /* serialize uprobe->pending_list */ > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > { > > > > > struct uprobe *uprobe; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + read_lock(&uprobes_treelock); > > > > > uprobe = __find_uprobe(inode, offset); > > > > > - spin_unlock(&uprobes_treelock); > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > return uprobe; > > > > > } > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > { > > > > > struct uprobe *u; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + write_lock(&uprobes_treelock); > > > > > u = __insert_uprobe(uprobe); > > > > > - spin_unlock(&uprobes_treelock); > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > return u; > > > > > } > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > return; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + write_lock(&uprobes_treelock); > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > - spin_unlock(&uprobes_treelock); > > > > > + write_unlock(&uprobes_treelock); > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > put_uprobe(uprobe); > > > 
> > } > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > min = vaddr_to_offset(vma, start); > > > > > max = min + (end - start) - 1; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + read_lock(&uprobes_treelock); > > > > > n = find_node_in_range(inode, min, max); > > > > > if (n) { > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > get_uprobe(u); > > > > > } > > > > > } > > > > > - spin_unlock(&uprobes_treelock); > > > > > + read_unlock(&uprobes_treelock); > > > > > } > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > min = vaddr_to_offset(vma, start); > > > > > max = min + (end - start) - 1; > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > + read_lock(&uprobes_treelock); > > > > > n = find_node_in_range(inode, min, max); > > > > > - spin_unlock(&uprobes_treelock); > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > return !!n; > > > > > } > > > > > -- > > > > > 2.43.0 > > > > > > > > > > > > > > > > > -- > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > -- > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > On Wed, 27 Mar 2024 17:06:01 +0000 > Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > > > > > Masami, > > > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > (internal) batched attachment API for uprobes, do you think you can > > > > apply this patch as is for now? We can then gain initial improvements > > > > in scalability that are also easy to backport, and Jonathan will work > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > suggested by Ingo. > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > I would like to wait for the next version. > > > > My initial tests show a nice improvement on the over RW spinlocks but > > significant regression in acquiring a write lock. I've got a few days > > vacation over Easter but I'll aim to get some more formalised results out > > to the thread toward the end of next week. > > As far as the write lock is only on the cold path, I think you can choose > per-cpu RW semaphore. Since it does not do busy wait, the total system > performance impact will be small. No, Masami, unfortunately it's not as simple. In BPF we have BPF multi-uprobe, which can be used to attach to thousands of user functions. It currently creates one uprobe at a time, as we don't really have a batched API. If each such uprobe registration will now take a (relatively) long time, when multiplied by number of attach-to user functions, it will be a horrible regression in terms of attachment/detachment performance. So when we switch to per-CPU rw semaphore, we'll need to provide an internal batch uprobe attach/detach API to make sure that attaching to multiple uprobes is still fast. Which is why I was asking to land this patch as is, as it relieves the scalability pains in production and is easy to backport to old kernels. And then we can work on batched APIs and switch to per-CPU rw semaphore. So I hope you can reconsider and accept improvements in this patch, while Jonathan will keep working on even better final solution. Thanks! > I look forward to your formalized results :) > > Thank you, > > > > > Jon. > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > will depend on how much lock contention happens. 
> > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > --- > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > --- a/kernel/events/uprobes.c > > > > > > +++ b/kernel/events/uprobes.c > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > */ > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > /* serialize uprobe->pending_list */ > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > { > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + read_lock(&uprobes_treelock); > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > return uprobe; > > > > > > } > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > { > > > > > > struct uprobe *u; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + write_lock(&uprobes_treelock); > > > > > > u = __insert_uprobe(uprobe); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > return u; > > > > > > } > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > return; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + write_lock(&uprobes_treelock); > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + write_unlock(&uprobes_treelock); > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > put_uprobe(uprobe); > > > > > > } > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > min = vaddr_to_offset(vma, start); > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + read_lock(&uprobes_treelock); > > > > > > n = find_node_in_range(inode, min, max); > > > > > > if (n) { > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > > get_uprobe(u); > > > > > > } > > > > > > } > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + read_unlock(&uprobes_treelock); > > > > > > } > > > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. 
*/ > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > min = vaddr_to_offset(vma, start); > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + read_lock(&uprobes_treelock); > > > > > > n = find_node_in_range(inode, min, max); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > return !!n; > > > > > > } > > > > > > -- > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > -- > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > -- > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
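To make the multiplication Andrii describes concrete, here is a back-of-the-envelope model as a tiny standalone C program. The grace-period figure and the probe count are assumptions chosen only to show the shape of the problem; they are not measurements from this thread.

```c
#include <stdio.h>

int main(void)
{
	/* assumed cost of one write-side percpu_down_write(), dominated
	 * by an RCU grace period; real values vary widely */
	const double grace_ms = 10.0;
	/* e.g. one BPF multi-uprobe link attaching to 5000 functions */
	const int nprobes = 5000;

	/* today: one uprobe registration (one write lock) per function */
	printf("per-probe write locking: %.1f s\n",
	       nprobes * grace_ms / 1000.0);

	/* with a batched API: one write-side section for the whole set */
	printf("batched write locking:   %.3f s plus %d cheap tree inserts\n",
	       grace_ms / 1000.0, nprobes);
	return 0;
}
```

The exact numbers do not matter; the point is that any per-registration grace period gets multiplied by the number of attach-to functions, which is why a batched attach path has to land together with (or before) the per-CPU semaphore.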
On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > > > On Wed, 27 Mar 2024 17:06:01 +0000 > > Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > > > > > > > Masami, > > > > > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > (internal) batched attachment API for uprobes, do you think you can > > > > > apply this patch as is for now? We can then gain initial improvements > > > > > in scalability that are also easy to backport, and Jonathan will work > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > suggested by Ingo. > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > I would like to wait for the next version. > > > > > > My initial tests show a nice improvement on the over RW spinlocks but > > > significant regression in acquiring a write lock. I've got a few days > > > vacation over Easter but I'll aim to get some more formalised results out > > > to the thread toward the end of next week. > > > > As far as the write lock is only on the cold path, I think you can choose > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > performance impact will be small. > > No, Masami, unfortunately it's not as simple. In BPF we have BPF > multi-uprobe, which can be used to attach to thousands of user > functions. It currently creates one uprobe at a time, as we don't > really have a batched API. If each such uprobe registration will now > take a (relatively) long time, when multiplied by number of attach-to > user functions, it will be a horrible regression in terms of > attachment/detachment performance. > > So when we switch to per-CPU rw semaphore, we'll need to provide an > internal batch uprobe attach/detach API to make sure that attaching to > multiple uprobes is still fast. > > Which is why I was asking to land this patch as is, as it relieves the > scalability pains in production and is easy to backport to old > kernels. And then we can work on batched APIs and switch to per-CPU rw > semaphore. > > So I hope you can reconsider and accept improvements in this patch, > while Jonathan will keep working on even better final solution. > Thanks! > > > I look forward to your formalized results :) > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes and USDTs, reporting attach/detach timings: $ sudo ./test_progs -v -t uprobe_multi_test/bench bpf_testmod.ko is already unloaded. Loading bpf_testmod.ko... Successfully loaded bpf_testmod.ko. test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec test_bench_attach_uprobe:PASS:uprobes_count 0 nsec test_bench_attach_uprobe: attached in 0.120s test_bench_attach_uprobe: detached in 0.092s #400/5 uprobe_multi_test/bench_uprobe:OK test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec test_bench_attach_usdt:PASS:usdt_count 0 nsec test_bench_attach_usdt: attached in 0.124s test_bench_attach_usdt: detached in 0.064s #400/6 uprobe_multi_test/bench_usdt:OK #400 uprobe_multi_test:OK Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED Successfully unloaded bpf_testmod.ko. So it should be easy for Jonathan to validate his changes with this. > > Thank you, > > > > > > > > Jon. 
> > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > will depend on how much lock contention happens. > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > > --- > > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > > */ > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > > /* serialize uprobe->pending_list */ > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > > { > > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > return uprobe; > > > > > > > } > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > > { > > > > > > > struct uprobe *u; > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > u = __insert_uprobe(uprobe); > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > > > return u; > > > > > > > } > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > > return; > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > > put_uprobe(uprobe); > > > > > > > } > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > if (n) { > > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > get_uprobe(u); > > > > > > > } > > > > > > > } > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > } > > > > > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. 
*/ > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > return !!n; > > > > > > > } > > > > > > > -- > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > -- > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > -- > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Fri, 29 Mar 2024 10:33:57 -0700 Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > > > > > On Wed, 27 Mar 2024 17:06:01 +0000 > > > Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > > > > > > > > > Masami, > > > > > > > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > > (internal) batched attachment API for uprobes, do you think you can > > > > > > apply this patch as is for now? We can then gain initial improvements > > > > > > in scalability that are also easy to backport, and Jonathan will work > > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > > suggested by Ingo. > > > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > > I would like to wait for the next version. > > > > > > > > My initial tests show a nice improvement on the over RW spinlocks but > > > > significant regression in acquiring a write lock. I've got a few days > > > > vacation over Easter but I'll aim to get some more formalised results out > > > > to the thread toward the end of next week. > > > > > > As far as the write lock is only on the cold path, I think you can choose > > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > > performance impact will be small. > > > > No, Masami, unfortunately it's not as simple. In BPF we have BPF > > multi-uprobe, which can be used to attach to thousands of user > > functions. It currently creates one uprobe at a time, as we don't > > really have a batched API. If each such uprobe registration will now > > take a (relatively) long time, when multiplied by number of attach-to > > user functions, it will be a horrible regression in terms of > > attachment/detachment performance. Ah, got it. So attachment/detachment performance should be counted. > > > > So when we switch to per-CPU rw semaphore, we'll need to provide an > > internal batch uprobe attach/detach API to make sure that attaching to > > multiple uprobes is still fast. Yeah, we need such interface like register_uprobes(...). > > > > Which is why I was asking to land this patch as is, as it relieves the > > scalability pains in production and is easy to backport to old > > kernels. And then we can work on batched APIs and switch to per-CPU rw > > semaphore. OK, then I'll push this to for-next at this moment. Please share if you have a good idea for the batch interface which can be backported. I guess it should involve updating userspace changes too. Thank you! > > > > So I hope you can reconsider and accept improvements in this patch, > > while Jonathan will keep working on even better final solution. > > Thanks! > > > > > I look forward to your formalized results :) > > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes > and USDTs, reporting attach/detach timings: > $ sudo ./test_progs -v -t uprobe_multi_test/bench > bpf_testmod.ko is already unloaded. > Loading bpf_testmod.ko... > Successfully loaded bpf_testmod.ko. 
> test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec > test_bench_attach_uprobe: attached in 0.120s > test_bench_attach_uprobe: detached in 0.092s > #400/5 uprobe_multi_test/bench_uprobe:OK > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec > test_bench_attach_usdt:PASS:usdt_count 0 nsec > test_bench_attach_usdt: attached in 0.124s > test_bench_attach_usdt: detached in 0.064s > #400/6 uprobe_multi_test/bench_usdt:OK > #400 uprobe_multi_test:OK > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED > Successfully unloaded bpf_testmod.ko. > > So it should be easy for Jonathan to validate his changes with this. > > > > Thank you, > > > > > > > > > > > Jon. > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > > will depend on how much lock contention happens. > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > > > --- > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > > > */ > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > > > /* serialize uprobe->pending_list */ > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > > > { > > > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > return uprobe; > > > > > > > > } > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > > > { > > > > > > > > struct uprobe *u; > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > u = __insert_uprobe(uprobe); > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > return u; > > > > > > > > } > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > > > return; > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > + 
write_unlock(&uprobes_treelock); > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > > > put_uprobe(uprobe); > > > > > > > > } > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > if (n) { > > > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > get_uprobe(u); > > > > > > > > } > > > > > > > > } > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > } > > > > > > > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > return !!n; > > > > > > > > } > > > > > > > > -- > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > > > > -- > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > -- > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Fri, Mar 29, 2024 at 5:36 PM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > On Fri, 29 Mar 2024 10:33:57 -0700 > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > > On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <mhiramat@kernel.org> wrote: > > > > > > > > On Wed, 27 Mar 2024 17:06:01 +0000 > > > > Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > > > > > > > > > > > Masami, > > > > > > > > > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > > > (internal) batched attachment API for uprobes, do you think you can > > > > > > > apply this patch as is for now? We can then gain initial improvements > > > > > > > in scalability that are also easy to backport, and Jonathan will work > > > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > > > suggested by Ingo. > > > > > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > > > I would like to wait for the next version. > > > > > > > > > > My initial tests show a nice improvement on the over RW spinlocks but > > > > > significant regression in acquiring a write lock. I've got a few days > > > > > vacation over Easter but I'll aim to get some more formalised results out > > > > > to the thread toward the end of next week. > > > > > > > > As far as the write lock is only on the cold path, I think you can choose > > > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > > > performance impact will be small. > > > > > > No, Masami, unfortunately it's not as simple. In BPF we have BPF > > > multi-uprobe, which can be used to attach to thousands of user > > > functions. It currently creates one uprobe at a time, as we don't > > > really have a batched API. If each such uprobe registration will now > > > take a (relatively) long time, when multiplied by number of attach-to > > > user functions, it will be a horrible regression in terms of > > > attachment/detachment performance. > > Ah, got it. So attachment/detachment performance should be counted. > > > > > > > So when we switch to per-CPU rw semaphore, we'll need to provide an > > > internal batch uprobe attach/detach API to make sure that attaching to > > > multiple uprobes is still fast. > > Yeah, we need such interface like register_uprobes(...). > > > > > > > Which is why I was asking to land this patch as is, as it relieves the > > > scalability pains in production and is easy to backport to old > > > kernels. And then we can work on batched APIs and switch to per-CPU rw > > > semaphore. > > OK, then I'll push this to for-next at this moment. Great, thanks a lot! > Please share if you have a good idea for the batch interface which can be > backported. I guess it should involve updating userspace changes too. > Yep, we'll investigate a best way to provide batch interface for uprobes and will send patches. > Thank you! > > > > > > > So I hope you can reconsider and accept improvements in this patch, > > > while Jonathan will keep working on even better final solution. > > > Thanks! > > > > > > > I look forward to your formalized results :) > > > > > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes > > and USDTs, reporting attach/detach timings: > > $ sudo ./test_progs -v -t uprobe_multi_test/bench > > bpf_testmod.ko is already unloaded. > > Loading bpf_testmod.ko... > > Successfully loaded bpf_testmod.ko. 
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec > > test_bench_attach_uprobe: attached in 0.120s > > test_bench_attach_uprobe: detached in 0.092s > > #400/5 uprobe_multi_test/bench_uprobe:OK > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec > > test_bench_attach_usdt:PASS:usdt_count 0 nsec > > test_bench_attach_usdt: attached in 0.124s > > test_bench_attach_usdt: detached in 0.064s > > #400/6 uprobe_multi_test/bench_usdt:OK > > #400 uprobe_multi_test:OK > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED > > Successfully unloaded bpf_testmod.ko. > > > > So it should be easy for Jonathan to validate his changes with this. > > > > > > Thank you, > > > > > > > > > > > > > > Jon. > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > > > will depend on how much lock contention happens. > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > > > > --- > > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > > > > */ > > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > > > > /* serialize uprobe->pending_list */ > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > > > > { > > > > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > return uprobe; > > > > > > > > > } > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > > > > { > > > > > > > > > struct uprobe *u; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > u = __insert_uprobe(uprobe); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > return u; > > > > > > > > > } > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > > > > return; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + 
write_lock(&uprobes_treelock); > > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > > > > put_uprobe(uprobe); > > > > > > > > > } > > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > if (n) { > > > > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > get_uprobe(u); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > } > > > > > > > > > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > return !!n; > > > > > > > > > } > > > > > > > > > -- > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > > > > > > > -- > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > -- > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
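For illustration, a minimal sketch of the kind of tree-level batching that a register_uprobes()-style interface could build on. The function name and calling convention below are invented for the example, not an existing or proposed kernel API, and duplicate/error handling is omitted; the point is simply that one write-side critical section gets amortized over N insertions, which matters even more once the write side becomes a per-CPU rw semaphore:

/*
 * Hypothetical sketch only - not an existing or proposed interface.
 * Insert a whole batch of uprobes under a single writer section.
 */
static void insert_uprobe_batch(struct uprobe **uprobes, int cnt)
{
	int i;

	write_lock(&uprobes_treelock);
	for (i = 0; i < cnt; i++)
		__insert_uprobe(uprobes[i]);	/* duplicate handling omitted */
	write_unlock(&uprobes_treelock);
}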
> > > > Given the discussion around per-cpu rw semaphore and need for > > > > (internal) batched attachment API for uprobes, do you think you can > > > > apply this patch as is for now? We can then gain initial improvements > > > > in scalability that are also easy to backport, and Jonathan will work > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > suggested by Ingo. > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > I would like to wait for the next version. > > > > My initial tests show a nice improvement on the over RW spinlocks but > > significant regression in acquiring a write lock. I've got a few days > > vacation over Easter but I'll aim to get some more formalised results out > > to the thread toward the end of next week. > > As far as the write lock is only on the cold path, I think you can choose > per-cpu RW semaphore. Since it does not do busy wait, the total system > performance impact will be small. > I look forward to your formalized results :)

Sorry for the delay in getting back to you on this, Masami.

I have used one of the bpf selftest benchmarks to provide some form of comparison of the 3 different approaches (spinlock, RW spinlock and per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop' benchmark which just executes a single uprobe with a minimal bpf program attached. The tests were done on a 32 core qemu/kvm instance.

Things to note about the results:

- The results are slightly variable so don't get too caught up on individual thread count - it's the trend that is important.
- In terms of throughput with this specific benchmark a *very* macro view is that the RW spinlock provides 40-60% more throughput than the spinlock. The per-CPU RW semaphore provides in the order of 50-100% more throughput than the spinlock.
- This doesn't fully reflect the large reduction in latency that we have seen in application based measurements. However, it does demonstrate that even the trivial change of going to a RW spinlock provides significant benefits.

I haven't included the measurements on per-CPU RW semaphore write performance as they are completely in line with those that Paul McKenney posted on his journal [0]. On a 32 core system I see semaphore writes take in the order of 25-28 millisecs - the cost of the synchronize_rcu().

Each block of results below shows 1 line per execution of the benchmark (the "Summary" line) and each line is a run with one more thread added - a thread is a "producer". The lines are edited to remove extraneous output that adds no value here.
The tests were executed with this driver script:

for num_threads in {1..20}
do
        sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
done

spinlock

Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)
Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)
Summary: hits 2.046 ± 0.004M/s ( 0.292M/prod)
Summary: hits 1.818 ± 0.002M/s ( 0.227M/prod)
Summary: hits 1.867 ± 0.024M/s ( 0.207M/prod)
Summary: hits 1.692 ± 0.003M/s ( 0.169M/prod)
Summary: hits 1.778 ± 0.004M/s ( 0.162M/prod)
Summary: hits 1.710 ± 0.011M/s ( 0.142M/prod)
Summary: hits 1.687 ± 0.022M/s ( 0.130M/prod)
Summary: hits 1.697 ± 0.004M/s ( 0.121M/prod)
Summary: hits 1.645 ± 0.011M/s ( 0.110M/prod)
Summary: hits 1.671 ± 0.002M/s ( 0.104M/prod)
Summary: hits 1.692 ± 0.014M/s ( 0.100M/prod)
Summary: hits 1.700 ± 0.015M/s ( 0.094M/prod)
Summary: hits 1.668 ± 0.005M/s ( 0.088M/prod)
Summary: hits 1.644 ± 0.004M/s ( 0.082M/prod)

RW spinlock

Summary: hits 1.465 ± 0.008M/s ( 1.465M/prod)
Summary: hits 1.750 ± 0.035M/s ( 0.875M/prod)
Summary: hits 2.164 ± 0.008M/s ( 0.721M/prod)
Summary: hits 2.235 ± 0.014M/s ( 0.559M/prod)
Summary: hits 2.202 ± 0.005M/s ( 0.440M/prod)
Summary: hits 2.816 ± 0.003M/s ( 0.469M/prod)
Summary: hits 2.364 ± 0.002M/s ( 0.338M/prod)
Summary: hits 2.327 ± 0.008M/s ( 0.291M/prod)
Summary: hits 2.147 ± 0.005M/s ( 0.239M/prod)
Summary: hits 2.266 ± 0.011M/s ( 0.227M/prod)
Summary: hits 2.483 ± 0.003M/s ( 0.226M/prod)
Summary: hits 2.573 ± 0.008M/s ( 0.214M/prod)
Summary: hits 2.569 ± 0.004M/s ( 0.198M/prod)
Summary: hits 2.507 ± 0.013M/s ( 0.179M/prod)
Summary: hits 2.165 ± 0.008M/s ( 0.144M/prod)
Summary: hits 2.524 ± 0.004M/s ( 0.158M/prod)
Summary: hits 2.059 ± 0.020M/s ( 0.121M/prod)
Summary: hits 2.332 ± 0.007M/s ( 0.130M/prod)
Summary: hits 2.404 ± 0.017M/s ( 0.127M/prod)
Summary: hits 2.187 ± 0.002M/s ( 0.109M/prod)

per-CPU RW semaphore

Summary: hits 1.341 ± 0.017M/s ( 1.341M/prod)
Summary: hits 2.397 ± 0.011M/s ( 1.198M/prod)
Summary: hits 3.547 ± 0.041M/s ( 1.182M/prod)
Summary: hits 4.108 ± 0.016M/s ( 1.027M/prod)
Summary: hits 3.138 ± 0.055M/s ( 0.628M/prod)
Summary: hits 3.247 ± 0.017M/s ( 0.541M/prod)
Summary: hits 2.877 ± 0.005M/s ( 0.411M/prod)
Summary: hits 2.880 ± 0.002M/s ( 0.360M/prod)
Summary: hits 2.579 ± 0.001M/s ( 0.287M/prod)
Summary: hits 2.982 ± 0.011M/s ( 0.298M/prod)
Summary: hits 2.603 ± 0.002M/s ( 0.237M/prod)
Summary: hits 3.013 ± 0.004M/s ( 0.251M/prod)
Summary: hits 3.059 ± 0.001M/s ( 0.235M/prod)
Summary: hits 2.721 ± 0.014M/s ( 0.194M/prod)
Summary: hits 2.943 ± 0.005M/s ( 0.196M/prod)
Summary: hits 3.366 ± 0.011M/s ( 0.210M/prod)
Summary: hits 2.459 ± 0.001M/s ( 0.145M/prod)
Summary: hits 3.023 ± 0.024M/s ( 0.168M/prod)
Summary: hits 2.919 ± 0.002M/s ( 0.154M/prod)
Summary: hits 2.569 ± 0.002M/s ( 0.128M/prod)

[0] https://paulmck.livejournal.com/67547.html

Thanks.

Jon.

> > Thank you, > > > > > Jon. > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > will depend on how much lock contention happens.
> > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > --- > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > --- a/kernel/events/uprobes.c > > > > > > +++ b/kernel/events/uprobes.c > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > */ > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > /* serialize uprobe->pending_list */ > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > { > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + read_lock(&uprobes_treelock); > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > return uprobe; > > > > > > } > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > { > > > > > > struct uprobe *u; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + write_lock(&uprobes_treelock); > > > > > > u = __insert_uprobe(uprobe); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > return u; > > > > > > } > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > return; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + write_lock(&uprobes_treelock); > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + write_unlock(&uprobes_treelock); > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > put_uprobe(uprobe); > > > > > > } > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > min = vaddr_to_offset(vma, start); > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + read_lock(&uprobes_treelock); > > > > > > n = find_node_in_range(inode, min, max); > > > > > > if (n) { > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > > get_uprobe(u); > > > > > > } > > > > > > } > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + read_unlock(&uprobes_treelock); > > > > > > } > > > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. 
*/ > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > min = vaddr_to_offset(vma, start); > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + read_lock(&uprobes_treelock); > > > > > > n = find_node_in_range(inode, min, max); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > return !!n; > > > > > > } > > > > > > -- > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > -- > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > -- > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
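To connect the numbers above back to code, here is a minimal sketch of what the per-CPU RW semaphore variant that was benchmarked could look like for the two accessors shown earlier in the thread. It assumes all tree accesses stay in sleepable process context; the semaphore name is made up here and this is not a posted patch:

#include <linux/percpu-rwsem.h>

/* Sketch only: stands in for uprobes_treelock. */
static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treesem);

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
	struct uprobe *uprobe;

	/* Read side: mostly a per-CPU counter update, no shared cache line. */
	percpu_down_read(&uprobes_treesem);
	uprobe = __find_uprobe(inode, offset);
	percpu_up_read(&uprobes_treesem);

	return uprobe;
}

static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
	struct uprobe *u;

	/* Write side: may block for an RCU grace period - the 25-28 millisecs noted above. */
	percpu_down_write(&uprobes_treesem);
	u = __insert_uprobe(uprobe);
	percpu_up_write(&uprobes_treesem);

	return u;
}

That per-registration write-side cost is exactly why the batched attach/detach interface keeps coming up in this thread.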
On Wed, Apr 3, 2024 at 4:05 AM Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > (internal) batched attachment API for uprobes, do you think you can > > > > > apply this patch as is for now? We can then gain initial improvements > > > > > in scalability that are also easy to backport, and Jonathan will work > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > suggested by Ingo. > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > I would like to wait for the next version. > > > > > > My initial tests show a nice improvement on the over RW spinlocks but > > > significant regression in acquiring a write lock. I've got a few days > > > vacation over Easter but I'll aim to get some more formalised results out > > > to the thread toward the end of next week. > > > > As far as the write lock is only on the cold path, I think you can choose > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > performance impact will be small. > > I look forward to your formalized results :) > > Sorry for the delay in getting back to you on this Masami. > > I have used one of the bpf selftest benchmarks to provide some form of > comparison of the 3 different approaches (spinlock, RW spinlock and > per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop' > benchmark which just executes a single uprobe with a minimal bpf program > attached. The tests were done on a 32 core qemu/kvm instance. > Thanks a lot for running benchmarks and providing results! > Things to note about the results: > > - The results are slightly variable so don't get too caught up on > individual thread count - it's the trend that is important. > - In terms of throughput with this specific benchmark a *very* macro view > is that the RW spinlock provides 40-60% more throughput than the > spinlock. The per-CPU RW semaphore provides in the order of 50-100% > more throughput then the spinlock. > - This doesn't fully reflect the large reduction in latency that we have > seen in application based measurements. However, it does demonstrate > that even the trivial change of going to a RW spinlock provides > significant benefits. This is probably because trig-uprobe-nop creates a single uprobe that is triggered on many CPUs. While in production we have also *many* uprobes running on many CPUs. In this benchmark, besides contention on uprobes_treelock, we are also hammering on other per-uprobe locks (register_rwsem, also if you don't have [0] patch locally, there will be another filter lock taken each time, filter->rwlock). There is also atomic refcounting going on, which when you have the same uprobe across all CPUs at the same time will cause a bunch of cache line bouncing. So yes, it's understandable that in practice in production you see an even larger effect of optimizing uprobe_treelock than in this micro-benchmark. [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf > > I haven't included the measurements on per-CPU RW semaphore write > performance as they are completely in line with those that Paul McKenney > posted on his journal [0]. On a 32 core system I see semaphore writes to > take in the order of 25-28 millisecs - the cost of the synchronize_rcu(). 
> > Each block of results below show 1 line per execution of the benchmark (the > "Summary" line) and each line is a run with one more thread added - a > thread is a "producer". The lines are edited to remove extraneous output > that adds no value here. > > The tests were executed with this driver script: > > for num_threads in {1..20} > do > sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary just want to mention -a (affinity) option that you can pass a bench tool, it will pin each thread on its own CPU. It generally makes tests more uniform, eliminating CPU migrations variability. > done > > > spinlock > > Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod) > Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod) > Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod) I also wanted to point out that the first measurement (1.453M/s in this row) is total throughput across all threads, while value in parenthesis (0.900M/prod) is averaged throughput per each thread. So this M/prod value is the most interesting in this benchmark where we assess the effect of reducing contention. > Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod) > Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod) > Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod) [...]
> > Things to note about the results: > > > > - The results are slightly variable so don't get too caught up on > > individual thread count - it's the trend that is important. > > - In terms of throughput with this specific benchmark a *very* macro view > > is that the RW spinlock provides 40-60% more throughput than the > > spinlock. The per-CPU RW semaphore provides in the order of 50-100% > > more throughput then the spinlock. > > - This doesn't fully reflect the large reduction in latency that we have > > seen in application based measurements. However, it does demonstrate > > that even the trivial change of going to a RW spinlock provides > > significant benefits. > > This is probably because trig-uprobe-nop creates a single uprobe that > is triggered on many CPUs. While in production we have also *many* > uprobes running on many CPUs. In this benchmark, besides contention on > uprobes_treelock, we are also hammering on other per-uprobe locks > (register_rwsem, also if you don't have [0] patch locally, there will > be another filter lock taken each time, filter->rwlock). There is also > atomic refcounting going on, which when you have the same uprobe > across all CPUs at the same time will cause a bunch of cache line > bouncing. > > So yes, it's understandable that in practice in production you see an > even larger effect of optimizing uprobe_treelock than in this > micro-benchmark. > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf Thanks for the reply and the thoughts on this Andrii. Yes, I do have the filter->rwlock fix applied but, as you say, there are no doubt other effects at play here as to be expected in such a synthetic workload. I'm pleased with the outcomes though as they show a good result even if they are at the lower end of what I expect. The results also show that pursuing an RCU solution is definitely worth it but that write penalty is brutal in the case of a full synchronize_rcu()! Should be fun. > > for num_threads in {1..20} > > do > > sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary > > just want to mention -a (affinity) option that you can pass a bench > tool, it will pin each thread on its own CPU. It generally makes tests > more uniform, eliminating CPU migrations variability. Thanks for pointing that flag out! Jon. > > > done > > > > > > spinlock > > > > Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod) > > Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod) > > Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod) > > I also wanted to point out that the first measurement (1.453M/s in > this row) is total throughput across all threads, while value in > parenthesis (0.900M/prod) is averaged throughput per each thread. So > this M/prod value is the most interesting in this benchmark where we > assess the effect of reducing contention. > > > Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod) > > Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod) > > Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod) > > [...]
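As a generic illustration of the asymmetry described here - readers nearly free, writers paying a full grace period - the sketch below uses a plain RCU-protected list rather than the rb_tree the uprobe code actually uses, since making rb_tree traversal RCU-safe needs more care than is shown; the names and structure are invented for the example:

#include <linux/mutex.h>
#include <linux/rculist.h>
#include <linux/slab.h>

struct entry {
	unsigned long key;
	struct list_head node;
};

static LIST_HEAD(entries);
static DEFINE_MUTEX(entries_mutex);	/* serializes writers only */

/* Read side: no lock acquisition, no stores to shared cache lines. */
static bool entry_exists(unsigned long key)
{
	struct entry *e;
	bool found = false;

	rcu_read_lock();
	list_for_each_entry_rcu(e, &entries, node) {
		if (e->key == key) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();

	return found;
}

/* Write side: pays a full grace period before the memory can be reused. */
static void entry_del(struct entry *e)
{
	mutex_lock(&entries_mutex);
	list_del_rcu(&e->node);
	mutex_unlock(&entries_mutex);

	synchronize_rcu();	/* the tens-of-millisecs cost being discussed */
	kfree(e);
}

Using kfree_rcu() instead of synchronize_rcu() plus kfree() is the standard way to take that wait off the writer's own path, at the cost of deferred reclaim.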
Hi Masami, > > > Which is why I was asking to land this patch as is, as it relieves the > > > scalability pains in production and is easy to backport to old > > > kernels. And then we can work on batched APIs and switch to per-CPU rw > > > semaphore. > > OK, then I'll push this to for-next at this moment. > Please share if you have a good idea for the batch interface which can be > backported. I guess it should involve updating userspace changes too. Did you (or anyone else) need anything more from me on this one so that it can be pushed? I provided some benchmark numbers but happy to provide anything else that may be required. Thanks! Jon. > > Thank you! > > > > > > > So I hope you can reconsider and accept improvements in this patch, > > > while Jonathan will keep working on even better final solution. > > > Thanks! > > > > > > > I look forward to your formalized results :) > > > > > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes > > and USDTs, reporting attach/detach timings: > > $ sudo ./test_progs -v -t uprobe_multi_test/bench > > bpf_testmod.ko is already unloaded. > > Loading bpf_testmod.ko... > > Successfully loaded bpf_testmod.ko. > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec > > test_bench_attach_uprobe: attached in 0.120s > > test_bench_attach_uprobe: detached in 0.092s > > #400/5 uprobe_multi_test/bench_uprobe:OK > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec > > test_bench_attach_usdt:PASS:usdt_count 0 nsec > > test_bench_attach_usdt: attached in 0.124s > > test_bench_attach_usdt: detached in 0.064s > > #400/6 uprobe_multi_test/bench_usdt:OK > > #400 uprobe_multi_test:OK > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED > > Successfully unloaded bpf_testmod.ko. > > > > So it should be easy for Jonathan to validate his changes with this. > > > > > > Thank you, > > > > > > > > > > > > > > Jon. > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > > > will depend on how much lock contention happens. 
> > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > > > > --- > > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > > > > */ > > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > > > > /* serialize uprobe->pending_list */ > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > > > > { > > > > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > return uprobe; > > > > > > > > > } > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > > > > { > > > > > > > > > struct uprobe *u; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > u = __insert_uprobe(uprobe); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > return u; > > > > > > > > > } > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > > > > return; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > > > > put_uprobe(uprobe); > > > > > > > > > } > > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > if (n) { > > > > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > get_uprobe(u); > > > > > > > > > } > > > > > > > > > } > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > } > > > > > > > > > > > > > > > > > > /* @vma contains reference 
counter, not the probed instruction. */ > > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > return !!n; > > > > > > > > > } > > > > > > > > > -- > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > > > > > > > -- > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > -- > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
On Wed, 10 Apr 2024 11:38:11 +0100 Jonthan Haslam <jonathan.haslam@gmail.com> wrote: > Hi Masami, > > > > > Which is why I was asking to land this patch as is, as it relieves the > > > > scalability pains in production and is easy to backport to old > > > > kernels. And then we can work on batched APIs and switch to per-CPU rw > > > > semaphore. > > > > OK, then I'll push this to for-next at this moment. > > Please share if you have a good idea for the batch interface which can be > > backported. I guess it should involve updating userspace changes too. > > Did you (or anyone else) need anything more from me on this one so that it > can be pushed? I provided some benchmark numbers but happy to provide > anything else that may be required. Yeah, if you can update with the result, it looks better to me. Or, can I update the description? Thank you, > > Thanks! > > Jon. > > > > > Thank you! > > > > > > > > > > So I hope you can reconsider and accept improvements in this patch, > > > > while Jonathan will keep working on even better final solution. > > > > Thanks! > > > > > > > > > I look forward to your formalized results :) > > > > > > > > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes > > > and USDTs, reporting attach/detach timings: > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench > > > bpf_testmod.ko is already unloaded. > > > Loading bpf_testmod.ko... > > > Successfully loaded bpf_testmod.ko. > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec > > > test_bench_attach_uprobe: attached in 0.120s > > > test_bench_attach_uprobe: detached in 0.092s > > > #400/5 uprobe_multi_test/bench_uprobe:OK > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec > > > test_bench_attach_usdt: attached in 0.124s > > > test_bench_attach_usdt: detached in 0.064s > > > #400/6 uprobe_multi_test/bench_usdt:OK > > > #400 uprobe_multi_test:OK > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED > > > Successfully unloaded bpf_testmod.ko. > > > > > > So it should be easy for Jonathan to validate his changes with this. > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > Jon. > > > > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > > > > will depend on how much lock contention happens. 
> > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > > > > > --- > > > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > > > > > */ > > > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > > > > > /* serialize uprobe->pending_list */ > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > > > > > { > > > > > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > return uprobe; > > > > > > > > > > } > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > > > > > { > > > > > > > > > > struct uprobe *u; > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > > u = __insert_uprobe(uprobe); > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > return u; > > > > > > > > > > } > > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > > > > > put_uprobe(uprobe); > > > > > > > > > > } > > > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > > if (n) { > > > > > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > > get_uprobe(u); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > - 
spin_unlock(&uprobes_treelock); > > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > return !!n; > > > > > > > > > > } > > > > > > > > > > -- > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > > > > -- > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > -- > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > OK, then I'll push this to for-next at this moment. > > > Please share if you have a good idea for the batch interface which can be > > > backported. I guess it should involve updating userspace changes too. > > > > Did you (or anyone else) need anything more from me on this one so that it > > can be pushed? I provided some benchmark numbers but happy to provide > > anything else that may be required. > > Yeah, if you can update with the result, it looks better to me. > Or, can I update the description? Sure, please feel free to update the description yourself. Jon. > > Thank you, > > > > > Thanks! > > > > Jon. > > > > > > > > Thank you! > > > > > > > > > > > > > So I hope you can reconsider and accept improvements in this patch, > > > > > while Jonathan will keep working on even better final solution. > > > > > Thanks! > > > > > > > > > > > I look forward to your formalized results :) > > > > > > > > > > > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes > > > > and USDTs, reporting attach/detach timings: > > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench > > > > bpf_testmod.ko is already unloaded. > > > > Loading bpf_testmod.ko... > > > > Successfully loaded bpf_testmod.ko. > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec > > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec > > > > test_bench_attach_uprobe: attached in 0.120s > > > > test_bench_attach_uprobe: detached in 0.092s > > > > #400/5 uprobe_multi_test/bench_uprobe:OK > > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec > > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec > > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec > > > > test_bench_attach_usdt: attached in 0.124s > > > > test_bench_attach_usdt: detached in 0.064s > > > > #400/6 uprobe_multi_test/bench_usdt:OK > > > > #400 uprobe_multi_test:OK > > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED > > > > Successfully unloaded bpf_testmod.ko. > > > > > > > > So it should be easy for Jonathan to validate his changes with this. > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > Jon. > > > > > > > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > > > > > will depend on how much lock contention happens. 
> > > > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > > > > > > --- > > > > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > > > > > > */ > > > > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > > > > > > /* serialize uprobe->pending_list */ > > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > > > > > > { > > > > > > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > > > return uprobe; > > > > > > > > > > > } > > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > > > > > > { > > > > > > > > > > > struct uprobe *u; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > > > u = __insert_uprobe(uprobe); > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > > > return u; > > > > > > > > > > > } > > > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > > > > > > put_uprobe(uprobe); > > > > > > > > > > > } > > > > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > > > if (n) { > > > > > > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct 
inode *inode, > > > > > > > > > > > get_uprobe(u); > > > > > > > > > > > } > > > > > > > > > > > } > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */ > > > > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e > > > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > > > return !!n; > > > > > > > > > > > } > > > > > > > > > > > -- > > > > > > > > > > > 2.43.0 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > > > > > > > > > > -- > > > > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > > > > > > > -- > > > Masami Hiramatsu (Google) <mhiramat@kernel.org> > > > -- > Masami Hiramatsu (Google) <mhiramat@kernel.org>
Hi Masami, > > > OK, then I'll push this to for-next at this moment. > > > Please share if you have a good idea for the batch interface which can be > > > backported. I guess it should involve updating userspace changes too. > > > > Did you (or anyone else) need anything more from me on this one so that it > > can be pushed? I provided some benchmark numbers but happy to provide > > anything else that may be required. > > Yeah, if you can update with the result, it looks better to me. > Or, can I update the description? Just checking if you need me to do anything on this so that it can be pushed to for-next? Would be really great if we can get this in. Thanks for all your help. Jon. > > Thank you, > > > > > Thanks! > > > > Jon. > > > > > > > > Thank you! > > > > > > > > > > > > > So I hope you can reconsider and accept improvements in this patch, > > > > > while Jonathan will keep working on even better final solution. > > > > > Thanks! > > > > > > > > > > > I look forward to your formalized results :) > > > > > > > > > > > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes > > > > and USDTs, reporting attach/detach timings: > > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench > > > > bpf_testmod.ko is already unloaded. > > > > Loading bpf_testmod.ko... > > > > Successfully loaded bpf_testmod.ko. > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec > > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec > > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec > > > > test_bench_attach_uprobe: attached in 0.120s > > > > test_bench_attach_uprobe: detached in 0.092s > > > > #400/5 uprobe_multi_test/bench_uprobe:OK > > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec > > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec > > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec > > > > test_bench_attach_usdt: attached in 0.124s > > > > test_bench_attach_usdt: detached in 0.064s > > > > #400/6 uprobe_multi_test/bench_usdt:OK > > > > #400 uprobe_multi_test:OK > > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED > > > > Successfully unloaded bpf_testmod.ko. > > > > > > > > So it should be easy for Jonathan to validate his changes with this. > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > Jon. > > > > > > > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > > > > > will depend on how much lock contention happens. 
> > > > > > > > > > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com> > > > > > > > > > > > --- > > > > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++----------- > > > > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > > > > > > */ > > > > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > > > > > > > > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > > > > > > /* serialize uprobe->pending_list */ > > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset) > > > > > > > > > > > { > > > > > > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > > > return uprobe; > > > > > > > > > > > } > > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > > > > > > > > > > > { > > > > > > > > > > > struct uprobe *u; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > > > u = __insert_uprobe(uprobe); > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > > > > > > > > > > > return u; > > > > > > > > > > > } > > > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > > > > > > return; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + write_lock(&uprobes_treelock); > > > > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > > > > > > put_uprobe(uprobe); > > > > > > > > > > > } > > > > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > > > > > > > > > > min = vaddr_to_offset(vma, start); > > > > > > > > > > > max = min + (end - start) - 1; > > > > > > > > > > > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > > > > > > + read_lock(&uprobes_treelock); > > > > > > > > > > > n = find_node_in_range(inode, min, max); > > > > > > > > > > > if (n) { > > > > > > > > > > > for (t = n; t; t = rb_prev(t)) { > > > > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct 
On Thu, 18 Apr 2024 12:10:59 +0100
Jonthan Haslam <jonathan.haslam@gmail.com> wrote:

> Hi Masami,
>
> > > > OK, then I'll push this to for-next at this moment.
> > > > Please share if you have a good idea for the batch interface which can be
> > > > backported. I guess it should involve updating userspace changes too.
> > >
> > > Did you (or anyone else) need anything more from me on this one so that it
> > > can be pushed? I provided some benchmark numbers but happy to provide
> > > anything else that may be required.
> >
> > Yeah, if you can update with the result, it looks better to me.
> > Or, can I update the description?
>
> Just checking if you need me to do anything on this so that it can be
> pushed to for-next? Would be really great if we can get this in. Thanks
> for all your help.

Yes, other uprobe performance improvements[1][2] have the benchmark
results, but this patch doesn't. If you update this with the benchmark
results, it is helpful to me.

[1] https://lore.kernel.org/all/20240318181728.2795838-3-andrii@kernel.org/
[2] https://lore.kernel.org/all/20240318181728.2795838-4-andrii@kernel.org/

Thank you,
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..42bf9b6e8bc0 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
  */
 #define no_uprobe_events()	RB_EMPTY_ROOT(&uprobes_tree)
 
-static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
+static DEFINE_RWLOCK(uprobes_treelock);	/* serialize rbtree access */
 
 #define UPROBES_HASH_SZ	13
 /* serialize uprobe->pending_list */
@@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
 {
 	struct uprobe *uprobe;
 
-	spin_lock(&uprobes_treelock);
+	read_lock(&uprobes_treelock);
 	uprobe = __find_uprobe(inode, offset);
-	spin_unlock(&uprobes_treelock);
+	read_unlock(&uprobes_treelock);
 
 	return uprobe;
 }
@@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 {
 	struct uprobe *u;
 
-	spin_lock(&uprobes_treelock);
+	write_lock(&uprobes_treelock);
 	u = __insert_uprobe(uprobe);
-	spin_unlock(&uprobes_treelock);
+	write_unlock(&uprobes_treelock);
 
 	return u;
 }
@@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
 	if (WARN_ON(!uprobe_is_active(uprobe)))
 		return;
 
-	spin_lock(&uprobes_treelock);
+	write_lock(&uprobes_treelock);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
-	spin_unlock(&uprobes_treelock);
+	write_unlock(&uprobes_treelock);
 	RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
 	put_uprobe(uprobe);
 }
@@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	spin_lock(&uprobes_treelock);
+	read_lock(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
 	if (n) {
 		for (t = n; t; t = rb_prev(t)) {
@@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
 			get_uprobe(u);
 		}
 	}
-	spin_unlock(&uprobes_treelock);
+	read_unlock(&uprobes_treelock);
 }
 
 /* @vma contains reference counter, not the probed instruction. */
@@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
 	min = vaddr_to_offset(vma, start);
 	max = min + (end - start) - 1;
 
-	spin_lock(&uprobes_treelock);
+	read_lock(&uprobes_treelock);
 	n = find_node_in_range(inode, min, max);
-	spin_unlock(&uprobes_treelock);
+	read_unlock(&uprobes_treelock);
 
 	return !!n;
 }
Active uprobes are stored in an RB tree and accesses to this tree are
dominated by read operations. Currently these accesses are serialized by
a spinlock but this leads to enormous contention when large numbers of
threads are executing active probes.

This patch converts the spinlock used to serialize access to the
uprobes_tree RB tree into a reader-writer spinlock. This lock type
aligns naturally with the overwhelmingly read-only nature of the tree
usage here. Although the addition of reader-writer spinlocks is
discouraged [0], this fix is proposed as an interim solution while an
RCU based approach is implemented (that work is in a nascent form). This
fix also has the benefit of being trivial, self-contained and therefore
simple to backport.

This change has been tested against production workloads that exhibit
significant contention on the spinlock, and an almost order of magnitude
reduction in mean uprobe execution time is observed (28 -> 3.5 microsecs).

[0] https://docs.kernel.org/locking/spinlocks.html

Signed-off-by: Jonathan Haslam <jonathan.haslam@gmail.com>
---
 kernel/events/uprobes.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)
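
[Editor's note] For readers outside the uprobes code, the locking pattern the
patch adopts can be summarized in a minimal, self-contained sketch. This is
not code from uprobes.c: the identifiers (example_tree, example_lock,
example_tree_has_entries, example_tree_erase) are hypothetical stand-ins, and
only the lock API mirrors the actual change. Lookups nest under read_lock()
and may run concurrently with each other, while insert/erase paths take
write_lock() and remain fully serialized.

/*
 * Minimal sketch of the read-mostly rwlock pattern (hypothetical names,
 * not the uprobes code itself).
 */
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/types.h>

static struct rb_root example_tree = RB_ROOT;
static DEFINE_RWLOCK(example_lock);	/* previously DEFINE_SPINLOCK() */

/* Hot path: any number of readers can hold the lock at the same time. */
static bool example_tree_has_entries(void)
{
	bool found;

	read_lock(&example_lock);
	found = !RB_EMPTY_ROOT(&example_tree);
	read_unlock(&example_lock);

	return found;
}

/* Slow path: writers (register/unregister) still exclude all other users. */
static void example_tree_erase(struct rb_node *node)
{
	write_lock(&example_lock);
	rb_erase(node, &example_tree);
	write_unlock(&example_lock);
}

The write side is no cheaper than before; only concurrent readers benefit,
which matches the "overwhelmingly read-only" tree usage described in the
changelog above.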