Message ID | 20240805202803.1813090-1-andrii@kernel.org |
---|---|
State | Not Applicable |
Series | uprobes: get rid of bogus trace_uprobe hit counter |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> trace_uprobe->nhit counter is not incremented atomically, so its value
> is bogus in practice. On the other hand, it's actually a pretty big
> uprobe scalability problem due to heavy cache line bouncing between CPUs
> triggering the same uprobe.

so you're seeing that in the benchmark, right? I'm curious how bad
the numbers are

>
> Drop it and emit obviously unrealistic value in its stead in
> uporbe_profiler seq file.
>
> The alternative would be allocating per-CPU counter, but I'm not sure
> it's justified.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/trace/trace_uprobe.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 52e76a73fa7c..5d38207db479 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -62,7 +62,6 @@ struct trace_uprobe {
>      struct uprobe *uprobe;
>      unsigned long offset;
>      unsigned long ref_ctr_offset;
> -    unsigned long nhit;
>      struct trace_probe tp;
>  };
>
> @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
>
>      tu = to_trace_uprobe(ev);
>      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> -            trace_probe_name(&tu->tp), tu->nhit);
> +            trace_probe_name(&tu->tp), ULONG_MAX);

seems harsh.. would it be that bad to create per cpu counter for that?

jirka

>      return 0;
>  }
>
> @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
>      int ret = 0;
>
>      tu = container_of(con, struct trace_uprobe, consumer);
> -    tu->nhit++;
>
>      udd.tu = tu;
>      udd.bp_addr = instruction_pointer(regs);
> --
> 2.43.5
>
On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > trace_uprobe->nhit counter is not incremented atomically, so its value
> > is bogus in practice. On the other hand, it's actually a pretty big
> > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > triggering the same uprobe.
>
> so you're seeing that in the benchmark, right? I'm curious how bad
> the numbers are
>

Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
work), this one was the last limiter to linear scalability.

With this counter, I was topping out at about 12 mln/s uprobe
triggering (I think it was 32 CPUs, but I don't remember exactly now).
About 30% of CPU cycles were spent in this increment.

But those 30% don't paint the full picture. Once the counter is
removed, the same uprobe throughput jumps to 62 mln/s or so. So we
definitely have to do something about it.

> >
> > Drop it and emit obviously unrealistic value in its stead in
> > uporbe_profiler seq file.
> >
> > The alternative would be allocating per-CPU counter, but I'm not sure
> > it's justified.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/trace/trace_uprobe.c | 4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index 52e76a73fa7c..5d38207db479 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -62,7 +62,6 @@ struct trace_uprobe {
> >      struct uprobe *uprobe;
> >      unsigned long offset;
> >      unsigned long ref_ctr_offset;
> > -    unsigned long nhit;
> >      struct trace_probe tp;
> >  };
> >
> > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> >
> >      tu = to_trace_uprobe(ev);
> >      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> > -            trace_probe_name(&tu->tp), tu->nhit);
> > +            trace_probe_name(&tu->tp), ULONG_MAX);
>
> seems harsh.. would it be that bad to create per cpu counter for that?

Well, consider this patch a conversation starter. There are two
reasons why I'm removing the counter instead of doing per-CPU one:

- it's less work to send out a patch pointing out the problem (but
the solution might change)
- this counter was never correct in the presence of multiple
threads, so I'm not sure how useful it is.

Yes, I think we can do per-CPU counters, but do we want to pay the
memory price? That's what I want to get from Masami, Steven, or Peter
(whoever cares enough).

>
> jirka
>
> >      return 0;
> >  }
> >
> > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> >      int ret = 0;
> >
> >      tu = container_of(con, struct trace_uprobe, consumer);
> > -    tu->nhit++;
> >
> >      udd.tu = tu;
> >      udd.bp_addr = instruction_pointer(regs);
> > --
> > 2.43.5
> >
On Tue, 6 Aug 2024 10:26:25 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > > trace_uprobe->nhit counter is not incremented atomically, so its value
> > > is bogus in practice. On the other hand, it's actually a pretty big
> > > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > > triggering the same uprobe.
> >
> > so you're seeing that in the benchmark, right? I'm curious how bad
> > the numbers are
> >
>
> Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
> work), this one was the last limiter to linear scalability.
>
> With this counter, I was topping out at about 12 mln/s uprobe
> triggering (I think it was 32 CPUs, but I don't remember exactly now).
> About 30% of CPU cycles were spent in this increment.
>
> But those 30% don't paint the full picture. Once the counter is
> removed, the same uprobe throughput jumps to 62 mln/s or so. So we
> definitely have to do something about it.
>
> > >
> > > Drop it and emit obviously unrealistic value in its stead in
> > > uporbe_profiler seq file.
> > >
> > > The alternative would be allocating per-CPU counter, but I'm not sure
> > > it's justified.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  kernel/trace/trace_uprobe.c | 4 +---
> > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > > index 52e76a73fa7c..5d38207db479 100644
> > > --- a/kernel/trace/trace_uprobe.c
> > > +++ b/kernel/trace/trace_uprobe.c
> > > @@ -62,7 +62,6 @@ struct trace_uprobe {
> > >      struct uprobe *uprobe;
> > >      unsigned long offset;
> > >      unsigned long ref_ctr_offset;
> > > -    unsigned long nhit;
> > >      struct trace_probe tp;
> > >  };
> > >
> > > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> > >
> > >      tu = to_trace_uprobe(ev);
> > >      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> > > -            trace_probe_name(&tu->tp), tu->nhit);
> > > +            trace_probe_name(&tu->tp), ULONG_MAX);
> >
> > seems harsh.. would it be that bad to create per cpu counter for that?
>
> Well, consider this patch a conversation starter. There are two
> reasons why I'm removing the counter instead of doing per-CPU one:
>
> - it's less work to send out a patch pointing out the problem (but
> the solution might change)
> - this counter was never correct in the presence of multiple
> threads, so I'm not sure how useful it is.
>
> Yes, I think we can do per-CPU counters, but do we want to pay the
> memory price? That's what I want to get from Masami, Steven, or Peter
> (whoever cares enough).

I would like to make it per-cpu counter *and* make it kconfig optional.
Or just remove with the file (but it changes the user interface without
option).

For the kprobes, the profile file is useful because it shows "missed"
counter. This tells user whether your trace data drops some events or not.
But if uprobes profile only shows the number of hit, we can use the
histogram trigger if needed.

Thank you,

> >
> > jirka
> >
> > >      return 0;
> > >  }
> > >
> > > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> > >      int ret = 0;
> > >
> > >      tu = container_of(con, struct trace_uprobe, consumer);
> > > -    tu->nhit++;
> > >
> > >      udd.tu = tu;
> > >      udd.bp_addr = instruction_pointer(regs);
> > > --
> > > 2.43.5
> > >
On Wed, Aug 7, 2024 at 2:44 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Tue, 6 Aug 2024 10:26:25 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > > > trace_uprobe->nhit counter is not incremented atomically, so its value
> > > > is bogus in practice. On the other hand, it's actually a pretty big
> > > > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > > > triggering the same uprobe.
> > >
> > > so you're seeing that in the benchmark, right? I'm curious how bad
> > > the numbers are
> > >
> >
> > Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
> > work), this one was the last limiter to linear scalability.
> >
> > With this counter, I was topping out at about 12 mln/s uprobe
> > triggering (I think it was 32 CPUs, but I don't remember exactly now).
> > About 30% of CPU cycles were spent in this increment.
> >
> > But those 30% don't paint the full picture. Once the counter is
> > removed, the same uprobe throughput jumps to 62 mln/s or so. So we
> > definitely have to do something about it.
> >
> > > >
> > > > Drop it and emit obviously unrealistic value in its stead in
> > > > uporbe_profiler seq file.
> > > >
> > > > The alternative would be allocating per-CPU counter, but I'm not sure
> > > > it's justified.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  kernel/trace/trace_uprobe.c | 4 +---
> > > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > > >
> > > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > > > index 52e76a73fa7c..5d38207db479 100644
> > > > --- a/kernel/trace/trace_uprobe.c
> > > > +++ b/kernel/trace/trace_uprobe.c
> > > > @@ -62,7 +62,6 @@ struct trace_uprobe {
> > > >      struct uprobe *uprobe;
> > > >      unsigned long offset;
> > > >      unsigned long ref_ctr_offset;
> > > > -    unsigned long nhit;
> > > >      struct trace_probe tp;
> > > >  };
> > > >
> > > > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> > > >
> > > >      tu = to_trace_uprobe(ev);
> > > >      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> > > > -            trace_probe_name(&tu->tp), tu->nhit);
> > > > +            trace_probe_name(&tu->tp), ULONG_MAX);
> > >
> > > seems harsh.. would it be that bad to create per cpu counter for that?
> >
> > Well, consider this patch a conversation starter. There are two
> > reasons why I'm removing the counter instead of doing per-CPU one:
> >
> > - it's less work to send out a patch pointing out the problem (but
> > the solution might change)
> > - this counter was never correct in the presence of multiple
> > threads, so I'm not sure how useful it is.
> >
> > Yes, I think we can do per-CPU counters, but do we want to pay the
> > memory price? That's what I want to get from Masami, Steven, or Peter
> > (whoever cares enough).
>
> I would like to make it per-cpu counter *and* make it kconfig optional.
> Or just remove with the file (but it changes the user interface without
> option).
>
> For the kprobes, the profile file is useful because it shows "missed"
> counter. This tells user whether your trace data drops some events or not.
> But if uprobes profile only shows the number of hit, we can use the
> histogram trigger if needed.
>

I really don't like extra Kconfig options for something like this. So
I'll just go ahead and switch this to per-CPU counters unconditionally.

For BPF, this is unnecessary memory overhead, but as we shift to
multi-uprobe (which doesn't go through trace_uprobe logic), it
probably isn't a big deal over the longer term.

> Thank you,
>
> > >
> > > jirka
> > >
> > > >      return 0;
> > > >  }
> > > >
> > > > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> > > >      int ret = 0;
> > > >
> > > >      tu = container_of(con, struct trace_uprobe, consumer);
> > > > -    tu->nhit++;
> > > >
> > > >      udd.tu = tu;
> > > >      udd.bp_addr = instruction_pointer(regs);
> > > > --
> > > > 2.43.5
> > > >
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
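For reference, below is a minimal sketch of the per-CPU counter approach discussed in the thread. It is an illustration under stated assumptions, not the actual follow-up patch: the `nhits` field name, the `trace_uprobe_nhit()` helper, and the placement of the allocation/teardown code are all hypothetical.

```c
/* Sketch only (assumed names, not the eventual patch): a per-CPU hit
 * counter for trace_uprobe. Requires <linux/percpu.h>.
 */
struct trace_uprobe {
	struct uprobe *uprobe;
	unsigned long offset;
	unsigned long ref_ctr_offset;
	unsigned long __percpu *nhits;	/* replaces: unsigned long nhit; */
	struct trace_probe tp;
};

/* When the trace_uprobe is created (e.g. in its alloc path): */
tu->nhits = alloc_percpu(unsigned long);
if (!tu->nhits)
	return ERR_PTR(-ENOMEM);
/* ... and free_percpu(tu->nhits) on the corresponding free path. */

/* Hot path (uprobe_dispatcher()): each CPU bumps its own cache line,
 * so there is no cross-CPU cache line bouncing on the counter.
 */
this_cpu_inc(*tu->nhits);

/* Readout (probes_profile_seq_show()): sum over all CPUs. */
static unsigned long trace_uprobe_nhit(struct trace_uprobe *tu)
{
	unsigned long nhit = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		nhit += *per_cpu_ptr(tu->nhits, cpu);

	return nhit;
}
```

The design trade-off is exactly the one debated above: the per-uprobe memory cost of one counter per possible CPU, in exchange for an uncontended increment in the hot path and a cross-CPU sum that is only paid when someone reads the profile file.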
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 52e76a73fa7c..5d38207db479 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -62,7 +62,6 @@ struct trace_uprobe {
     struct uprobe *uprobe;
     unsigned long offset;
     unsigned long ref_ctr_offset;
-    unsigned long nhit;
     struct trace_probe tp;
 };
 
@@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
 
     tu = to_trace_uprobe(ev);
     seq_printf(m, " %s %-44s %15lu\n", tu->filename,
-            trace_probe_name(&tu->tp), tu->nhit);
+            trace_probe_name(&tu->tp), ULONG_MAX);
     return 0;
 }
 
@@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
     int ret = 0;
 
     tu = container_of(con, struct trace_uprobe, consumer);
-    tu->nhit++;
 
     udd.tu = tu;
     udd.bp_addr = instruction_pointer(regs);
trace_uprobe->nhit counter is not incremented atomically, so its value
is bogus in practice. On top of that, it's actually a pretty big uprobe
scalability problem due to heavy cache line bouncing between CPUs
triggering the same uprobe.

Drop it and emit an obviously unrealistic value in its stead in the
uprobe_profile seq file.

The alternative would be allocating a per-CPU counter, but I'm not sure
it's justified.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/trace_uprobe.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
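To illustrate the "not incremented atomically" point, here is a small standalone userspace sketch (an assumed example, not kernel code; thread and iteration counts are arbitrary) showing how a plain `++` on a shared counter from multiple threads loses updates, which is the sense in which a shared non-atomic `nhit` ends up bogus:

```c
/* counter_race.c -- userspace illustration only.
 * Build: gcc -O2 -pthread counter_race.c -o counter_race
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   1000000UL

static volatile unsigned long plain_counter;	/* racy, like tu->nhit++ */
static unsigned long atomic_counter;		/* atomic reference count */

static void *worker(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < NITERS; i++) {
		plain_counter++;	/* non-atomic load/add/store: updates get lost */
		__atomic_fetch_add(&atomic_counter, 1, __ATOMIC_RELAXED);
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);

	/* plain_counter typically falls far short of the expected total */
	printf("expected %lu  plain %lu  atomic %lu\n",
	       NTHREADS * NITERS, plain_counter, atomic_counter);
	return 0;
}
```

On a multi-core machine the plain counter usually ends up well below NTHREADS * NITERS, and in the kernel case the same shared-cacheline traffic is also what costs the throughput described in the thread above.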