Message ID | 20240805202803.1813090-1-andrii@kernel.org |
---|---|
State | Not Applicable |
Series | uprobes: get rid of bogus trace_uprobe hit counter |
Context | Check | Description |
---|---|---|
netdev/tree_selection | success | Not a local patch |
On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> trace_uprobe->nhit counter is not incremented atomically, so its value
> is bogus in practice. On the other hand, it's actually a pretty big
> uprobe scalability problem due to heavy cache line bouncing between CPUs
> triggering the same uprobe.

so you're seeing that in the benchmark, right? I'm curious how bad
the numbers are

>
> Drop it and emit obviously unrealistic value in its stead in
> uporbe_profiler seq file.
>
> The alternative would be allocating per-CPU counter, but I'm not sure
> it's justified.
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/trace/trace_uprobe.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index 52e76a73fa7c..5d38207db479 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -62,7 +62,6 @@ struct trace_uprobe {
>      struct uprobe *uprobe;
>      unsigned long offset;
>      unsigned long ref_ctr_offset;
> -    unsigned long nhit;
>      struct trace_probe tp;
>  };
>
> @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
>
>      tu = to_trace_uprobe(ev);
>      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> -            trace_probe_name(&tu->tp), tu->nhit);
> +            trace_probe_name(&tu->tp), ULONG_MAX);

seems harsh.. would it be that bad to create per cpu counter for that?

jirka

>      return 0;
>  }
>
> @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
>      int ret = 0;
>
>      tu = container_of(con, struct trace_uprobe, consumer);
> -    tu->nhit++;
>
>      udd.tu = tu;
>      udd.bp_addr = instruction_pointer(regs);
> --
> 2.43.5
>
On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > trace_uprobe->nhit counter is not incremented atomically, so its value
> > is bogus in practice. On the other hand, it's actually a pretty big
> > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > triggering the same uprobe.
>
> so you're seeing that in the benchmark, right? I'm curious how bad
> the numbers are
>

Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
work), this one was the last limiter to linear scalability.

With this counter, I was topping out at about 12 mln/s uprobe
triggering (I think it was 32 CPUs, but I don't remember exactly now).
About 30% of CPU cycles were spent in this increment.

But those 30% don't paint the full picture. Once the counter is
removed, the same uprobe throughput jumps to 62 mln/s or so. So we
definitely have to do something about it.

> >
> > Drop it and emit obviously unrealistic value in its stead in
> > uporbe_profiler seq file.
> >
> > The alternative would be allocating per-CPU counter, but I'm not sure
> > it's justified.
> >
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > ---
> >  kernel/trace/trace_uprobe.c | 4 +---
> >  1 file changed, 1 insertion(+), 3 deletions(-)
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index 52e76a73fa7c..5d38207db479 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -62,7 +62,6 @@ struct trace_uprobe {
> >      struct uprobe *uprobe;
> >      unsigned long offset;
> >      unsigned long ref_ctr_offset;
> > -    unsigned long nhit;
> >      struct trace_probe tp;
> >  };
> >
> > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> >
> >      tu = to_trace_uprobe(ev);
> >      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> > -            trace_probe_name(&tu->tp), tu->nhit);
> > +            trace_probe_name(&tu->tp), ULONG_MAX);
>
> seems harsh.. would it be that bad to create per cpu counter for that?

Well, consider this patch a conversation starter. There are two
reasons why I'm removing the counter instead of doing per-CPU one:

- it's less work to send out a patch pointing out the problem (but
the solution might change)
- this counter was never correct in the presence of multiple
threads, so I'm not sure how useful it is.

Yes, I think we can do per-CPU counters, but do we want to pay the
memory price? That's what I want to get from Masami, Steven, or Peter
(whoever cares enough).

>
> jirka
>
> >      return 0;
> >  }
> >
> > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> >      int ret = 0;
> >
> >      tu = container_of(con, struct trace_uprobe, consumer);
> > -    tu->nhit++;
> >
> >      udd.tu = tu;
> >      udd.bp_addr = instruction_pointer(regs);
> > --
> > 2.43.5
> >
On Tue, 6 Aug 2024 10:26:25 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > > trace_uprobe->nhit counter is not incremented atomically, so its value
> > > is bogus in practice. On the other hand, it's actually a pretty big
> > > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > > triggering the same uprobe.
> >
> > so you're seeing that in the benchmark, right? I'm curious how bad
> > the numbers are
> >
>
> Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
> work), this one was the last limiter to linear scalability.
>
> With this counter, I was topping out at about 12 mln/s uprobe
> triggering (I think it was 32 CPUs, but I don't remember exactly now).
> About 30% of CPU cycles were spent in this increment.
>
> But those 30% don't paint the full picture. Once the counter is
> removed, the same uprobe throughput jumps to 62 mln/s or so. So we
> definitely have to do something about it.
>
> > >
> > > Drop it and emit obviously unrealistic value in its stead in
> > > uporbe_profiler seq file.
> > >
> > > The alternative would be allocating per-CPU counter, but I'm not sure
> > > it's justified.
> > >
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > >  kernel/trace/trace_uprobe.c | 4 +---
> > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > >
> > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > > index 52e76a73fa7c..5d38207db479 100644
> > > --- a/kernel/trace/trace_uprobe.c
> > > +++ b/kernel/trace/trace_uprobe.c
> > > @@ -62,7 +62,6 @@ struct trace_uprobe {
> > >      struct uprobe *uprobe;
> > >      unsigned long offset;
> > >      unsigned long ref_ctr_offset;
> > > -    unsigned long nhit;
> > >      struct trace_probe tp;
> > >  };
> > >
> > > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> > >
> > >      tu = to_trace_uprobe(ev);
> > >      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> > > -            trace_probe_name(&tu->tp), tu->nhit);
> > > +            trace_probe_name(&tu->tp), ULONG_MAX);
> >
> > seems harsh.. would it be that bad to create per cpu counter for that?
>
> Well, consider this patch a conversation starter. There are two
> reasons why I'm removing the counter instead of doing per-CPU one:
>
> - it's less work to send out a patch pointing out the problem (but
> the solution might change)
> - this counter was never correct in the presence of multiple
> threads, so I'm not sure how useful it is.
>
> Yes, I think we can do per-CPU counters, but do we want to pay the
> memory price? That's what I want to get from Masami, Steven, or Peter
> (whoever cares enough).

I would like to make it per-cpu counter *and* make it kconfig optional.
Or just remove with the file (but it changes the user interface without
option).

For the kprobes, the profile file is useful because it shows "missed"
counter. This tells user whether your trace data drops some events or not.
But if uprobes profile only shows the number of hit, we can use the
histogram trigger if needed.

Thank you,

> >
> > jirka
> >
> > >      return 0;
> > >  }
> > >
> > > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> > >      int ret = 0;
> > >
> > >      tu = container_of(con, struct trace_uprobe, consumer);
> > > -    tu->nhit++;
> > >
> > >      udd.tu = tu;
> > >      udd.bp_addr = instruction_pointer(regs);
> > > --
> > > 2.43.5
> > >
On Wed, Aug 7, 2024 at 2:44 PM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Tue, 6 Aug 2024 10:26:25 -0700
> Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> > On Tue, Aug 6, 2024 at 12:37 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Mon, Aug 05, 2024 at 01:28:03PM -0700, Andrii Nakryiko wrote:
> > > > trace_uprobe->nhit counter is not incremented atomically, so its value
> > > > is bogus in practice. On the other hand, it's actually a pretty big
> > > > uprobe scalability problem due to heavy cache line bouncing between CPUs
> > > > triggering the same uprobe.
> > >
> > > so you're seeing that in the benchmark, right? I'm curious how bad
> > > the numbers are
> > >
> >
> > Yes. So, once we get rid of all the uprobe/uretprobe/mm locks (ongoing
> > work), this one was the last limiter to linear scalability.
> >
> > With this counter, I was topping out at about 12 mln/s uprobe
> > triggering (I think it was 32 CPUs, but I don't remember exactly now).
> > About 30% of CPU cycles were spent in this increment.
> >
> > But those 30% don't paint the full picture. Once the counter is
> > removed, the same uprobe throughput jumps to 62 mln/s or so. So we
> > definitely have to do something about it.
> >
> > > >
> > > > Drop it and emit obviously unrealistic value in its stead in
> > > > uporbe_profiler seq file.
> > > >
> > > > The alternative would be allocating per-CPU counter, but I'm not sure
> > > > it's justified.
> > > >
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > > ---
> > > >  kernel/trace/trace_uprobe.c | 4 +---
> > > >  1 file changed, 1 insertion(+), 3 deletions(-)
> > > >
> > > > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > > > index 52e76a73fa7c..5d38207db479 100644
> > > > --- a/kernel/trace/trace_uprobe.c
> > > > +++ b/kernel/trace/trace_uprobe.c
> > > > @@ -62,7 +62,6 @@ struct trace_uprobe {
> > > >      struct uprobe *uprobe;
> > > >      unsigned long offset;
> > > >      unsigned long ref_ctr_offset;
> > > > -    unsigned long nhit;
> > > >      struct trace_probe tp;
> > > >  };
> > > >
> > > > @@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
> > > >
> > > >      tu = to_trace_uprobe(ev);
> > > >      seq_printf(m, " %s %-44s %15lu\n", tu->filename,
> > > > -            trace_probe_name(&tu->tp), tu->nhit);
> > > > +            trace_probe_name(&tu->tp), ULONG_MAX);
> > >
> > > seems harsh.. would it be that bad to create per cpu counter for that?
> >
> > Well, consider this patch a conversation starter. There are two
> > reasons why I'm removing the counter instead of doing per-CPU one:
> >
> > - it's less work to send out a patch pointing out the problem (but
> > the solution might change)
> > - this counter was never correct in the presence of multiple
> > threads, so I'm not sure how useful it is.
> >
> > Yes, I think we can do per-CPU counters, but do we want to pay the
> > memory price? That's what I want to get from Masami, Steven, or Peter
> > (whoever cares enough).
>
> I would like to make it per-cpu counter *and* make it kconfig optional.
> Or just remove with the file (but it changes the user interface without
> option).
>
> For the kprobes, the profile file is useful because it shows "missed"
> counter. This tells user whether your trace data drops some events or not.
> But if uprobes profile only shows the number of hit, we can use the
> histogram trigger if needed.
>

I really don't like extra Kconfig options for something like this. So
I'll just go ahead and switch this to per-CPU counters unconditionally.

For BPF, this is unnecessary memory overhead, but as we shift to
multi-uprobe (which doesn't go through trace_uprobe logic), it
probably isn't a big deal over the longer term.

> Thank you,
>
> > >
> > > jirka
> > >
> > > >      return 0;
> > > >  }
> > > >
> > > > @@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
> > > >      int ret = 0;
> > > >
> > > >      tu = container_of(con, struct trace_uprobe, consumer);
> > > > -    tu->nhit++;
> > > >
> > > >      udd.tu = tu;
> > > >      udd.bp_addr = instruction_pointer(regs);
> > > > --
> > > > 2.43.5
> > > >
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
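For reference, below is a minimal sketch of the per-CPU counter approach discussed in the thread. It is an illustration under stated assumptions, not the actual follow-up patch: the `nhits` field name, the `trace_uprobe_nhit()` helper, and the placement of the allocation/teardown code are all hypothetical.

```c
/* Sketch only (assumed names, not the eventual patch): a per-CPU hit
 * counter for trace_uprobe. Requires <linux/percpu.h>.
 */
struct trace_uprobe {
	struct uprobe *uprobe;
	unsigned long offset;
	unsigned long ref_ctr_offset;
	unsigned long __percpu *nhits;	/* replaces: unsigned long nhit; */
	struct trace_probe tp;
};

/* When the trace_uprobe is created (e.g. in its alloc path): */
tu->nhits = alloc_percpu(unsigned long);
if (!tu->nhits)
	return ERR_PTR(-ENOMEM);
/* ... and free_percpu(tu->nhits) on the corresponding free path. */

/* Hot path (uprobe_dispatcher()): each CPU bumps its own cache line,
 * so there is no cross-CPU cache line bouncing on the counter.
 */
this_cpu_inc(*tu->nhits);

/* Readout (probes_profile_seq_show()): sum over all CPUs. */
static unsigned long trace_uprobe_nhit(struct trace_uprobe *tu)
{
	unsigned long nhit = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		nhit += *per_cpu_ptr(tu->nhits, cpu);

	return nhit;
}
```

The design trade-off is exactly the one debated above: the per-uprobe memory cost of one counter per possible CPU, in exchange for an uncontended increment in the hot path and a cross-CPU sum that is only paid when someone reads the profile file.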
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 52e76a73fa7c..5d38207db479 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -62,7 +62,6 @@ struct trace_uprobe {
     struct uprobe *uprobe;
     unsigned long offset;
     unsigned long ref_ctr_offset;
-    unsigned long nhit;
     struct trace_probe tp;
 };
 
@@ -821,7 +820,7 @@ static int probes_profile_seq_show(struct seq_file *m, void *v)
 
     tu = to_trace_uprobe(ev);
     seq_printf(m, " %s %-44s %15lu\n", tu->filename,
-            trace_probe_name(&tu->tp), tu->nhit);
+            trace_probe_name(&tu->tp), ULONG_MAX);
     return 0;
 }
 
@@ -1507,7 +1506,6 @@ static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
     int ret = 0;
 
     tu = container_of(con, struct trace_uprobe, consumer);
-    tu->nhit++;
 
     udd.tu = tu;
     udd.bp_addr = instruction_pointer(regs);
trace_uprobe->nhit counter is not incremented atomically, so its value
is bogus in practice. On top of that, it's actually a pretty big uprobe
scalability problem due to heavy cache line bouncing between CPUs
triggering the same uprobe.

Drop it and emit an obviously unrealistic value in its stead in the
uprobe_profile seq file.

The alternative would be allocating a per-CPU counter, but I'm not sure
it's justified.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/trace_uprobe.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)
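To illustrate the "not incremented atomically" point, here is a small standalone userspace sketch (an assumed example, not kernel code; thread and iteration counts are arbitrary) showing how a plain `++` on a shared counter from multiple threads loses updates, which is the sense in which a shared non-atomic `nhit` ends up bogus:

```c
/* counter_race.c -- userspace illustration only.
 * Build: gcc -O2 -pthread counter_race.c -o counter_race
 */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8
#define NITERS   1000000UL

static volatile unsigned long plain_counter;	/* racy, like tu->nhit++ */
static unsigned long atomic_counter;		/* atomic reference count */

static void *worker(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < NITERS; i++) {
		plain_counter++;	/* non-atomic load/add/store: updates get lost */
		__atomic_fetch_add(&atomic_counter, 1, __ATOMIC_RELAXED);
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);

	/* plain_counter typically falls far short of the expected total */
	printf("expected %lu  plain %lu  atomic %lu\n",
	       NTHREADS * NITERS, plain_counter, atomic_counter);
	return 0;
}
```

On a multi-core machine the plain counter usually ends up well below NTHREADS * NITERS, and in the kernel case the same shared-cacheline traffic is also what costs the throughput described in the thread above.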