
[v2] tracing/uprobe: Add missing PID filter for uretprobe

Message ID: ME0P300MB0416034322B9915ECD3888649D882@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM (mailing list archive)
State: Changes Requested
Delegated to: Masami Hiramatsu
Series: [v2] tracing/uprobe: Add missing PID filter for uretprobe

Commit Message

Tianyi Liu Aug. 23, 2024, 1:53 p.m. UTC
U(ret)probes are designed to be filterable by PID, which is the
second parameter of the perf_event_open syscall. Currently, uprobe honors
this filtering, but uretprobe is not affected by it. This often means
users are disturbed by events from processes they have no interest in
while using uretprobe.

We found that the filter function was not invoked when uretprobe was
initially implemented, and this has been the case for ten years. We have
tested this patch under our workload, binding eBPF programs to uretprobe
tracepoints, and confirmed that it resolves our problem.

Following are the steps to reproduce the issue:

Step 1. Compile the following reproducer program:
```
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    printf("pid: %d\n", getpid());
    while (1) {
        sleep(2);
        void *ptr = malloc(1024);
        free(ptr);
    }
}
```
We will then use uretprobe to trace the `malloc` function.

Step 2. Run two instances of the reproducer program and record their PIDs.

Step 3. Use uretprobe to trace each of the two running reproducers
separately. We use bpftrace to make it easier to reproduce. Please run two
instances of bpftrace simultaneously: the first instance filters events
from PID1, and the second instance filters events from PID2.

The expected behavior is that each bpftrace instance would only print
events matching its respective PID filter. However, in practice, both
bpftrace instances receive events from both processes; the PID filter is
ineffective:

Before:
```
PID1=55256
bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
Attaching 1 probe...
time=0 pid=55256
time=2 pid=55273
time=2 pid=55256
time=4 pid=55273
time=4 pid=55256
time=6 pid=55273
time=6 pid=55256

PID2=55273
bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
Attaching 1 probe...
time=0 pid=55273
time=0 pid=55256
time=2 pid=55273
time=2 pid=55256
time=4 pid=55273
time=4 pid=55256
time=6 pid=55273
time=6 pid=55256
```

After: Both bpftrace instances will show the expected behavior, only
printing events from the PID specified by their respective filters:
```
PID1=1621
bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
Attaching 1 probe...
time=0 pid=1621
time=2 pid=1621
time=4 pid=1621
time=6 pid=1621

PID2=1633
bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
Attaching 1 probe...
time=0 pid=1633
time=2 pid=1633
time=4 pid=1633
time=6 pid=1633
```

Fixes: c1ae5c75e103 ("uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()")
Cc: Alban Crequy <albancrequy@linux.microsoft.com>
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Tianyi Liu <i.pear@outlook.com>
---
Changes in v2:
- Drop cover letter and update commit message.
- Link to v1: https://lore.kernel.org/linux-trace-kernel/ME0P300MB04166144CDF92A72B9E1BAEA9D8F2@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM/
---
 kernel/trace/trace_uprobe.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Masami Hiramatsu (Google) Aug. 23, 2024, 5:44 p.m. UTC | #1
On Fri, 23 Aug 2024 21:53:00 +0800
Tianyi Liu <i.pear@outlook.com> wrote:

> U(ret)probes are designed to be filterable using the PID, which is the
> second parameter in the perf_event_open syscall. Currently, uprobe works
> well with the filtering, but uretprobe is not affected by it. This often
> leads to users being disturbed by events from uninterested processes while
> using uretprobe.
> 
> We found that the filter function was not invoked when uretprobe was
> initially implemented, and this has been existing for ten years. We have
> tested the patch under our workload, binding eBPF programs to uretprobe
> tracepoints, and confirmed that it resolved our problem.

Is this an eBPF-related problem? It seems perf record would also be affected.
Let me try.


> 
> Following are the steps to reproduce the issue:
> 
> Step 1. Compile the following reproducer program:
> ```
> 
> int main() {
>     printf("pid: %d\n", getpid());
>     while (1) {
>         sleep(2);
>         void *ptr = malloc(1024);
>         free(ptr);
>     }
> }
> ```
> We will then use uretprobe to trace the `malloc` function.

OK, and run perf probe to add an event on malloc's return.

$ sudo ~/bin/perf probe -x ./malloc-run --add malloc%return  
Added new event:
  probe_malloc:malloc__return (on malloc%return in /home/mhiramat/ksrc/linux/malloc-run)

You can now use it in all perf tools, such as:

	perf record -e probe_malloc:malloc__return -aR sleep 1

> 
> Step 2. Run two instances of the reproducer program and record their PIDs.

$ ./malloc-run &  ./malloc-run &
[1] 93927
[2] 93928
pid: 93927
pid: 93928

And trace one of them;

$ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]

And dump the data;

$ sudo ~/bin/perf script
      malloc-run   93928 [004] 351736.730649:       raw_syscalls:sys_exit: NR 230 = 0
      malloc-run   93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
      malloc-run   93928 [004] 351736.730696:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
      malloc-run   93928 [004] 351738.730857:       raw_syscalls:sys_exit: NR 230 = 0
      malloc-run   93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
      malloc-run   93928 [004] 351738.730883:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
      malloc-run   93928 [004] 351740.731110:       raw_syscalls:sys_exit: NR 230 = 0
      malloc-run   93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
      malloc-run   93928 [004] 351740.731127:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,

Hmm, it seems to trace only one pid's data (without this change).
If this changes eBPF behavior, I would like to involve the eBPF people to ask
whether this is OK. From the viewpoint of the perf tool, the current code works.

But I agree that the current code is a bit strange. Oleg, do you know anything?

Thank you,

> 
> Step 3. Use uretprobe to trace each of the two running reproducers
> separately. We use bpftrace to make it easier to reproduce. Please run two
> instances of bpftrace simultaneously: the first instance filters events
> from PID1, and the second instance filters events from PID2.
> 
> The expected behavior is that each bpftrace instance would only print
> events matching its respective PID filter. However, in practice, both
> bpftrace instances receive events from both processes, the PID filter is
> ineffective at this moment:
> 
> Before:
> ```
> PID1=55256
> bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=55256
> time=2 pid=55273
> time=2 pid=55256
> time=4 pid=55273
> time=4 pid=55256
> time=6 pid=55273
> time=6 pid=55256
> 
> PID2=55273
> bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=55273
> time=0 pid=55256
> time=2 pid=55273
> time=2 pid=55256
> time=4 pid=55273
> time=4 pid=55256
> time=6 pid=55273
> time=6 pid=55256
> ```
> 
> After: Both bpftrace instances will show the expected behavior, only
> printing events from the PID specified by their respective filters:
> ```
> PID1=1621
> bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=1621
> time=2 pid=1621
> time=4 pid=1621
> time=6 pid=1621
> 
> PID2=1633
> bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=1633
> time=2 pid=1633
> time=4 pid=1633
> time=6 pid=1633
> ```
> 
> Fixes: c1ae5c75e103 ("uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()")
> Cc: Alban Crequy <albancrequy@linux.microsoft.com>
> Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
> Signed-off-by: Tianyi Liu <i.pear@outlook.com>
> ---
> Changes in v2:
> - Drop cover letter and update commit message.
> - Link to v1: https://lore.kernel.org/linux-trace-kernel/ME0P300MB04166144CDF92A72B9E1BAEA9D8F2@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM/
> ---
>  kernel/trace/trace_uprobe.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..c7e2a0962928 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1443,6 +1443,9 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
>  				struct pt_regs *regs,
>  				struct uprobe_cpu_buffer **ucbp)
>  {
> +	if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
> +		return;
> +
>  	__uprobe_perf_func(tu, func, regs, ucbp);
>  }
>  
> -- 
> 2.34.1
>
Andrii Nakryiko Aug. 23, 2024, 7:07 p.m. UTC | #2
adding Jiri and bpftrace folks


On Fri, Aug 23, 2024 at 10:44 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 23 Aug 2024 21:53:00 +0800
> Tianyi Liu <i.pear@outlook.com> wrote:
>
> > U(ret)probes are designed to be filterable using the PID, which is the
> > second parameter in the perf_event_open syscall. Currently, uprobe works
> > well with the filtering, but uretprobe is not affected by it. This often
> > leads to users being disturbed by events from uninterested processes while
> > using uretprobe.
> >
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has been existing for ten years. We have
> > tested the patch under our workload, binding eBPF programs to uretprobe
> > tracepoints, and confirmed that it resolved our problem.
>
> Is this eBPF related problem? It seems only perf record is also affected.
> Let me try.
>
>
> >
> > Following are the steps to reproduce the issue:
> >
> > Step 1. Compile the following reproducer program:
> > ```
> >
> > int main() {
> >     printf("pid: %d\n", getpid());
> >     while (1) {
> >         sleep(2);
> >         void *ptr = malloc(1024);
> >         free(ptr);
> >     }
> > }
> > ```
> > We will then use uretprobe to trace the `malloc` function.
>
> OK, and run perf probe to add an event on malloc's return.
>
> $ sudo ~/bin/perf probe -x ./malloc-run --add malloc%return
> Added new event:
>   probe_malloc:malloc__return (on malloc%return in /home/mhiramat/ksrc/linux/malloc-run)
>
> You can now use it in all perf tools, such as:
>
>         perf record -e probe_malloc:malloc__return -aR sleep 1
>
> >
> > Step 2. Run two instances of the reproducer program and record their PIDs.
>
> $ ./malloc-run &  ./malloc-run &
> [1] 93927
> [2] 93928
> pid: 93927
> pid: 93928
>
> And trace one of them;
>
> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
>
> And dump the data;
>
> $ sudo ~/bin/perf script
>       malloc-run   93928 [004] 351736.730649:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351736.730696:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>       malloc-run   93928 [004] 351738.730857:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351738.730883:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>       malloc-run   93928 [004] 351740.731110:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351740.731127:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>
> Hmm, it seems to trace one pid data. (without this change)
> If this changes eBPF behavior, I would like to involve eBPF people to ask
> this is OK. As far as from the viewpoint of perf tool, current code works.
>
> But I agree that current code is a bit strange. Oleg, do you know anything?
>
> Thank you,
>
> >
> > Step 3. Use uretprobe to trace each of the two running reproducers
> > separately. We use bpftrace to make it easier to reproduce. Please run two
> > instances of bpftrace simultaneously: the first instance filters events
> > from PID1, and the second instance filters events from PID2.
> >
> > The expected behavior is that each bpftrace instance would only print
> > events matching its respective PID filter. However, in practice, both
> > bpftrace instances receive events from both processes, the PID filter is
> > ineffective at this moment:
> >
> > Before:
> > ```
> > PID1=55256
> > bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=55256
> > time=2 pid=55273
> > time=2 pid=55256
> > time=4 pid=55273
> > time=4 pid=55256
> > time=6 pid=55273
> > time=6 pid=55256
> >
> > PID2=55273
> > bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=55273
> > time=0 pid=55256
> > time=2 pid=55273
> > time=2 pid=55256
> > time=4 pid=55273
> > time=4 pid=55256
> > time=6 pid=55273
> > time=6 pid=55256
> > ```

This is a bit confusing, because even if the kernel-side uretprobe
handler doesn't do the filtering by itself, the uprobe subsystem shouldn't
install breakpoints in processes for which no uretprobe was requested
(unless I'm missing something, of course).

It still needs to be fixed like you do in your patch, though. Even
more, we probably need a similar UPROBE_HANDLER_REMOVE handling in
handle_uretprobe_chain() to clean up breakpoint for processes which
don't have uretprobe attached anymore (but I think that's a separate
follow up).

Anyway, I think fixing this won't affect BPF, in the sense that this
is clearly a bug and shouldn't happen. Furthermore, in the
multi-uprobe/multi-uretprobe implementation we already filter this out
correctly, so I don't think anyone does (nor should) rely on this
buggy behavior.

> >
> > After: Both bpftrace instances will show the expected behavior, only
> > printing events from the PID specified by their respective filters:
> > ```
> > PID1=1621
> > bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=1621
> > time=2 pid=1621
> > time=4 pid=1621
> > time=6 pid=1621
> >
> > PID2=1633
> > bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=1633
> > time=2 pid=1633
> > time=4 pid=1633
> > time=6 pid=1633
> > ```
> >
> > Fixes: c1ae5c75e103 ("uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()")
> > Cc: Alban Crequy <albancrequy@linux.microsoft.com>
> > Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
> > Signed-off-by: Tianyi Liu <i.pear@outlook.com>
> > ---
> > Changes in v2:
> > - Drop cover letter and update commit message.
> > - Link to v1: https://lore.kernel.org/linux-trace-kernel/ME0P300MB04166144CDF92A72B9E1BAEA9D8F2@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM/
> > ---
> >  kernel/trace/trace_uprobe.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index c98e3b3386ba..c7e2a0962928 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1443,6 +1443,9 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
> >                               struct pt_regs *regs,
> >                               struct uprobe_cpu_buffer **ucbp)
> >  {
> > +     if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
> > +             return;
> > +
> >       __uprobe_perf_func(tu, func, regs, ucbp);
> >  }
> >
> > --
> > 2.34.1
> >
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
>
Tianyi Liu Aug. 24, 2024, 5:49 a.m. UTC | #3
Hi Masami and Andrii:

I would like to share some more information and ideas, though I may be wrong.

> > U(ret)probes are designed to be filterable using the PID, which is the
> > second parameter in the perf_event_open syscall. Currently, uprobe works
> > well with the filtering, but uretprobe is not affected by it. This often
> > leads to users being disturbed by events from uninterested processes while
> > using uretprobe.
> >
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has been existing for ten years. We have
> > tested the patch under our workload, binding eBPF programs to uretprobe
> > tracepoints, and confirmed that it resolved our problem.
> 
> Is this eBPF related problem? It seems only perf record is also affected.
> Let me try.

I guess this is a general issue and not specific to BPF, because
the BPF handler is only an event "consumer".

> 
> And trace one of them;
> 
> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
> 

A key trigger here is that the two tracer instances (either `bpftrace` or
`perf record`) must be running *simultaneously*. One of them should use
PID1 as filter, while the other uses PID2.

I think the reason tracing only PID1 fails to trigger the bug is that
uprobe uses a copy-on-write mechanism to create independent .text pages
for the traced process. For example, if only PID1 is being traced, then
only the copy of libc.so mapped by PID1 is patched. Other processes
continue to use the unpatched original libc.so, so the event isn't
triggered for them. In my reproduction example, the two bpftrace instances
must be running at the same moment.

> This is a bit confusing, because even if the kernel-side uretprobe
> handler doesn't do the filtering by itself, uprobe subsystem shouldn't
> install breakpoints on processes which don't have uretprobe requested
> for (unless I'm missing something, of course).

There are two tracers: one attached to PID1 and the other attached
to PID2, as described above.

> It still needs to be fixed like you do in your patch, though. Even
> more, we probably need a similar UPROBE_HANDLER_REMOVE handling in
> handle_uretprobe_chain() to clean up breakpoint for processes which
> don't have uretprobe attached anymore (but I think that's a separate
> follow up).

Agreed; the implementation of uretprobe should be almost the same as
uprobe, but it seems uretprobe was overlooked in previous modifications.

> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
> 
> And dump the data;
> 
> $ sudo ~/bin/perf script
>       malloc-run   93928 [004] 351736.730649:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351736.730696:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>       malloc-run   93928 [004] 351738.730857:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351738.730883:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>       malloc-run   93928 [004] 351740.731110:       raw_syscalls:sys_exit: NR 230 = 0
>       malloc-run   93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
>       malloc-run   93928 [004] 351740.731127:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> 
> Hmm, it seems to trace one pid data. (without this change)
> If this changes eBPF behavior, I would like to involve eBPF people to ask
> this is OK. As far as from the viewpoint of perf tool, current code works.

I tried this and also couldn't reproduce the bug. Even when running two
perf instances simultaneously, `perf record` (or perhaps `perf trace` for
convenience) only outputs events from the corresponding PID as expected.
I initially suspected that perf might be applying another filter in user
space, but that doesn't seem to be the case. I'll need to conduct further
debugging, which might take some time.

I also tried combining bpftrace with `perf trace`. Specifically, I used
`perf trace` for PID1 and bpftrace for PID2. `perf trace` still only
outputs events from PID1, but bpftrace prints events from both PIDs.
I'm not yet sure why this is happening.

Thanks so much,
Masami Hiramatsu (Google) Aug. 24, 2024, 5:27 p.m. UTC | #4
On Sat, 24 Aug 2024 13:49:26 +0800
Tianyi Liu <i.pear@outlook.com> wrote:

> Hi Masami and Andrii:
> 
> I would like to share more information and ideas, but I'm possibly wrong.
> 
> > > U(ret)probes are designed to be filterable using the PID, which is the
> > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > well with the filtering, but uretprobe is not affected by it. This often
> > > leads to users being disturbed by events from uninterested processes while
> > > using uretprobe.
> > >
> > > We found that the filter function was not invoked when uretprobe was
> > > initially implemented, and this has been existing for ten years. We have
> > > tested the patch under our workload, binding eBPF programs to uretprobe
> > > tracepoints, and confirmed that it resolved our problem.
> > 
> > Is this eBPF related problem? It seems only perf record is also affected.
> > Let me try.
> 
> I guess it should be a general issue and is not specific to BPF, because
> the BPF handler is only a event "consumer".
> 
> > 
> > And trace one of them;
> > 
> > $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
> > 
> 
> A key trigger here is that the two tracer instances (either `bpftrace` or
> `perf record`) must be running *simultaneously*. One of them should use
> PID1 as filter, while the other uses PID2.

Even if I run `perf trace record` simultaneously, it filters the event
correctly. I just ran;

sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93927 -o trace1.data -- sleep 50 &
sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928 -o trace2.data -- sleep 50 &

And dump trace1.data and trace2.data by `perf script`.

> 
> I think the reason why only tracing PID1 fails to trigger the bug is that,
> uprobe uses some form of copy-on-write mechanism to create independent
> .text pages for the traced process. For example, if only PID1 is being
> traced, then only the libc.so used by PID1 is patched. Other processes
> will continue to use the unpatched original libc.so, so the event isn't
> triggered. In my reproduction example, the two bpftrace instances must be
> running at the same moment.
> 
> > This is a bit confusing, because even if the kernel-side uretprobe
> > handler doesn't do the filtering by itself, uprobe subsystem shouldn't
> > install breakpoints on processes which don't have uretprobe requested
> > for (unless I'm missing something, of course).
> 
> There're two tracers, one attached to PID1, and the other attached
> to PID2, as described above.

Yeah, but perf works fine. Maybe perf only enables its ring buffer for
the specific process and reads from that specific ring buffer (note, perf
has a per-process ring buffer IIRC). How does bpftrace implement it?

> 
> > It still needs to be fixed like you do in your patch, though. Even
> > more, we probably need a similar UPROBE_HANDLER_REMOVE handling in
> > handle_uretprobe_chain() to clean up breakpoint for processes which
> > don't have uretprobe attached anymore (but I think that's a separate
> > follow up).
> 
> Agreed, the implementation of uretprobe should be almost the same as
> uprobe, but it seems uretprobe was ignored in previous modifications.

OK, I just need a confirmation from the BPF people, because I could not
reproduce the issue with the perf tool.

> 
> > $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return  -p 93928
> > ^C[ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
> > 
> > And dump the data;
> > 
> > $ sudo ~/bin/perf script
> >       malloc-run   93928 [004] 351736.730649:       raw_syscalls:sys_exit: NR 230 = 0
> >       malloc-run   93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> >       malloc-run   93928 [004] 351736.730696:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> >       malloc-run   93928 [004] 351738.730857:       raw_syscalls:sys_exit: NR 230 = 0
> >       malloc-run   93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> >       malloc-run   93928 [004] 351738.730883:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> >       malloc-run   93928 [004] 351740.731110:       raw_syscalls:sys_exit: NR 230 = 0
> >       malloc-run   93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> >       malloc-run   93928 [004] 351740.731127:      raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> > 
> > Hmm, it seems to trace one pid data. (without this change)
> > If this changes eBPF behavior, I would like to involve eBPF people to ask
> > this is OK. As far as from the viewpoint of perf tool, current code works.
> 
> I tried this and also couldn't reproduce the bug. Even when running two
> perf instances simultaneously, `perf record` (or perhaps `perf trace` for
> convenience) only outputs events from the corresponding PID as expected.
> I initially suspected that perf might be applying another filter in user
> space, but that doesn't seem to be the case. I'll need to conduct further
> debugging, which might take some time.
> 
> I also tried combining bpftrace with `perf trace`. Specifically, I used
> `perf trace` for PID1 and bpftrace for PID2. `perf trace` still only
> outputs events from PID1, but bpftrace prints events from both PIDs.
> I'm not yet sure why this is happening.

Yeah, if perf only reads the specific process's ring buffer, it should
work without this filter.

Thanks,

> 
> Thanks so much,
Oleg Nesterov Aug. 25, 2024, 5 p.m. UTC | #5
On 08/23, Andrii Nakryiko wrote:
>
> This is a bit confusing, because even if the kernel-side uretprobe
> handler doesn't do the filtering by itself, uprobe subsystem shouldn't
> install breakpoints on processes which don't have uretprobe requested
> for (unless I'm missing something, of course).

Yes, but in this case uprobe_register/apply will install breakpoints
on both tasks, with PID1 and PID2.

> It still needs to be fixed

Agreed,

> like you do in your patch,

Not sure...

> more, we probably need a similar UPROBE_HANDLER_REMOVE handling in
> handle_uretprobe_chain() to clean up breakpoint for processes which
> don't have uretprobe attached anymore (but I think that's a separate
> follow up).

Probably yes... but yes, this is another issue.

Oleg.
Oleg Nesterov Aug. 25, 2024, 5:14 p.m. UTC | #6
On 08/24, Tianyi Liu wrote:
>
> A key trigger here is that the two tracer instances (either `bpftrace` or
> `perf record`) must be running *simultaneously*. One of them should use
> PID1 as filter, while the other uses PID2.

Yes.

> Agreed, the implementation of uretprobe should be almost the same as
> uprobe, but it seems uretprobe was ignored in previous modifications.

I forgot EVERYTHING about the code in kernel/trace/trace_uprobe.c, but
at first glance I am not sure that your patch is what we need...

At least I certainly disagree with "Fixes: c1ae5c75e103" ;)

uretprobe_perf_func/etc was designed for perf, and afaics this code still
works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
time.

BPF hacks/hooks were added later, so perhaps this should be fixed in the
bpf code, but I have no idea what bpftrace does...

Oleg.
Oleg Nesterov Aug. 25, 2024, 6:43 p.m. UTC | #7
On 08/25, Oleg Nesterov wrote:
>
> At least I certainly disagree with "Fixes: c1ae5c75e103" ;)
>
> uretprobe_perf_func/etc was designed for perf, and afaics this code still
> works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
> time.

Forgot to mention...

And note that in this case uprobe_perf_func()->uprobe_perf_filter() will
never return false, and this is correct.

Oleg.
Oleg Nesterov Aug. 25, 2024, 10:40 p.m. UTC | #8
On 08/25, Oleg Nesterov wrote:
>
> At least I certainly disagree with "Fixes: c1ae5c75e103" ;)
>
> uretprobe_perf_func/etc was designed for perf, and afaics this code still
> works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
> time.
>
> BPF hacks/hooks were added later, so perhaps this should be fixed in the
> bpf code, but I have no idea what bpftrace does...

And I can't install bpftrace on my old Fedora 23 working laptop ;) Yes, yes,
I know, I should upgrade it.

For the moment, please forget about ret-probes. Could you compile this program

	#define _GNU_SOURCE
	#include <unistd.h>
	#include <sched.h>
	#include <signal.h>

	int func(int i)
	{
		return i;
	}

	int test(void *arg)
	{
		int i;
		for (i = 0;; ++i) {
			sleep(1);
			func(i);
		}
		return 0;
	}

	int main(void)
	{
		static char stack[65536];

		clone(test, stack + sizeof(stack)/2, CLONE_VM|SIGCHLD, NULL);
		test(NULL);

		return 0;
	}

and then do something like

	$ ./test &
	$ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'

I hope that the syntax of the 2nd command is correct...

I _think_ that it will print 2 pids too.

But "perf-record -p" works as expected.

Oleg.
Jiri Olsa Aug. 26, 2024, 10:05 a.m. UTC | #9
On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> On 08/25, Oleg Nesterov wrote:
> >
> > At least I certainly disagree with "Fixes: c1ae5c75e103" ;)
> >
> > uretprobe_perf_func/etc was designed for perf, and afaics this code still
> > works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
> > time.
> >
> > BPF hacks/hooks were added later, so perhaps this should be fixed in the
> > bpf code, but I have no idea what bpftrace does...
> 
> And I can't install bpftrace on my old Fedora 23 working laptop ;) Yes, yes,
> I know, I should upgrade it.
> 
> For the moment, please forget about ret-probes. Could you compile this program
> 
> 	#define _GNU_SOURCE
> 	#include <unistd.h>
> 	#include <sched.h>
> 	#include <signal.h>
> 
> 	int func(int i)
> 	{
> 		return i;
> 	}
> 
> 	int test(void *arg)
> 	{
> 		int i;
> 		for (i = 0;; ++i) {
> 			sleep(1);
> 			func(i);
> 		}
> 		return 0;
> 	}
> 
> 	int main(void)
> 	{
> 		static char stack[65536];
> 
> 		clone(test, stack + sizeof(stack)/2, CLONE_VM|SIGCHLD, NULL);
> 		test(NULL);
> 
> 		return 0;
> 	}
> 
> and then do something like
> 
> 	$ ./test &
> 	$ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> 
> I hope that the syntax of the 2nd command is correct...
> 
> I _think_ that it will print 2 pids too.

yes.. but with CLONE_VM both processes share 'mm' so they are threads,
and at least uprobe_multi filters by process [1] now.. ;-)

> 
> But "perf-record -p" works as expected.

I wonder if it's because there's the perf layer that schedules each
uprobe event only when its process (PID1/2) is scheduled in and will
receive events only from that cpu while the process is running on it

jirka

[1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic

> 
> Oleg.
>
Oleg Nesterov Aug. 26, 2024, 11:57 a.m. UTC | #10
On 08/26, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> > 	$ ./test &
> > 	$ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> >
> > I hope that the syntax of the 2nd command is correct...
> >
> > I _think_ that it will print 2 pids too.
>
> yes.. but with CLONE_VM both processes share 'mm'

Yes sure,

> so they are threads,

Well this depends on definition ;) but the CLONE_VM child is not a sub-thread,
it has another TGID. See below.

> and at least uprobe_multi filters by process [1] now.. ;-)

OK, if you say that this behaviour is fine I won't argue, I simply do not know.
But see below.

> > But "perf-record -p" works as expected.
>
> I wonder it's because there's the perf layer that schedules each
> uprobe event only when its process (PID1/2) is scheduled in and will
> receive events only from that cpu while the process is running on it

Not sure I understand... The task which hits the breakpoint is always
current, it is always scheduled in.

The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that
we want to avoid __uprobe_perf_func() although this makes sense.

The main purpose is that we want to remove the breakpoints in current->mm
when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE.
IOW, the main purpose is not to penalise user-space unnecessarily.

IIUC (but I am not sure), perf-record -p will work "correctly" even if we
remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
filtering but I forgot everything.

And this makes me think that perhaps BPF can't rely on uprobe_perf_filter()
either, even if we forget about ret-probes.

> [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic

Looks obviously wrong... get_pid_task(PIDTYPE_TGID) can return a zombie
leader with ->mm == NULL while other threads and thus the whole process
is still alive.

And again, the changelog says "the intent for PID filtering it to filter by
*process*", but clone(CLONE_VM) creates another process, not a thread.

So perhaps we need

	-	if (link->task && current->mm != link->task->mm)
	+	if (link->task && !same_thread_group(current, link->task))

in uprobe_prog_run() to make "filter by *process*" true, but this won't
fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().


Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
Then which userspace tool uses this code? ;)

Oleg.
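The proposed one-liner above swaps an address-space comparison for a thread-group comparison. A toy model (plain structs standing in for task_struct fields, not kernel code) of what each check accepts:

```c
#include <stdbool.h>

/* toy task_struct: ->mm identifies the address space, ->signal the
 * thread group (both just ints here) */
struct toy_task { int mm; int signal; };

/* current check in uprobe_prog_run(): compare address spaces */
static bool filter_by_mm(const struct toy_task *cur, const struct toy_task *link)
{
	return cur->mm == link->mm;
}

/* proposed check: same_thread_group() compares ->signal */
static bool filter_by_thread_group(const struct toy_task *cur, const struct toy_task *link)
{
	return cur->signal == link->signal;
}
```

A clone(CLONE_VM) child (same mm, different thread group) slips through the mm check but is rejected by the thread-group check; a real sub-thread passes both.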
Oleg Nesterov Aug. 26, 2024, 12:24 p.m. UTC | #11
On 08/26, Oleg Nesterov wrote:
>
> Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...

Given that the patch from Tianyi makes the difference, bpftrace should
trigger the "#ifdef CONFIG_BPF_EVENTS" code in __uprobe_perf_func(), and
bpf_prog_run_array_uprobe() should return zero, correct?

> Then which userspace tool uses this code? ;)

Yes ;)

Oleg.
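The "#ifdef CONFIG_BPF_EVENTS" code Oleg refers to sits at the top of __uprobe_perf_func(): when a BPF program array is attached and running it returns 0, the function returns before submitting a perf sample. A toy model of that control flow (simplified signatures, not the kernel API):

```c
#include <stdbool.h>
#include <stddef.h>

typedef int (*toy_bpf_prog)(int pid);

static int toy_prog_deny(int pid)  { (void)pid; return 0; }
static int toy_prog_allow(int pid) { (void)pid; return 1; }

/* Toy control flow of __uprobe_perf_func(): if a BPF program array is
 * attached (bpf_prog_array_valid()) and running it returns 0, bail out
 * before perf_trace_buf_submit().  Returns true iff a perf sample would
 * be emitted. */
static bool toy_uprobe_perf_func(toy_bpf_prog prog, int pid)
{
	if (prog && prog(pid) == 0)
		return false;	/* BPF consumed/filtered the event */
	return true;
}
```

This matches the observation: bpftrace attaching through the perf "uprobe" PMU rides this path, and its program returning 0 is what suppresses the sample.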
Jiri Olsa Aug. 26, 2024, 1:48 p.m. UTC | #12
On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> On 08/26, Jiri Olsa wrote:
> >
> > On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> > > 	$ ./test &
> > > 	$ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> > >
> > > I hope that the syntax of the 2nd command is correct...
> > >
> > > I _think_ that it will print 2 pids too.
> >
> > yes.. but with CLONE_VM both processes share 'mm'
> 
> Yes sure,
> 
> > so they are threads,
> 
> Well this depends on definition ;) but the CLONE_VM child is not a sub-thread,
> it has another TGID. See below.
> 
> > and at least uprobe_multi filters by process [1] now.. ;-)
> 
> OK, if you say that this behaviour is fine I won't argue, I simply do not know.
> But see below.
> 
> > > But "perf-record -p" works as expected.
> >
> > I wonder it's because there's the perf layer that schedules each
> > uprobe event only when its process (PID1/2) is scheduled in and will
> > receive events only from that cpu while the process is running on it
> 
> Not sure I understand... The task which hits the breakpoint is always
> current, it is always scheduled in.

hum, I might be missing something, but ;-)

assuming we have 2 tasks, each with perf uprobe event assigned

in perf path there's uprobe_perf_func which calls the uprobe_perf_filter
and if it returns true it then goes:

  uprobe_perf_func
    __uprobe_perf_func
      perf_trace_buf_submit
        perf_tp_event
        {

           hlist_for_each_entry_rcu(event, head, hlist_entry) {
             if (perf_tp_event_match(event, &data, regs)) {
                perf_swevent_event(event, count, &data, regs);

        }

so it will store the event only in the perf event that was added for
the task that is currently scheduled in; it's not stored in the other
task's event

in comparison with uprobe_multi path, where uprobe_multi_link_filter
will pass and then it goes:

  handler_chain
    uprobe_multi_link_handler
      bpf_prog_run
       - executes bpf program

the bpf program will get executed for uprobe from both tasks

> 
> The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that
> we want to avoid __uprobe_perf_func() although this makes sense.
> 
> The main purpose is that we want to remove the breakpoints in current->mm
> when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE.
> IOW, the main purpose is not penalise user-space unnecessarily.
> 
> IIUC (but I am not sure), perf-record -p will work "correctly" even if we
> remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
> filtering but I forgot everything.

I think that's what I tried to describe above

> 
> And this makes me think that perhaps BPF can't rely on uprobe_perf_filter()
> either, even we forget about ret-probes.
> 
> > [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic
> 
> Looks obviously wrong... get_pid_task(PIDTYPE_TGID) can return a zombie
> leader with ->mm == NULL while other threads and thus the whole process
> is still alive.
> 
> And again, the changelog says "the intent for PID filtering it to filter by
> *process*", but clone(CLONE_VM) creates another process, not a thread.
> 
> So perhaps we need
> 
> 	-	if (link->task && current->mm != link->task->mm)
> 	+	if (link->task && !same_thread_group(current, link->task))

that seems correct, need to check

> 
> in uprobe_prog_run() to make "filter by *process*" true, but this won't
> fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().
> 
> 
> Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
> Then which userspace tool uses this code? ;)

yes, it will trigger if you attach to multiple uprobes, like with your
test example with:

  # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }'

thanks,
jirka
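The contrast Jiri draws can be modelled in a few lines: in the perf path only the event perf_trace_add()ed for the scheduled-in task sees the hit, while the handler_chain/uprobe_multi path runs every consumer's program. A toy sketch (invented types, not kernel code):

```c
#include <stdbool.h>

struct toy_consumer { int target_pid; bool scheduled_in; };

/* handler_chain()/uprobe_multi path: every registered consumer's program
 * runs on a hit, regardless of which task hit the breakpoint. */
static int multi_path_hits(const struct toy_consumer *c, int n, int hitting_pid)
{
	(void)c; (void)hitting_pid;
	return n;
}

/* perf path: perf_tp_event() walks the per-CPU hlist, which only holds
 * events added for the task currently scheduled in on this CPU. */
static int perf_path_hits(const struct toy_consumer *c, int n, int hitting_pid)
{
	int delivered = 0;
	for (int i = 0; i < n; i++)
		if (c[i].scheduled_in && c[i].target_pid == hitting_pid)
			delivered++;
	return delivered;
}
```

With two consumers attached (one per bpftrace instance) and task 1018 hitting the probe, the multi path delivers twice while the perf path delivers once.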
Tianyi Liu Aug. 26, 2024, 2:52 p.m. UTC | #13
Hi Oleg,

> For the moment, please forget about ret-probes. Could you compile this program
> 
> #define _GNU_SOURCE
> #include <unistd.h>
> #include <sched.h>
> #include <signal.h>
> 
> int func(int i)
> {
>         return i;
> }
> 
> int test(void *arg)
> {
>         int i;
>         for (i = 0;; ++i) {
>                 sleep(1);
>                 func(i);
>         }
>         return 0;
> }
> 
> int main(void)
> {
>         static char stack[65536];
> 
>         clone(test, stack + sizeof(stack)/2, CLONE_VM|SIGCHLD, NULL);
>         test(NULL);
> 
>         return 0;
> }
> 
> and then do something like
> 
>         $ ./test &
>         $ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> 
> I hope that the syntax of the 2nd command is correct...
> 
> I _think_ that it will print 2 pids too.
> 
> But "perf-record -p" works as expected.

Yes, the output from bpftrace and perf matches what you described:

$ ./tester &
[1] 158592

$ perf probe -x tester --add func
Added new event:
  probe_tester:func    (on func in /root/test/tester)

$ bpftrace -p 158592 -e 'uprobe:./tester:func { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
Attaching 1 probe...
time=0 pid=158592
time=0 pid=158594
time=1 pid=158592
time=1 pid=158594
time=2 pid=158592
time=2 pid=158594
time=3 pid=158592
time=3 pid=158594

$ perf record -e probe_tester:func -p 158592 -o 158592
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.022 MB 158592 (19 samples) ]

$ perf script -i 158592
tester  158592 [006] 246475.295762: probe_tester:func: (55a6def14149)
tester  158592 [006] 246476.295828: probe_tester:func: (55a6def14149)
tester  158592 [006] 246477.295892: probe_tester:func: (55a6def14149)
tester  158592 [006] 246478.295958: probe_tester:func: (55a6def14149)
tester  158592 [006] 246479.296024: probe_tester:func: (55a6def14149)
tester  158592 [010] 246480.296202: probe_tester:func: (55a6def14149)
tester  158592 [010] 246481.296360: probe_tester:func: (55a6def14149)
[...]

Thanks,
Oleg Nesterov Aug. 26, 2024, 6:56 p.m. UTC | #14
On 08/26, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> > On 08/26, Jiri Olsa wrote:
> > >
> > > > But "perf-record -p" works as expected.
> > >
> > > I wonder it's because there's the perf layer that schedules each
> > > uprobe event only when its process (PID1/2) is scheduled in and will
> > > receive events only from that cpu while the process is running on it
> >
> > Not sure I understand... The task which hits the breakpoint is always
> > current, it is always scheduled in.
>
> hum, I might be missing something, but ;-)

No it is me ;) at least I certainly misunderstood your "scheduled in".

> assuming we have 2 tasks, each with perf uprobe event assigned
>
> in perf path there's uprobe_perf_func which calls the uprobe_perf_filter
> and if it returns true it then goes:
>
>   uprobe_perf_func
>     __uprobe_perf_func
>       perf_trace_buf_submit
>         perf_tp_event
>         {
>
>            hlist_for_each_entry_rcu(event, head, hlist_entry) {
>              if (perf_tp_event_match(event, &data, regs)) {
>                 perf_swevent_event(event, count, &data, regs);

Aha. So, in this particular case, when the CLONE_VM child hits the bp
and calls uprobe_perf_func(), even perf_tp_event_match() won't be called,
head is hlist_empty(), right?

Thanks!

> in comparison with uprobe_multi path, where uprobe_multi_link_filter

Yeah, this is clear.

> > IIUC (but I am not sure), perf-record -p will work "correctly" even if we
> > remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
> > filtering but I forgot everything.
>
> I think that's what I tried to describe above

Yes, thanks.

Oleg.
Oleg Nesterov Aug. 26, 2024, 9:25 p.m. UTC | #15
This is offtopic, sorry for the spam, but...

On 08/26, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> >
> > Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
> > Then which userspace tool uses this code? ;)
>
> yes, it will trigger if you attach to multiple uprobes, like with your
> test example with:
>
>   # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }'

Hmm. I reserved the testing machine with fedora 40 to play with bpftrace.

dummy.c:

	#include <unistd.h>

	void func1(void) {}
	void func2(void) {}

	int main(void) { for (;;) pause(); }

If I do

	# ./dummy &
	# bpftrace -p $! -e 'uprobe:./dummy:func* { printf("%d\n", pid); }'

and run

	# bpftrace -e 'kprobe:__uprobe_register { printf("%s\n", kstack); }'

on another console I get

	Attaching 1 probe...

        __uprobe_register+1
        probe_event_enable+399
        perf_trace_event_init+440
        perf_uprobe_init+152
        perf_uprobe_event_init+74
        perf_try_init_event+71
        perf_event_alloc+1681
        __do_sys_perf_event_open+447
        do_syscall_64+130
        entry_SYSCALL_64_after_hwframe+118

        __uprobe_register+1
        probe_event_enable+399
        perf_trace_event_init+440
        perf_uprobe_init+152
        perf_uprobe_event_init+74
        perf_try_init_event+71
        perf_event_alloc+1681
        __do_sys_perf_event_open+447
        do_syscall_64+130
        entry_SYSCALL_64_after_hwframe+118

so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach()
(called by sys_bpf(BPF_LINK_CREATE) ?) in this case.

But again, this is offtopic, please forget.

Oleg.
Jiri Olsa Aug. 26, 2024, 10:01 p.m. UTC | #16
On Mon, Aug 26, 2024 at 11:25:53PM +0200, Oleg Nesterov wrote:
> This is offtopic, sorry for the spam, but...
> 
> On 08/26, Jiri Olsa wrote:
> >
> > On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> > >
> > > Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
> > > Then which userspace tool uses this code? ;)
> >
> > yes, it will trigger if you attach to multiple uprobes, like with your
> > test example with:
> >
> >   # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }'
> 
> Hmm. I reserved the testing machine with fedora 40 to play with bpftrace.
> 
> dummy.c:
> 
> 	#include <unistd.h>
> 
> 	void func1(void) {}
> 	void func2(void) {}
> 
> 	int main(void) { for (;;) pause(); }
> 
> If I do
> 
> 	# ./dummy &
> 	# bpftrace -p $! -e 'uprobe:./dummy:func* { printf("%d\n", pid); }'
> 
> and run
> 
> 	# bpftrace -e 'kprobe:__uprobe_register { printf("%s\n", kstack); }'

did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:

[root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
Attaching 1 probe...

        uprobe_register+1
        bpf_uprobe_multi_link_attach+685
        __sys_bpf+9395
        __x64_sys_bpf+26
        do_syscall_64+128
        entry_SYSCALL_64_after_hwframe+118


I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:

[root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
  uprobe_multi: yes
[root@qemu ex]# ../bpftrace/build/src/bpftrace --version
bpftrace v0.20.0


jirka

> 
> on another console I get
> 
> 	Attaching 1 probe...
> 
>         __uprobe_register+1
>         probe_event_enable+399
>         perf_trace_event_init+440
>         perf_uprobe_init+152
>         perf_uprobe_event_init+74
>         perf_try_init_event+71
>         perf_event_alloc+1681
>         __do_sys_perf_event_open+447
>         do_syscall_64+130
>         entry_SYSCALL_64_after_hwframe+118
> 
>         __uprobe_register+1
>         probe_event_enable+399
>         perf_trace_event_init+440
>         perf_uprobe_init+152
>         perf_uprobe_event_init+74
>         perf_try_init_event+71
>         perf_event_alloc+1681
>         __do_sys_perf_event_open+447
>         do_syscall_64+130
>         entry_SYSCALL_64_after_hwframe+118
> 
> so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach()
> (called by sys_bpf(BPF_LINK_CREATE) ?) in this case.
> 
> But again, this is offtopic, please forget.
> 
> Oleg.
>
Andrii Nakryiko Aug. 26, 2024, 10:08 p.m. UTC | #17
On Mon, Aug 26, 2024 at 3:01 PM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Aug 26, 2024 at 11:25:53PM +0200, Oleg Nesterov wrote:
> > This is offtopic, sorry for the spam, but...
> >
> > On 08/26, Jiri Olsa wrote:
> > >
> > > On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> > > >
> > > > Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
> > > > Then which userspace tool uses this code? ;)
> > >
> > > yes, it will trigger if you attach to multiple uprobes, like with your
> > > test example with:
> > >
> > >   # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }'
> >
> > Hmm. I reserved the testing machine with fedora 40 to play with bpftrace.
> >
> > dummy.c:
> >
> >       #include <unistd.h>
> >
> >       void func1(void) {}
> >       void func2(void) {}
> >
> >       int main(void) { for (;;) pause(); }
> >
> > If I do
> >
> >       # ./dummy &
> >       # bpftrace -p $! -e 'uprobe:./dummy:func* { printf("%d\n", pid); }'
> >
> > and run
> >
> >       # bpftrace -e 'kprobe:__uprobe_register { printf("%s\n", kstack); }'
>
> did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:
>
> [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
> Attaching 1 probe...
>
>         uprobe_register+1
>         bpf_uprobe_multi_link_attach+685
>         __sys_bpf+9395
>         __x64_sys_bpf+26
>         do_syscall_64+128
>         entry_SYSCALL_64_after_hwframe+118
>
>
> I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:
>
> [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
>   uprobe_multi: yes
> [root@qemu ex]# ../bpftrace/build/src/bpftrace --version
> bpftrace v0.20.0
>
>

So what's the conclusion for the original issue? Do we have a path
forward with the fix?

> jirka
>
> >
> > on another console I get
> >
> >       Attaching 1 probe...
> >
> >         __uprobe_register+1
> >         probe_event_enable+399
> >         perf_trace_event_init+440
> >         perf_uprobe_init+152
> >         perf_uprobe_event_init+74
> >         perf_try_init_event+71
> >         perf_event_alloc+1681
> >         __do_sys_perf_event_open+447
> >         do_syscall_64+130
> >         entry_SYSCALL_64_after_hwframe+118
> >
> >         __uprobe_register+1
> >         probe_event_enable+399
> >         perf_trace_event_init+440
> >         perf_uprobe_init+152
> >         perf_uprobe_event_init+74
> >         perf_try_init_event+71
> >         perf_event_alloc+1681
> >         __do_sys_perf_event_open+447
> >         do_syscall_64+130
> >         entry_SYSCALL_64_after_hwframe+118
> >
> > so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach()
> > (called by sys_bpf(BPF_LINK_CREATE) ?) in this case.
> >
> > But again, this is offtopic, please forget.
> >
> > Oleg.
> >
Oleg Nesterov Aug. 26, 2024, 10:29 p.m. UTC | #18
On 08/27, Jiri Olsa wrote:
>
> did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:
>
> [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
> Attaching 1 probe...
>
>         uprobe_register+1

so I guess you are on tip/perf/core which killed uprobe_register_refctr()
and changed bpf_uprobe_multi_link_attach() to use uprobe_register

>         bpf_uprobe_multi_link_attach+685
>         __sys_bpf+9395
>         __x64_sys_bpf+26
>         do_syscall_64+128
>         entry_SYSCALL_64_after_hwframe+118
>
>
> I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:

bpftrace v0.20.1

> [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
>   uprobe_multi: yes

Aha, I get

	uprobe_multi: no

OK. So, on your setup bpftrace uses bpf_uprobe_multi_link_attach()
and this implies ->ret_handler = uprobe_multi_link_ret_handler()
which calls uprobe_prog_run() which does

	if (link->task && current->mm != link->task->mm)
		return 0;

So, can you reproduce the problem reported by Tianyi on your setup?

Oleg.
Tianyi Liu Aug. 27, 2024, 6:27 a.m. UTC | #19
On 08/27 05:26, Oleg Nesterov wrote:
>    __uprobe_register+1
>    probe_event_enable+399
>    perf_trace_event_init+440
>    perf_uprobe_init+152
>    perf_uprobe_event_init+74
>    perf_try_init_event+71
>    perf_event_alloc+1681
>    __do_sys_perf_event_open+447
>    do_syscall_64+130
>    entry_SYSCALL_64_after_hwframe+118
> 
> so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach()
> (called by sys_bpf(BPF_LINK_CREATE) ?) in this case.

I'm using bpftrace v0.21.2 with "uprobe_multi: no", and I got the same
stack as yours.

When I use strace to trace bpftrace, I get:
```
$ strace bpftrace -p 38962 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }' 2>&1 | grep perf_event_open
perf_event_open({type=0x7 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */,config=0x1, sample_period=1, 
sample_type=0, read_format=0, precise_ip=0 /* arbitrary skid */, ...}, 38962, -1, -1, PERF_FLAG_FD_CLOEXEC) = 12

$ cat /sys/bus/event_source/devices/uprobe/type 
7
```

Here bpftrace is using "uprobe" as the event type, which is registered in
```
static struct pmu perf_uprobe = {
	.task_ctx_nr	= perf_sw_context,
	.event_init	= perf_uprobe_event_init,
	.add		= perf_trace_add,
	.del		= perf_trace_del,
	.start		= perf_swevent_start,
	.stop		= perf_swevent_stop,
	.read		= perf_swevent_read,
	.attr_groups	= uprobe_attr_groups,
};
perf_pmu_register(&perf_uprobe, "uprobe", -1);
```

While I use strace for perf, I get:
```
$ strace perf trace -e probe_libc:malloc__return -p 38962 2>&1 |grep perf_event_open
[...]
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1641, sample_period=1, 
sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW, 
read_format=PERF_FORMAT_ID, disabled=1, mmap=1, comm=1, task=1, precise_ip=0 /* arbitrary skid */, sample_id_all=1, 
exclude_guest=1, mmap2=1, comm_exec=1, ksymbol=1, bpf_event=1, ...}, 38962, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
```

The PERF_TYPE_TRACEPOINT is registered in:
```
static struct pmu perf_tracepoint = {
	.task_ctx_nr	= perf_sw_context,
	.event_init	= perf_tp_event_init,
	.add		= perf_trace_add,
	.del		= perf_trace_del,
	.start		= perf_swevent_start,
	.stop		= perf_swevent_stop,
	.read		= perf_swevent_read,
};
perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
```

So there's a slight difference between the event types being used. I 
haven't found the connection between this and the filtering, just FYI.

By the way, I used bpftrace just for convenience in reproducing the issue.
This bug was originally discovered while using cilium/ebpf (a golang ebpf
library), which also uses "uprobe" as the event type [1].

[1] https://github.com/cilium/ebpf/blob/main/link/kprobe.go#L251
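For reference, the strace output above corresponds to filling perf_event_attr for the dynamic "uprobe" PMU: the type comes from sysfs, config bit 0 marks a return probe, and the path/offset travel in the config1/config2 unions of the uapi header. A sketch (helper name hypothetical; the union member names are from linux/perf_event.h):

```c
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>

/* Fill perf_event_attr the way the strace above shows: dynamic "uprobe"
 * PMU type (read from sysfs by the caller), config bit 0 = retprobe,
 * path and offset in the config1/config2 unions. */
struct perf_event_attr make_uretprobe_attr(uint32_t pmu_type,
					   const char *path, uint64_t offset)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = pmu_type;		/* /sys/bus/event_source/devices/uprobe/type */
	attr.config = 1;		/* bit 0 set: return probe */
	attr.uprobe_path = (uint64_t)(uintptr_t)path;	/* union with config1 */
	attr.probe_offset = offset;			/* union with config2 */
	attr.sample_period = 1;
	return attr;
}
```

The attr would then be passed to perf_event_open() with the target PID as the second argument, exactly as in the strace above.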
Jiri Olsa Aug. 27, 2024, 10:08 a.m. UTC | #20
On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> On 08/26, Jiri Olsa wrote:
> >
> > On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> > > 	$ ./test &
> > > 	$ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> > >
> > > I hope that the syntax of the 2nd command is correct...
> > >
> > > I _think_ that it will print 2 pids too.
> >
> > yes.. but with CLONE_VM both processes share 'mm'
> 
> Yes sure,
> 
> > so they are threads,
> 
> Well this depends on definition ;) but the CLONE_VM child is not a sub-thread,
> it has another TGID. See below.
> 
> > and at least uprobe_multi filters by process [1] now.. ;-)
> 
> OK, if you say that this behaviour is fine I won't argue, I simply do not know.
> But see below.
> 
> > > But "perf-record -p" works as expected.
> >
> > I wonder it's because there's the perf layer that schedules each
> > uprobe event only when its process (PID1/2) is scheduled in and will
> > receive events only from that cpu while the process is running on it
> 
> Not sure I understand... The task which hits the breakpoint is always
> current, it is always scheduled in.
> 
> The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that
> we want to avoid __uprobe_perf_func() although this makes sense.
> 
> The main purpose is that we want to remove the breakpoints in current->mm
> when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE.
> IOW, the main purpose is not penalise user-space unnecessarily.
> 
> IIUC (but I am not sure), perf-record -p will work "correctly" even if we
> remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
> filtering but I forgot everything.
> 
> And this makes me think that perhaps BPF can't rely on uprobe_perf_filter()
> either, even we forget about ret-probes.
> 
> > [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic
> 
> Looks obviously wrong... get_pid_task(PIDTYPE_TGID) can return a zombie
> leader with ->mm == NULL while other threads and thus the whole process
> is still alive.
> 
> And again, the changelog says "the intent for PID filtering it to filter by
> *process*", but clone(CLONE_VM) creates another process, not a thread.
> 
> So perhaps we need
> 
> 	-	if (link->task && current->mm != link->task->mm)
> 	+	if (link->task && !same_thread_group(current, link->task))
> 
> in uprobe_prog_run() to make "filter by *process*" true, but this won't
> fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().

would the same_thread_group(current, link->task) work in such a case?
(zombie leader with other alive threads)

jirka

> 
> 
> Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
> Then which userspace tool uses this code? ;)
> 
> Oleg.
>
Jiri Olsa Aug. 27, 2024, 10:20 a.m. UTC | #21
On Tue, Aug 27, 2024 at 12:08:39PM +0200, Jiri Olsa wrote:
> On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> > On 08/26, Jiri Olsa wrote:
> > >
> > > On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> > > > 	$ ./test &
> > > > 	$ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> > > >
> > > > I hope that the syntax of the 2nd command is correct...
> > > >
> > > > I _think_ that it will print 2 pids too.
> > >
> > > yes.. but with CLONE_VM both processes share 'mm'
> > 
> > Yes sure,
> > 
> > > so they are threads,
> > 
> > Well this depends on definition ;) but the CLONE_VM child is not a sub-thread,
> > it has another TGID. See below.
> > 
> > > and at least uprobe_multi filters by process [1] now.. ;-)
> > 
> > OK, if you say that this behaviour is fine I won't argue, I simply do not know.
> > But see below.
> > 
> > > > But "perf-record -p" works as expected.
> > >
> > > I wonder it's because there's the perf layer that schedules each
> > > uprobe event only when its process (PID1/2) is scheduled in and will
> > > receive events only from that cpu while the process is running on it
> > 
> > Not sure I understand... The task which hits the breakpoint is always
> > current, it is always scheduled in.
> > 
> > The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that
> > we want to avoid __uprobe_perf_func() although this makes sense.
> > 
> > The main purpose is that we want to remove the breakpoints in current->mm
> > when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE.
> > IOW, the main purpose is not penalise user-space unnecessarily.
> > 
> > IIUC (but I am not sure), perf-record -p will work "correctly" even if we
> > remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
> > filtering but I forgot everything.
> > 
> > And this makes me think that perhaps BPF can't rely on uprobe_perf_filter()
> > either, even we forget about ret-probes.
> > 
> > > [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic
> > 
> > Looks obviously wrong... get_pid_task(PIDTYPE_TGID) can return a zombie
> > leader with ->mm == NULL while other threads and thus the whole process
> > is still alive.
> > 
> > And again, the changelog says "the intent for PID filtering it to filter by
> > *process*", but clone(CLONE_VM) creates another process, not a thread.
> > 
> > So perhaps we need
> > 
> > 	-	if (link->task && current->mm != link->task->mm)
> > 	+	if (link->task && !same_thread_group(current, link->task))
> > 
> > in uprobe_prog_run() to make "filter by *process*" true, but this won't
> > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().
> 
> would the same_thread_group(current, link->task) work in such case?
> (zombie leader with other alive threads)

should uprobe_perf_filter use same_thread_group as well, instead
of the mm pointer check?

jirka
Oleg Nesterov Aug. 27, 2024, 10:40 a.m. UTC | #22
On 08/27, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> >
> > So perhaps we need
> >
> > 	-	if (link->task && current->mm != link->task->mm)
> > 	+	if (link->task && !same_thread_group(current, link->task))
> >
> > in uprobe_prog_run() to make "filter by *process*" true, but this won't
> > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().
>
> would the same_thread_group(current, link->task) work in such case?
> (zombie leader with other alive threads)

Why not? task_struct->signal is stable, it can't be changed.

But again, uprobe_multi_link_filter() won't work if the leader,
uprobe->link->task, exits or it has already exited.

Perhaps something like the additional change below...

Oleg.

--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3322,13 +3322,28 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe,
 	return err;
 }
 
+
 static bool
 uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
 {
 	struct bpf_uprobe *uprobe;
+	struct task_struct *task, *t;
+	bool ret = false;
 
 	uprobe = container_of(con, struct bpf_uprobe, consumer);
-	return uprobe->link->task->mm == mm;
+	task = uprobe->link->task;
+
+	rcu_read_lock();
+	for_each_thread(task, t) {
+		struct mm_struct *t_mm = READ_ONCE(t->mm);
+		if (t_mm) {
+			ret = t_mm == mm;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int
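The idea of the hunk above, i.e. don't trust the (possibly zombie) leader's ->mm but scan the thread group for the first live thread, can be modelled with a NULL-terminated toy thread list (the kernel's for_each_thread() walks the real thread list under RCU; the names here are invented):

```c
#include <stddef.h>

struct toy_mm { int dummy; };
struct toy_thread { struct toy_mm *mm; struct toy_thread *next; };

/* Walk the thread group and judge by the first live thread's mm, so a
 * zombie leader (mm == NULL) no longer makes the filter always fail. */
int toy_link_filter(struct toy_thread *leader, struct toy_mm *target_mm)
{
	for (struct toy_thread *t = leader; t; t = t->next) {
		if (t->mm)
			return t->mm == target_mm;
	}
	return 0;	/* whole group is dead: nothing left to trace */
}
```

With a zombie leader whose sibling thread still holds the traced mm, the scan finds the sibling and matches; once every thread has released its mm the filter rejects.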
Oleg Nesterov Aug. 27, 2024, 10:54 a.m. UTC | #23
On 08/27, Jiri Olsa wrote:
>
> On Tue, Aug 27, 2024 at 12:08:39PM +0200, Jiri Olsa wrote:
> > >
> > > 	-	if (link->task && current->mm != link->task->mm)
> > > 	+	if (link->task && !same_thread_group(current, link->task))
> > >
> > > in uprobe_prog_run() to make "filter by *process*" true, but this won't
> > > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().
> >
> > would the same_thread_group(current, link->task) work in such case?
> > (zombie leader with other alive threads)
>
> should uprobe_perf_filter use same_thread_group as well instead
> of mm pointers check?

uprobe_perf_filter or uprobe_multi_link_filter ?

In any case I don't think same_thread_group(current, whatever) can work.

For example, uc->filter() can be called from uprobe_register() paths. In
this case "current" is the unrelated task which does, say, perf-record, etc.
Even if uc->filter() was only called from handler_chain(), it couldn't work,
think of UPROBE_HANDLER_REMOVE.

See also another email I sent a minute ago.

Oleg.
Jiri Olsa Aug. 27, 2024, 1:07 p.m. UTC | #24
On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote:
> On 08/27, Jiri Olsa wrote:
> >
> > did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:
> >
> > [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
> > Attaching 1 probe...
> >
> >         uprobe_register+1
> 
> so I guess you are on tip/perf/core which killed uprobe_register_refctr()
> and changed bpf_uprobe_multi_link_attach() to use uprobe_register
> 
> >         bpf_uprobe_multi_link_attach+685
> >         __sys_bpf+9395
> >         __x64_sys_bpf+26
> >         do_syscall_64+128
> >         entry_SYSCALL_64_after_hwframe+118
> >
> >
> > I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:
> 
> bpftrace v0.20.1
> 
> > [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
> >   uprobe_multi: yes
> 
> Aha, I get
> 
> 	uprobe_multi: no
> 
> OK. So, on your setup bpftrace uses bpf_uprobe_multi_link_attach()
> and this implies ->ret_handler = uprobe_multi_link_ret_handler()
> which calls uprobe_prog_run() which does
> 
> 	if (link->task && current->mm != link->task->mm)
> 		return 0;
> 
> So, can you reproduce the problem reported by Tianyi on your setup?

yes, I can reproduce the issue with uretprobe on top of a perf event uprobe

running 2 tasks of the test code:

	#include <stdio.h>
	#include <unistd.h>

	int func() {
		return 0;
	}

	int main() {
	    printf("pid: %d\n", getpid());
	    while (1) {
		sleep(2);
		func();
	    }
	}

and running 2 instances of bpftrace (each with separate pid):

	[root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1018 -e 'uretprobe:./test:func { printf("%d\n", pid); }'
	Attaching 1 probe...
	1018
	1017
	1018
	1017

	[root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1017 -e 'uretprobe:./test:func { printf("%d\n", pid); }'
	Attaching 1 probe...
	1017
	1018
	1017
	1018

each uretprobe hit will execute the bpf program for both bpftrace instances, like:

          sched-in 1018 
            perf_trace_add

   ->     uprobe-hit
            handle_swbp
              handler_chain
              {
                for_each_uprobe_consumer {

                  // consumer for task 1017
                  uprobe_dispatcher
                    uprobe_perf_func
                      uprobe_perf_filter return false

                  // consumer for task 1018
                  uprobe_dispatcher
                    uprobe_perf_func
                      uprobe_perf_filter return true
                       -> could run bpf program, but none is configured
                }

                prepare_uretprobe
              }

   ->     uretprobe-hit
            handle_swbp
              uprobe_handle_trampoline
                handle_uretprobe_chain
                {

                  for_each_uprobe_consumer {
                    
                    // consumer for task 1017
                    uretprobe_dispatcher
                      uretprobe_perf_func
                        -> runs bpf program

                    // consumer for task 1018
                    uretprobe_dispatcher
                      uretprobe_perf_func
                        -> runs bpf program

                  }
                }

          sched-out 1019
            perf_trace_del


and I think the same will happen for perf record in this case where instead of
running the program we will execute perf_tp_event

I think the uretprobe_dispatcher could call filter as suggested in the original
patch.. but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
like we do in handler_chain.. maybe just to save the next uprobe hit which would
remove the uprobe?
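fwiw the change could look something like below (untested sketch against
kernel/trace/trace_uprobe.c reusing the existing uprobe_perf_filter; it's
kernel-internal code, so just a sketch, not the final patch):

```c
static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
				struct pt_regs *regs,
				struct uprobe_cpu_buffer **ucbp)
{
	/* skip consumers whose perf event targets a different mm,
	 * mirroring what the uprobe entry path already does
	 */
	if (!uprobe_perf_filter(&tu->consumer, current->mm))
		return;
	__uprobe_perf_func(tu, func, regs, ucbp);
}
```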

jirka
Jiri Olsa Aug. 27, 2024, 1:32 p.m. UTC | #25
On Tue, Aug 27, 2024 at 12:40:52PM +0200, Oleg Nesterov wrote:
> On 08/27, Jiri Olsa wrote:
> >
> > On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> > >
> > > So perhaps we need
> > >
> > > 	-	if (link->task && current->mm != link->task->mm)
> > > 	+	if (link->task && !same_thread_group(current, link->task))
> > >
> > > in uprobe_prog_run() to make "filter by *process*" true, but this won't
> > > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().
> >
> > would the same_thread_group(current, link->task) work in such case?
> > (zombie leader with other alive threads)
> 
> Why not? task_struct->signal is stable, it can't be changed.
> 
> But again, uprobe_multi_link_filter() won't work if the leader,
> uprobe->link->task, exits or it has already exited.
> 
> Perhaps something like the additional change below...
> 
> Oleg.
> 
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -3322,13 +3322,28 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe,
>  	return err;
>  }
>  
> +
>  static bool
>  uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
>  {
>  	struct bpf_uprobe *uprobe;
> +	struct task_struct *task, *t;
> +	bool ret = false;
>  
>  	uprobe = container_of(con, struct bpf_uprobe, consumer);
> -	return uprobe->link->task->mm == mm;
> +	task = uprobe->link->task;
> +
> +	rcu_read_lock();
> +	for_each_thread(task, t) {
> +		struct mm_struct *t_mm = READ_ONCE(t->mm);
> +		if (t_mm) {
> +			ret = t_mm == mm;
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();

that seems expensive if there's many threads

could we check the leader first and only if it's gone fallback to this?
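maybe something like below? (untested sketch on top of your patch: check the
leader's mm first, and fall back to the thread walk only when the leader is
already gone)

```c
	task = uprobe->link->task;

	/* fast path: the group leader is usually alive, and all live
	 * threads share one mm anyway
	 */
	leader_mm = READ_ONCE(task->mm);
	if (leader_mm)
		return leader_mm == mm;

	rcu_read_lock();
	for_each_thread(task, t) {
		struct mm_struct *t_mm = READ_ONCE(t->mm);
		if (t_mm) {
			ret = t_mm == mm;
			break;
		}
	}
	rcu_read_unlock();
	return ret;
```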

thanks,
jirka
Jiri Olsa Aug. 27, 2024, 1:45 p.m. UTC | #26
On Tue, Aug 27, 2024 at 03:07:01PM +0200, Jiri Olsa wrote:
> On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote:
> > On 08/27, Jiri Olsa wrote:
> > >
> > > did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:
> > >
> > > [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
> > > Attaching 1 probe...
> > >
> > >         uprobe_register+1
> > 
> > so I guess you are on tip/perf/core which killed uprobe_register_refctr()
> > and changed bpf_uprobe_multi_link_attach() to use uprobe_register
> > 
> > >         bpf_uprobe_multi_link_attach+685
> > >         __sys_bpf+9395
> > >         __x64_sys_bpf+26
> > >         do_syscall_64+128
> > >         entry_SYSCALL_64_after_hwframe+118
> > >
> > >
> > > I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:
> > 
> > bpftrace v0.20.1
> > 
> > > [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
> > >   uprobe_multi: yes
> > 
> > Aha, I get
> > 
> > 	uprobe_multi: no
> > 
> > OK. So, on your setup bpftrace uses bpf_uprobe_multi_link_attach()
> > and this implies ->ret_handler = uprobe_multi_link_ret_handler()
> > which calls uprobe_prog_run() which does
> > 
> > 	if (link->task && current->mm != link->task->mm)
> > 		return 0;
> > 
> > So, can you reproduce the problem reported by Tianyi on your setup?
> 
> yes, I can repduce the issue with uretprobe on top of perf event uprobe
> 
> running 2 tasks of the test code:
> 
> 	int func() {
> 		return 0;
> 	}
> 
> 	int main() {
> 	    printf("pid: %d\n", getpid());
> 	    while (1) {
> 		sleep(2);
> 		func();
> 	    }
> 	}
> 
> and running 2 instances of bpftrace (each with separate pid):
> 
> 	[root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1018 -e 'uretprobe:./test:func { printf("%d\n", pid); }'
> 	Attaching 1 probe...
> 	1018
> 	1017
> 	1018
> 	1017
> 
> 	[root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1017 -e 'uretprobe:./test:func { printf("%d\n", pid); }'
> 	Attaching 1 probe...
> 	1017
> 	1018
> 	1017
> 	1018
> 
> will execute bpf program twice for each bpftrace instance, like:
> 
>           sched-in 1018 
>             perf_trace_add
> 
>    ->     uprobe-hit
>             handle_swbp
>               handler_chain
>               {
>                 for_each_uprobe_consumer {
> 
>                   // consumer for task 1019
>                   uprobe_dispatcher
>                     uprobe_perf_func
>                       uprobe_perf_filter return false
> 
>                   // consumer for task 1018
>                   uprobe_dispatcher
>                     uprobe_perf_func
>                       uprobe_perf_filter return true
>                        -> could run bpf program, but none is configured
>                 }
> 
>                 prepare_uretprobe
>               }
> 
>    ->     uretprobe-hit
>             handle_swbp
>               uprobe_handle_trampoline
>                 handle_uretprobe_chain
>                 {
> 
>                   for_each_uprobe_consumer {
>                     
>                     // consumer for task 1019
>                     uretprobe_dispatcher
>                       uretprobe_perf_func
>                         -> runs bpf program
> 
>                     // consumer for task 1018
>                     uretprobe_dispatcher
>                       uretprobe_perf_func
>                         -> runs bpf program
> 
>                   }
>                 }
> 
>           sched-out 1019

ugh... should be 'sched-out 1018'

jirka

>             perf_trace_del
> 
> 
> and I think the same will happen for perf record in this case where instead of
> running the program we will execute perf_tp_event
> 
> I think the uretprobe_dispatcher could call filter as suggested in the original
> patch.. but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
> like we do in handler_chain.. maybe just to save the next uprobe hit which would
> remove the uprobe?
> 
> jirka
Oleg Nesterov Aug. 27, 2024, 2:26 p.m. UTC | #27
On 08/27, Jiri Olsa wrote:
>
> On Tue, Aug 27, 2024 at 12:40:52PM +0200, Oleg Nesterov wrote:
> >  static bool
> >  uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
> >  {
> >  	struct bpf_uprobe *uprobe;
> > +	struct task_struct *task, *t;
> > +	bool ret = false;
> >
> >  	uprobe = container_of(con, struct bpf_uprobe, consumer);
> > -	return uprobe->link->task->mm == mm;
> > +	task = uprobe->link->task;
> > +
> > +	rcu_read_lock();
> > +	for_each_thread(task, t) {
> > +		struct mm_struct *t_mm = READ_ONCE(t->mm);
> > +		if (t_mm) {
> > +			ret = t_mm == mm;
> > +			break;
> > +		}
> > +	}
> > +	rcu_read_unlock();
>
> that seems expensive if there's many threads

many threads with ->mm == NULL? In the likely case for_each_thread()
stops after the first t->mm check.

> could we check the leader first and only if it's gone fallback to this?

up to you..

Oleg.
Jiri Olsa Aug. 27, 2024, 2:41 p.m. UTC | #28
On Tue, Aug 27, 2024 at 04:26:08PM +0200, Oleg Nesterov wrote:
> On 08/27, Jiri Olsa wrote:
> >
> > On Tue, Aug 27, 2024 at 12:40:52PM +0200, Oleg Nesterov wrote:
> > >  static bool
> > >  uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
> > >  {
> > >  	struct bpf_uprobe *uprobe;
> > > +	struct task_struct *task, *t;
> > > +	bool ret = false;
> > >
> > >  	uprobe = container_of(con, struct bpf_uprobe, consumer);
> > > -	return uprobe->link->task->mm == mm;
> > > +	task = uprobe->link->task;
> > > +
> > > +	rcu_read_lock();
> > > +	for_each_thread(task, t) {
> > > +		struct mm_struct *t_mm = READ_ONCE(t->mm);
> > > +		if (t_mm) {
> > > +			ret = t_mm == mm;
> > > +			break;
> > > +		}
> > > +	}
> > > +	rcu_read_unlock();
> >
> > that seems expensive if there's many threads
> 
> many threads with ->mm == NULL? In the likely case for_each_thread()
> stops after the first t->mm check.

aah the mm will be the same.. right, nice

thanks,
jirka
Oleg Nesterov Aug. 27, 2024, 4:45 p.m. UTC | #29
On 08/27, Jiri Olsa wrote:
>
> On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote:
> >
> > So, can you reproduce the problem reported by Tianyi on your setup?
>
> yes, I can repduce the issue with uretprobe on top of perf event uprobe

...

>    ->     uretprobe-hit
>             handle_swbp
>               uprobe_handle_trampoline
>                 handle_uretprobe_chain
>                 {
>
>                   for_each_uprobe_consumer {
>
>                     // consumer for task 1019
>                     uretprobe_dispatcher
>                       uretprobe_perf_func
>                         -> runs bpf program
>
>                     // consumer for task 1018
>                     uretprobe_dispatcher
>                       uretprobe_perf_func
>                         -> runs bpf program

Confused...

I naively thought that if bpftrace uses bpf_uprobe_multi_link_attach() then
it won't use perf/trace_uprobe, and uretprobe-hit will result in

	// current->pid == 1018

	for_each_uprobe_consumer {
		// consumer for task 1019
		uprobe_multi_link_ret_handler
		    uprobe_prog_run
		       -> current->mm != link->task->mm, return

		// consumer for task 1018
		uprobe_multi_link_ret_handler
		    uprobe_prog_run
		       -> current->mm == link->task->mm, run bpf
	}

> I think the uretprobe_dispatcher could call filter as suggested in the original
> patch..

OK, agreed.

> but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
> like we do in handler_chain..

Me too. In any case this is another issue.

Oleg.
Oleg Nesterov Aug. 27, 2024, 8:19 p.m. UTC | #30
Sorry for another reply, I just noticed I missed one part of your email...

On 08/27, Jiri Olsa wrote:
>
>    ->     uretprobe-hit
>             handle_swbp
>               uprobe_handle_trampoline
>                 handle_uretprobe_chain
>                 {
>
>                   for_each_uprobe_consumer {
>
>                     // consumer for task 1019
>                     uretprobe_dispatcher
>                       uretprobe_perf_func
>                         -> runs bpf program
>
>                     // consumer for task 1018
>                     uretprobe_dispatcher
>                       uretprobe_perf_func
>                         -> runs bpf program
>
>                   }
>                 }
>
> and I think the same will happen for perf record in this case where instead of
> running the program we will execute perf_tp_event

Hmm. Really? In this case these 2 different consumers will have the different
trace_event_call's, so

	// consumer for task 1019
	uretprobe_dispatcher
	  uretprobe_perf_func
	    __uprobe_perf_func
	      perf_tp_event

won't store the event because this_cpu_ptr(call->perf_events) should be
hlist_empty() on this CPU, the perf_event for task 1019 wasn't scheduled in
on this CPU...

No?

Ok, looks like I'm totally confused ;)

Oleg.
Jiri Olsa Aug. 28, 2024, 11:40 a.m. UTC | #31
On Tue, Aug 27, 2024 at 06:45:45PM +0200, Oleg Nesterov wrote:
> On 08/27, Jiri Olsa wrote:
> >
> > On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote:
> > >
> > > So, can you reproduce the problem reported by Tianyi on your setup?
> >
> > yes, I can repduce the issue with uretprobe on top of perf event uprobe
> 
> ...
> 
> >    ->     uretprobe-hit
> >             handle_swbp
> >               uprobe_handle_trampoline
> >                 handle_uretprobe_chain
> >                 {
> >
> >                   for_each_uprobe_consumer {
> >
> >                     // consumer for task 1019
> >                     uretprobe_dispatcher
> >                       uretprobe_perf_func
> >                         -> runs bpf program
> >
> >                     // consumer for task 1018
> >                     uretprobe_dispatcher
> >                       uretprobe_perf_func
> >                         -> runs bpf program
> 
> Confused...
> 
> I naively thought that if bpftrace uses bpf_uprobe_multi_link_attach() then
> it won't use perf/trace_uprobe, and uretprobe-hit will result in

right, this path is for the case when bpftrace attaches to a single uprobe,
but there are 2 instances of bpftrace

jirka

> 
> 	// current->pid == 1018
> 
> 	for_each_uprobe_consumer {
> 		// consumer for task 1019
> 		uprobe_multi_link_ret_handler
> 		    uprobe_prog_run
> 		       -> current->mm != link->task->mm, return
> 
> 		// consumer for task 1018
> 		uprobe_multi_link_ret_handler
> 		    uprobe_prog_run
> 		       -> current->mm == link->task->mm, run bpf
> 	}
> 
> > I think the uretprobe_dispatcher could call filter as suggested in the original
> > patch..
> 
> OK, agreed.
> 
> > but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
> > like we do in handler_chain..
> 
> Me too. In any case this is another issue.
> 
> Oleg.
>
Jiri Olsa Aug. 28, 2024, 11:46 a.m. UTC | #32
On Tue, Aug 27, 2024 at 10:19:26PM +0200, Oleg Nesterov wrote:
> Sorry for another reply, I just noticed I missed one part of your email...
> 
> On 08/27, Jiri Olsa wrote:
> >
> >    ->     uretprobe-hit
> >             handle_swbp
> >               uprobe_handle_trampoline
> >                 handle_uretprobe_chain
> >                 {
> >
> >                   for_each_uprobe_consumer {
> >
> >                     // consumer for task 1019
> >                     uretprobe_dispatcher
> >                       uretprobe_perf_func
> >                         -> runs bpf program
> >
> >                     // consumer for task 1018
> >                     uretprobe_dispatcher
> >                       uretprobe_perf_func
> >                         -> runs bpf program
> >
> >                   }
> >                 }
> >
> > and I think the same will happen for perf record in this case where instead of
> > running the program we will execute perf_tp_event
> 
> Hmm. Really? In this case these 2 different consumers will have the different
> trace_event_call's, so
> 
> 	// consumer for task 1019
> 	uretprobe_dispatcher
> 	  uretprobe_perf_func
> 	    __uprobe_perf_func
> 	      perf_tp_event
> 
> won't store the event because this_cpu_ptr(call->perf_events) should be
> hlist_empty() on this CPU, the perf_event for task 1019 wasn't scheduled in
> on this CPU...

I'll double check on that, but because there's no filter for uretprobe
I think it will be stored under the 1018 event

> 
> No?
> 
> Ok, looks like I'm totally confused ;)

I'm working on bpf selftests for above (uprobe and uprobe_multi paths)
I plan to send them together with fixes we discussed earlier

I'm hoping this will make it more clear

jirka
Oleg Nesterov Aug. 29, 2024, 3:20 p.m. UTC | #33
On 08/28, Jiri Olsa wrote:
>
> On Tue, Aug 27, 2024 at 10:19:26PM +0200, Oleg Nesterov wrote:
> > Hmm. Really? In this case these 2 different consumers will have the different
> > trace_event_call's, so
> >
> > 	// consumer for task 1019
> > 	uretprobe_dispatcher
> > 	  uretprobe_perf_func
> > 	    __uprobe_perf_func
> > 	      perf_tp_event
> >
> > won't store the event because this_cpu_ptr(call->perf_events) should be
> > hlist_empty() on this CPU, the perf_event for task 1019 wasn't scheduled in
> > on this CPU...
>
> I'll double check on that,

Yes, please.

> but because there's no filter for uretprobe
> I think it will be stored under 1018 event
...
> I'm working on bpf selftests for above (uprobe and uprobe_multi paths)

Meanwhile, I decided to try to test this case too ;)

test.c:

	#include <unistd.h>

	int func(int i)
	{
		return i;
	}

	int main(void)
	{
		int i;
		for (i = 0;; ++i) {
			sleep(1);
			func(i);
		}
		return 0;
	}

run_probe.c:

	#include "./include/uapi/linux/perf_event.h"
	#define _GNU_SOURCE
	#include <sys/syscall.h>
	#include <sys/ioctl.h>
	#include <assert.h>
	#include <unistd.h>
	#include <stdlib.h>
	#include <stdio.h>

	// cat /sys/bus/event_source/devices/uprobe/type
	#define UPROBE_TYPE	9

	void run_probe(const char *file, unsigned offset, int pid)
	{
		struct perf_event_attr attr = {
			.type		= UPROBE_TYPE,
			.config		= 1, // ret-probe
			.uprobe_path	= (unsigned long)file,
			.probe_offset	= offset,
			.size		= sizeof(struct perf_event_attr),
		};

		int fd = syscall(__NR_perf_event_open, &attr, pid, 0, -1, 0);
		assert(fd >= 0);

		assert(ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0);

		for (;;)
			pause();
	}

	int main(int argc, const char *argv[])
	{
		int pid = atoi(argv[1]);
		run_probe("./test", 0x536, pid);
		return 0;
	}

Now, with the kernel patch below applied, I do

	$ ./test &
	$ PID1=$!
	$ ./test &
	$ PID2=$!

	$ ./run_probe $PID1 &
	$ ./run_probe $PID2 &

and the kernel prints:

	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1
	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1
	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1
	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1

and so on.

As you can see, perf_trace_buf_submit/etc is never called for the "unfiltered"
consumer, so I still think that perf is fine wrt filtering. But I can be easily
wrong, please check.

Oleg.


diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index acc73c1bc54c..14aa92a78d6d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2150,6 +2150,8 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
 
+	pr_crit("CHAIN\n");
+
 	rcu_read_lock_trace();
 	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		if (uc->ret_handler)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f7443e996b1b..e4eaa0363742 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1364,7 +1364,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
 	return ret;
 }
 
-static void __uprobe_perf_func(struct trace_uprobe *tu,
+static int __uprobe_perf_func(struct trace_uprobe *tu,
 			       unsigned long func, struct pt_regs *regs,
 			       struct uprobe_cpu_buffer **ucbp)
 {
@@ -1375,6 +1375,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	void *data;
 	int size, esize;
 	int rctx;
+	int ret = 0;
 
 #ifdef CONFIG_BPF_EVENTS
 	if (bpf_prog_array_valid(call)) {
@@ -1382,7 +1383,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 
 		ret = bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run);
 		if (!ret)
-			return;
+			return -1;
 	}
 #endif /* CONFIG_BPF_EVENTS */
 
@@ -1392,12 +1393,13 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	size = esize + ucb->dsize;
 	size = ALIGN(size + sizeof(u32), sizeof(u64)) - sizeof(u32);
 	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough"))
-		return;
+		return -1;
 
 	preempt_disable();
 	head = this_cpu_ptr(call->perf_events);
 	if (hlist_empty(head))
 		goto out;
+	ret = 1;
 
 	entry = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!entry)
@@ -1421,6 +1423,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 			      head, NULL);
  out:
 	preempt_enable();
+	return ret;
 }
 
 /* uprobe profile handler */
@@ -1439,7 +1442,15 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
 				struct pt_regs *regs,
 				struct uprobe_cpu_buffer **ucbp)
 {
-	__uprobe_perf_func(tu, func, regs, ucbp);
+	struct trace_uprobe_filter *filter = tu->tp.event->filter;
+	struct perf_event *event = list_first_entry(&filter->perf_events,
+					struct perf_event, hw.tp_list);
+
+	int r = __uprobe_perf_func(tu, func, regs, ucbp);
+
+	pr_crit("HANDLER pid=%d consumers_target=%d stored=%d\n",
+		current->pid, event->hw.target ? event->hw.target->pid : -1, r);
+
 }
 
 int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
Jiri Olsa Aug. 29, 2024, 7:46 p.m. UTC | #34
On Thu, Aug 29, 2024 at 05:20:33PM +0200, Oleg Nesterov wrote:

SNIP

> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index f7443e996b1b..e4eaa0363742 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1364,7 +1364,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
>  	return ret;
>  }
>  
> -static void __uprobe_perf_func(struct trace_uprobe *tu,
> +static int __uprobe_perf_func(struct trace_uprobe *tu,
>  			       unsigned long func, struct pt_regs *regs,
>  			       struct uprobe_cpu_buffer **ucbp)
>  {
> @@ -1375,6 +1375,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
>  	void *data;
>  	int size, esize;
>  	int rctx;
> +	int ret = 0;
>  
>  #ifdef CONFIG_BPF_EVENTS
>  	if (bpf_prog_array_valid(call)) {
> @@ -1382,7 +1383,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
>  
>  		ret = bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run);
>  		if (!ret)
> -			return;
> +			return -1;
>  	}
>  #endif /* CONFIG_BPF_EVENTS */
>  
> @@ -1392,12 +1393,13 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
>  	size = esize + ucb->dsize;
>  	size = ALIGN(size + sizeof(u32), sizeof(u64)) - sizeof(u32);
>  	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough"))
> -		return;
> +		return -1;
>  
>  	preempt_disable();
>  	head = this_cpu_ptr(call->perf_events);
>  	if (hlist_empty(head))
>  		goto out;

right.. if the event is not added by perf_trace_add on this cpu
it won't get past this point, so no problem for perf

but the issue is with the bpf program triggered earlier by a return uprobe
created via a perf event, and the [1] patch seems to fix that

I sent out the bpf selftest that triggers the issue [2]

thanks,
jirka


[1] https://lore.kernel.org/linux-trace-kernel/ME0P300MB0416034322B9915ECD3888649D882@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM/
[2] https://lore.kernel.org/bpf/20240829194505.402807-1-jolsa@kernel.org/T/#u


> +	ret = 1;
>  
>  	entry = perf_trace_buf_alloc(size, NULL, &rctx);
>  	if (!entry)
> @@ -1421,6 +1423,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
>  			      head, NULL);
>   out:
>  	preempt_enable();
> +	return ret;
>  }
>  
>  /* uprobe profile handler */
> @@ -1439,7 +1442,15 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
>  				struct pt_regs *regs,
>  				struct uprobe_cpu_buffer **ucbp)
>  {
> -	__uprobe_perf_func(tu, func, regs, ucbp);
> +	struct trace_uprobe_filter *filter = tu->tp.event->filter;
> +	struct perf_event *event = list_first_entry(&filter->perf_events,
> +					struct perf_event, hw.tp_list);
> +
> +	int r = __uprobe_perf_func(tu, func, regs, ucbp);
> +
> +	pr_crit("HANDLER pid=%d consumers_target=%d stored=%d\n",
> +		current->pid, event->hw.target ? event->hw.target->pid : -1, r);
> +
>  }
>  
>  int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
>
Oleg Nesterov Aug. 29, 2024, 9:12 p.m. UTC | #35
Ah. we certainly misunderstand each other.

On 08/29, Jiri Olsa wrote:
>
> On Thu, Aug 29, 2024 at 05:20:33PM +0200, Oleg Nesterov wrote:
>
> SNIP

SNIP

> right.. if the event is not added by perf_trace_add on this cpu
> it won't go pass this point, so no problem for perf

Yes, and this is what I tried to verify. In your previous email you said

	and I think the same will happen for perf record in this case where instead of
	running the program we will execute perf_tp_event

and I tried to verify this can't happen. So no problem for perf ;)

> but the issue is with bpf program triggered earlier by return uprobe

Well, the issue with bpf program (with the bpf_prog_array_valid(call) code
in __uprobe_perf_func) was clear from the very beginning, no questions.

> and [1] patch seems to fix that

I'd say this patch fixes the symptoms, and it doesn't fix all the problems.
But I can't suggest anything better for bpf code, so I won't really argue.
However the changelog and even the subject is wrong.

> I sent out the bpf selftest that triggers the issue [2]

Thanks, I'll try take a look tomorrow.

Oleg.
Jiri Olsa Aug. 29, 2024, 11:22 p.m. UTC | #36
On Thu, Aug 29, 2024 at 11:12:41PM +0200, Oleg Nesterov wrote:
> Ah. we certainly misunderstand each other.
> 
> On 08/29, Jiri Olsa wrote:
> >
> > On Thu, Aug 29, 2024 at 05:20:33PM +0200, Oleg Nesterov wrote:
> >
> > SNIP
> 
> SNIP
> 
> > right.. if the event is not added by perf_trace_add on this cpu
> > it won't go pass this point, so no problem for perf
> 
> Yes, and this is what I tried to verify. In your previous email you said
> 
> 	and I think the same will happen for perf record in this case where instead of
> 	running the program we will execute perf_tp_event
> 
> and I tried verify this can't happen. So no problem for perf ;)

yea, I was wrong, you should be used to it by now ;-)

> 
> > but the issue is with bpf program triggered earlier by return uprobe
> 
> Well, the issue with bpf program (with the bpf_prog_array_valid(call) code
> in __uprobe_perf_func) was clear from the very beginning, no questions.
> 
> > and [1] patch seems to fix that
> 
> I'd say this patch fixes the symptoms, and it doesn't fix all the problems.
> But I can't suggest anything better for bpf code, so I won't really argue.
> However the changelog and even the subject is wrong.
> 
> > I sent out the bpf selftest that triggers the issue [2]
> 
> Thanks, I'll try take a look tomorrow.

thanks,
jirka
Oleg Nesterov Aug. 30, 2024, 10:12 a.m. UTC | #37
The whole discussion was very confusing (yes, I too contributed to the
confusion ;), let me try to summarise.

> U(ret)probes are designed to be filterable using the PID, which is the
> second parameter in the perf_event_open syscall. Currently, uprobe works
> well with the filtering, but uretprobe is not affected by it.

And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
misunderstands the purpose of uprobe_perf_filter().

Let's forget about BPF for the moment. It is not that uprobe_perf_filter()
does the filtering by the PID, it doesn't. We can simply kill this function
and perf will work correctly. The perf layer in __uprobe_perf_func() does
the filtering when perf_event->hw.target != NULL.

So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
unapply_uprobe() in handler_chain().
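Roughly, the handler_chain() logic looks like this (a simplified sketch from
memory, not the verbatim code in kernel/events/uprobes.c):

```c
static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
{
	struct uprobe_consumer *uc;
	int remove = UPROBE_HANDLER_REMOVE;

	for (uc = uprobe->consumers; uc; uc = uc->next) {
		int rc = uc->handler ? uc->handler(uc, regs) : 0;

		/* int3 is removed only if _every_ consumer asks for it */
		remove &= rc;
	}

	if (remove)
		unapply_uprobe(uprobe, current->mm); /* zap int3 in this mm */
}
```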

Suppose you do, say,

	$ perf probe -x /path/to/libc some_hot_function
or
	$ perf probe -x /path/to/libc some_hot_function%return
then
	$ perf record -e ... -p 1

to trace the usage of some_hot_function() in the init process. Everything
will work just fine if we kill the uprobe_perf_func()->uprobe_perf_filter() call.

But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
So the child C will hit this breakpoint and cal handle_swbp/etc for no
reason every time it calls some_hot_function(), not good.

That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
call unapply_uprobe() which will remove this breakpoint from C->mm.

> We found that the filter function was not invoked when uretprobe was
> initially implemented, and this has been existing for ten years.

See above, this is correct.

Note also that if you only use perf-probe + perf-record, no matter how
many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().

---------------------------------------------------------------------------
Now let's return to BPF and this particular problem. I won't really argue
with this patch, but

	- Please change the subject and update the changelog,
	  "Fixes: c1ae5c75e103" and the whole reasoning is misleading
	  and wrong, IMO.

	- This patch won't fix all problems because uprobe_perf_filter()
	  filters by mm, not by pid. The changelog/patch assumes that it
	  is a "PID filter", but it is not.

	  See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
	  If the traced process does clone(CLONE_VM), bpftrace will hit a
	  similar problem, with uprobe or uretprobe.

	- So I still think that the "right" fix should change the
	  bpf_prog_run_array_uprobe() paths somehow, but I know nothing
	  about bpf.

Oleg.
Oleg Nesterov Aug. 30, 2024, 12:23 p.m. UTC | #38
On 08/30, Oleg Nesterov wrote:
>
> 	- This patch won't fix all problems because uprobe_perf_filter()
> 	  filters by mm, not by pid. The changelog/patch assumes that it
> 	  is a "PID filter", but it is not.
>
>	  See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/

Let me provide another example.

spawn.c:

	#include <spawn.h>
	#include <unistd.h>

	int main(int argc, char *argv[])
	{
		int pid;
		sleep(3); // wait for bpftrace to start
		posix_spawn(&pid, argv[1], NULL, NULL, argv+1, NULL);
		for (;;)
			pause();
	}

Now,

	$ ./spawn /usr/bin/true &
	bpftrace -p $! -e 'uprobe:/usr/lib64/libc.so.6:sigprocmask { printf("%d\n", pid); }'
	[2] 212851
	Attaching 1 probe...
	212855
	^C
	$ ./spawn /xxxxxxxxxxxx &
	bpftrace -p $! -e 'uprobe:/usr/lib64/libc.so.6:_exit { printf("%d\n", pid); }'
	[3] 212860
	Attaching 1 probe...
	212865
	^C

This patch can't help in this case.

Oleg.
Jiri Olsa Aug. 30, 2024, 1:34 p.m. UTC | #39
On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> The whole discussion was very confusing (yes, I too contributed to the
> confusion ;), let me try to summarise.
> 
> > U(ret)probes are designed to be filterable using the PID, which is the
> > second parameter in the perf_event_open syscall. Currently, uprobe works
> > well with the filtering, but uretprobe is not affected by it.
> 
> And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> misunderstands the purpose of uprobe_perf_filter().
> 
> Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> does the filtering by the PID, it doesn't. We can simply kill this function
> and perf will work correctly. The perf layer in __uprobe_perf_func() does
> the filtering when perf_event->hw.target != NULL.
> 
> So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid
> the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> unapply_uprobe() in handler_chain().
> 
> Suppose you do, say,
> 
> 	$ perf probe -x /path/to/libc some_hot_function
> or
> 	$ perf probe -x /path/to/libc some_hot_function%return
> then
> 	$perf record -e ... -p 1
> 
> to trace the usage of some_hot_function() in the init process. Everything
> will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter().
> 
> But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
> So the child C will hit this breakpoint and cal handle_swbp/etc for no
> reason every time it calls some_hot_function(), not good.
> 
> That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> call unapply_uprobe() which will remove this breakpoint from C->mm.

thanks for the info, I wasn't aware this was the intention

uprobe_multi does not have the perf event mechanism/check, so it's using
the filter function to do the process filtering, which is not working
properly as you pointed out earlier

> 
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has been existing for ten years.
> 
> See above, this is correct.
> 
> Note also that if you only use perf-probe + perf-record, no matter how
> many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
> uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().
> 
> ---------------------------------------------------------------------------
> Now lets return to BPF and this particular problem. I won't really argue
> with this patch, but
> 
> 	- Please change the subject and update the changelog,
> 	  "Fixes: c1ae5c75e103" and the whole reasoning is misleading
> 	  and wrong, IMO.
> 
> 	- This patch won't fix all problems because uprobe_perf_filter()
> 	  filters by mm, not by pid. The changelog/patch assumes that it
> 	  is a "PID filter", but it is not.
> 
> 	  See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
> 	  If the traced process does clone(CLONE_VM), bpftrace will hit the
> 	  similar problem, with uprobe or uretprobe.
> 
> 	- So I still think that the "right" fix should change the
> 	  bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> 	  about bpf.

to follow the perf event filter properly, bpf_prog_run_array_uprobe should
be called in perf_tp_event after the perf_tp_event_match call, but that code
already runs under preempt_disable, which is why it's called before that
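
[Editor's note] To make that ordering concrete, a rough sketch of the call
paths being discussed (heavily simplified pseudocode reconstructed from the
discussion, not actual kernel code):

```
__uprobe_perf_func(tu, func, regs, ucb)
	if bpf_prog_array_valid(call):
		ret = bpf_prog_run_array_uprobe(...)	// BPF runs here, before
		if ret == 0:				// any per-event matching
			return
	...
	perf_tp_event(...)				// preemption already disabled
		for each event on this CPU's call->perf_events list:
			if perf_tp_event_match(event, ...):	// per-event pid filter
				deliver the sample to event
```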

jirka
Andrii Nakryiko Aug. 30, 2024, 3:51 p.m. UTC | #40
On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> > The whole discussion was very confusing (yes, I too contributed to the
> > confusion ;), let me try to summarise.
> >
> > > U(ret)probes are designed to be filterable using the PID, which is the
> > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > well with the filtering, but uretprobe is not affected by it.
> >
> > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > misunderstands the purpose of uprobe_perf_filter().
> >
> > Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> > does the filtering by the PID, it doesn't. We can simply kill this function
> > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > the filtering when perf_event->hw.target != NULL.
> >
> > So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid
> > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > unapply_uprobe() in handler_chain().
> >
> > Suppose you do, say,
> >
> >       $ perf probe -x /path/to/libc some_hot_function
> > or
> >       $ perf probe -x /path/to/libc some_hot_function%return
> > then
> >       $perf record -e ... -p 1
> >
> > to trace the usage of some_hot_function() in the init process. Everything
> > will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter().
> >
> > But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
> > So the child C will hit this breakpoint and cal handle_swbp/etc for no
> > reason every time it calls some_hot_function(), not good.
> >
> > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> > call unapply_uprobe() which will remove this breakpoint from C->mm.
>
> thanks for the info, I wasn't aware this was the intention
>
> uprobe_multi does not have perf event mechanism/check, so it's using
> the filter function to do the process filtering.. which is not working
> properly as you pointed out earlier

So this part I don't completely get. I get that using the task->mm
comparison is wrong due to CLONE_VM, but why is the same_thread_group()
check wrong? I.e., why is the task->signal comparison wrong?

Sorry, I tried to follow the thread, but there were a lot of emails
with a bunch of confusion, so I'm not sure I got all the details.

>
> >
> > > We found that the filter function was not invoked when uretprobe was
> > > initially implemented, and this has been existing for ten years.
> >
> > See above, this is correct.
> >
> > Note also that if you only use perf-probe + perf-record, no matter how
> > many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
> > uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().
> >
> > ---------------------------------------------------------------------------
> > Now lets return to BPF and this particular problem. I won't really argue
> > with this patch, but
> >
> >       - Please change the subject and update the changelog,
> >         "Fixes: c1ae5c75e103" and the whole reasoning is misleading
> >         and wrong, IMO.
> >
> >       - This patch won't fix all problems because uprobe_perf_filter()
> >         filters by mm, not by pid. The changelog/patch assumes that it
> >         is a "PID filter", but it is not.
> >
> >         See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
> >         If the traced process does clone(CLONE_VM), bpftrace will hit the
> >         similar problem, with uprobe or uretprobe.
> >
> >       - So I still think that the "right" fix should change the
> >         bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> >         about bpf.
>
> to follow the perf event filter properly, bpf_prog_run_array_uprobe should
> be called in perf_tp_event after the perf_tp_event_match call, but that's
> already under preempt_disable, so that's why it's called before that
>
> jirka
Tianyi Liu Sept. 1, 2024, 7:22 p.m. UTC | #41
On Fri, Aug 30, 2024 at 18:12:41PM +0800, Oleg Nesterov wrote:
> The whole discussion was very confusing (yes, I too contributed to the
> confusion ;), let me try to summarise.
> 
> > U(ret)probes are designed to be filterable using the PID, which is the
> > second parameter in the perf_event_open syscall. Currently, uprobe works
> > well with the filtering, but uretprobe is not affected by it.
> 
> And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> misunderstands the purpose of uprobe_perf_filter().
> 
> Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> does the filtering by the PID, it doesn't. We can simply kill this function
> and perf will work correctly. The perf layer in __uprobe_perf_func() does
> the filtering when perf_event->hw.target != NULL.
> 
> So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid
> the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> unapply_uprobe() in handler_chain().
> 
> Suppose you do, say,
> 
> 	$ perf probe -x /path/to/libc some_hot_function
> or
> 	$ perf probe -x /path/to/libc some_hot_function%return
> then
> 	$perf record -e ... -p 1
> 
> to trace the usage of some_hot_function() in the init process. Everything
> will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter().
> 
> But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
> So the child C will hit this breakpoint and cal handle_swbp/etc for no
> reason every time it calls some_hot_function(), not good.
> 
> That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> call unapply_uprobe() which will remove this breakpoint from C->mm.
> 
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has been existing for ten years.
> 
> See above, this is correct.
> 
> Note also that if you only use perf-probe + perf-record, no matter how
> many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
> uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().
> 

Thanks for the detailed explanation above, I can understand the code now. 
Yes, I completely misunderstood the purpose of uprobe_perf_filter, 
sorry for the confusion.

> ---------------------------------------------------------------------------
> Now lets return to BPF and this particular problem. I won't really argue
> with this patch, but
> 
> 	- Please change the subject and update the changelog,
> 	  "Fixes: c1ae5c75e103" and the whole reasoning is misleading
> 	  and wrong, IMO.
> 
> 	- This patch won't fix all problems because uprobe_perf_filter()
> 	  filters by mm, not by pid. The changelog/patch assumes that it
> 	  is a "PID filter", but it is not.
> 
> 	  See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
> 	  If the traced process does clone(CLONE_VM), bpftrace will hit the
> 	  similar problem, with uprobe or uretprobe.
> 
> 	- So I still think that the "right" fix should change the
> 	  bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> 	  about bpf.

I agree that this patch does not address the issue correctly. 
The PID filter should be implemented within bpf_prog_run_array_uprobe, 
or alternatively, bpf_prog_run_array_uprobe should be called after 
perf_tp_event_match to reuse the filtering mechanism provided by perf.

Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe.

For now, please forget the original patch as we need a new solution ;)


Thanks for your time,
Oleg Nesterov Sept. 1, 2024, 11:26 p.m. UTC | #42
On 09/02, Tianyi Liu wrote:
>
> On Fri, Aug 30, 2024 at 18:12:41PM +0800, Oleg Nesterov wrote:
> >
> > 	- So I still think that the "right" fix should change the
> > 	  bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> > 	  about bpf.
>
> I agree that this patch does not address the issue correctly.
> The PID filter should be implemented within bpf_prog_run_array_uprobe,

OK,

> or alternatively, bpf_prog_run_array_uprobe should be called after
> perf_tp_event_match to reuse the filtering mechanism provided by perf.

No, no, perf_tp_event_match() has nothing to do with pid/mm filtering in
this case, afaics.

See https://lore.kernel.org/all/20240829152032.GA23996@redhat.com/
and perf_trace_add() which adds the event to this_cpu_ptr(call->perf_events),
see also the hlist_empty(head) check in __uprobe_perf_func().

Again, I can easily be wrong, I have forgotten most of what I knew (not too
much) about perf, but at least in the case above perf_tp_event_match() is not
even called.

> Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe.

Maybe... Not sure. But why do you think we want it?


And... I think that BPF has even more problems with filtering. Not sure,
I'll try to write another test-case tomorrow.

Oleg.
Jiri Olsa Sept. 2, 2024, 9:11 a.m. UTC | #43
On Fri, Aug 30, 2024 at 08:51:12AM -0700, Andrii Nakryiko wrote:
> On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> > > The whole discussion was very confusing (yes, I too contributed to the
> > > confusion ;), let me try to summarise.
> > >
> > > > U(ret)probes are designed to be filterable using the PID, which is the
> > > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > > well with the filtering, but uretprobe is not affected by it.
> > >
> > > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > > misunderstands the purpose of uprobe_perf_filter().
> > >
> > > Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> > > does the filtering by the PID, it doesn't. We can simply kill this function
> > > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > > the filtering when perf_event->hw.target != NULL.
> > >
> > > So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid
> > > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > > unapply_uprobe() in handler_chain().
> > >
> > > Suppose you do, say,
> > >
> > >       $ perf probe -x /path/to/libc some_hot_function
> > > or
> > >       $ perf probe -x /path/to/libc some_hot_function%return
> > > then
> > >       $perf record -e ... -p 1
> > >
> > > to trace the usage of some_hot_function() in the init process. Everything
> > > will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter().
> > >
> > > But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
> > > So the child C will hit this breakpoint and cal handle_swbp/etc for no
> > > reason every time it calls some_hot_function(), not good.
> > >
> > > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> > > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> > > call unapply_uprobe() which will remove this breakpoint from C->mm.
> >
> > thanks for the info, I wasn't aware this was the intention
> >
> > uprobe_multi does not have perf event mechanism/check, so it's using
> > the filter function to do the process filtering.. which is not working
> > properly as you pointed out earlier
> 
> So this part I don't completely get. I get that using task->mm
> comparison is wrong due to CLONE_VM, but why same_thread_group() check
> is wrong? I.e., why task->signal comparison is wrong?

the way I understand it is that we take the group leader task and
store it in bpf_uprobe_multi_link::task

but it can exit while the rest of the threads are still running, so
uprobe_multi_link_filter() won't match them (leader->mm is NULL)

Oleg suggested change below (in addition to same_thread_group change)
to take that in account

jirka


---
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 98e395f1baae..9e6b390aa6da 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx
 			 struct mm_struct *mm)
 {
 	struct bpf_uprobe *uprobe;
+	struct task_struct *task, *t;
+	bool ret = false;
 
 	uprobe = container_of(con, struct bpf_uprobe, consumer);
-	return uprobe->link->task->mm == mm;
+	task = uprobe->link->task;
+
+	rcu_read_lock();
+	for_each_thread(task, t) {
+		struct mm_struct *t_mm = READ_ONCE(t->mm);
+		if (t_mm) {
+			ret = t_mm == mm;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int
Oleg Nesterov Sept. 2, 2024, 5:17 p.m. UTC | #44
On 09/02, Oleg Nesterov wrote:
>
> And... I think that BPF has even more problems with filtering. Not sure,
> I'll try to write another test-case tomorrow.

See below. This test-case needs a one-liner patch at the end, but this is only
because I have no idea how to add BPF_EMIT_CALL(BPF_FUNC_trace_printk) into
"struct bpf_insn insns[]" correctly. Is there a simple-to-use user-space tool
which can translate 'bpf_trace_printk("Hello world\n", 13)' into bpf_insn[] ???

So. The CONFIG_BPF_EVENTS code in __uprobe_perf_func() assumes that it "owns"
tu->consumer and uprobe_perf_filter(), but this is not true in general.

test.c:
	#include <unistd.h>

	int func(int i)
	{
		return i;
	}

	int main(void)
	{
		int i;
		for (i = 0;; ++i) {
			sleep(1);
			func(i);
		}
		return 0;
	}

run_prog.c
	// cc -I./tools/include -I./tools/include/uapi -Wall
	#include "./include/generated/uapi/linux/version.h"
	#include <linux/perf_event.h>
	#include <linux/filter.h>
	#define _GNU_SOURCE
	#include <sys/syscall.h>
	#include <sys/ioctl.h>
	#include <assert.h>
	#include <unistd.h>
	#include <stdlib.h>

	int prog_load(void)
	{
		struct bpf_insn insns[] = {
			BPF_MOV64_IMM(BPF_REG_0, 0),
			BPF_EXIT_INSN(),
		};

		union bpf_attr attr = {
			.prog_type	= BPF_PROG_TYPE_KPROBE,
			.insns		= (unsigned long)insns,
			.insn_cnt	= sizeof(insns) / sizeof(insns[0]),
			.license	= (unsigned long)"GPL",
			.kern_version	= LINUX_VERSION_CODE, // unneeded
		};

		return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
	}

	void run_probe(int eid, int pid)
	{
		struct perf_event_attr attr = {
			.type	= PERF_TYPE_TRACEPOINT,
			.config	= eid,
			.size	= sizeof(struct perf_event_attr),
		};

		int fd = syscall(__NR_perf_event_open, &attr, pid, 0, -1, 0);
		assert(fd >= 0);

		int pfd = prog_load();
		assert(pfd >= 0);

		assert(ioctl(fd, PERF_EVENT_IOC_SET_BPF, pfd) == 0);
		assert(ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0);

		for (;;)
			pause();
	}

	int main(int argc, const char *argv[])
	{
		int eid = atoi(argv[1]);
		int pid = atoi(argv[2]);
		run_probe(eid, pid);
		return 0;
	}

Now,

	$ ./test &
	$ PID1=$!
	$ ./test &
	$ PID2=$!
	$ perf probe -x ./test -a func
	$ ./run_prog `cat /sys/kernel/debug/tracing/events/probe_test/func/id` $PID1 &

dmesg -c:
	trace_uprobe: BPF_FUNC: pid=50 ret=0
	trace_uprobe: BPF_FUNC: pid=50 ret=0
	trace_uprobe: BPF_FUNC: pid=50 ret=0
	trace_uprobe: BPF_FUNC: pid=50 ret=0
	...

So far so good. Now,

	$ perf record -e probe_test:func -p $PID2 -- sleep 10 &

This creates another PERF_TYPE_TRACEPOINT perf_event which shares
trace_uprobe/consumer/filter with the perf_event created by run_prog.

dmesg -c:
	trace_uprobe: BPF_FUNC: pid=51 ret=0
	trace_uprobe: BPF_FUNC: pid=50 ret=0
	trace_uprobe: BPF_FUNC: pid=51 ret=0
	trace_uprobe: BPF_FUNC: pid=50 ret=0
	...

until perf-record exits. and after that

	$ perf script

reports nothing.

So, in this case:

	- run_prog's bpf program is called when current->pid == $PID2, this patch
	  (or any other change in trace_uprobe.c) can't help.

	- run_prog's bpf program "steals" __uprobe_perf_func() from /usr/bin/perf

and to me this is yet another indication that we need some changes in the
bpf_prog_run_array_uprobe() paths, or even in the user-space bpftrace/whatever's
code.

And. Why does the "if (bpf_prog_array_valid(call))" block in __uprobe_perf_func() return?
Why does this depend on "ret == 0"? I fail to understand this logic.

Oleg.

--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1381,6 +1381,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 		u32 ret;
 
 		ret = bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run);
+		pr_crit("BPF_FUNC: pid=%d ret=%d\n", current->pid, ret);
 		if (!ret)
 			return;
 	}
Jiri Olsa Sept. 3, 2024, 2:33 p.m. UTC | #45
On Mon, Sep 02, 2024 at 07:17:45PM +0200, Oleg Nesterov wrote:
> On 09/02, Oleg Nesterov wrote:
> >
> > And... I think that BPF has even more problems with filtering. Not sure,
> > I'll try to write another test-case tomorrow.
> 
> See below. This test-case needs a one-liner patch at the end, but this is only
> because I have no idea how to add BPF_EMIT_CALL(BPF_FUNC_trace_printk) into
> "struct bpf_insn insns[]" correctly. Is there a simple-to-use user-space tool
> which can translate 'bpf_trace_printk("Hello world\n", 13)' into bpf_insn[] ???
> 
> So. The CONFIG_BPF_EVENTS code in __uprobe_perf_func() assumes that it "owns"
> tu->consumer and uprobe_perf_filter(), but this is not true in general.
> 
> test.c:
> 	#include <unistd.h>
> 
> 	int func(int i)
> 	{
> 		return i;
> 	}
> 
> 	int main(void)
> 	{
> 		int i;
> 		for (i = 0;; ++i) {
> 			sleep(1);
> 			func(i);
> 		}
> 		return 0;
> 	}
> 
> run_prog.c
> 	// cc -I./tools/include -I./tools/include/uapi -Wall
> 	#include "./include/generated/uapi/linux/version.h"
> 	#include <linux/perf_event.h>
> 	#include <linux/filter.h>
> 	#define _GNU_SOURCE
> 	#include <sys/syscall.h>
> 	#include <sys/ioctl.h>
> 	#include <assert.h>
> 	#include <unistd.h>
> 	#include <stdlib.h>
> 
> 	int prog_load(void)
> 	{
> 		struct bpf_insn insns[] = {
> 			BPF_MOV64_IMM(BPF_REG_0, 0),
> 			BPF_EXIT_INSN(),
> 		};
> 
> 		union bpf_attr attr = {
> 			.prog_type	= BPF_PROG_TYPE_KPROBE,
> 			.insns		= (unsigned long)insns,
> 			.insn_cnt	= sizeof(insns) / sizeof(insns[0]),
> 			.license	= (unsigned long)"GPL",
> 			.kern_version	= LINUX_VERSION_CODE, // unneeded
> 		};
> 
> 		return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
> 	}
> 
> 	void run_probe(int eid, int pid)
> 	{
> 		struct perf_event_attr attr = {
> 			.type	= PERF_TYPE_TRACEPOINT,
> 			.config	= eid,
> 			.size	= sizeof(struct perf_event_attr),
> 		};
> 
> 		int fd = syscall(__NR_perf_event_open, &attr, pid, 0, -1, 0);
> 		assert(fd >= 0);
> 
> 		int pfd = prog_load();
> 		assert(pfd >= 0);
> 
> 		assert(ioctl(fd, PERF_EVENT_IOC_SET_BPF, pfd) == 0);
> 		assert(ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0);
> 
> 		for (;;)
> 			pause();
> 	}
> 
> 	int main(int argc, const char *argv[])
> 	{
> 		int eid = atoi(argv[1]);
> 		int pid = atoi(argv[2]);
> 		run_probe(eid, pid);
> 		return 0;
> 	}
> 
> Now,
> 
> 	$ ./test &
> 	$ PID1=$!
> 	$ ./test &
> 	$ PID2=$!
> 	$ perf probe -x ./test -a func
> 	$ ./run_prog `cat /sys/kernel/debug/tracing/events/probe_test/func/id` $PID1 &
> 
> dmesg -c:
> 	trace_uprobe: BPF_FUNC: pid=50 ret=0
> 	trace_uprobe: BPF_FUNC: pid=50 ret=0
> 	trace_uprobe: BPF_FUNC: pid=50 ret=0
> 	trace_uprobe: BPF_FUNC: pid=50 ret=0
> 	...
> 
> So far so good. Now,
> 
> 	$ perf record -e probe_test:func -p $PID2 -- sleep 10 &
> 
> This creates another PERF_TYPE_TRACEPOINT perf_event which shares
> trace_uprobe/consumer/filter with the perf_event created by run_prog.
> 
> dmesg -c:
> 	trace_uprobe: BPF_FUNC: pid=51 ret=0
> 	trace_uprobe: BPF_FUNC: pid=50 ret=0
> 	trace_uprobe: BPF_FUNC: pid=51 ret=0
> 	trace_uprobe: BPF_FUNC: pid=50 ret=0
> 	...
> 
> until perf-record exits. and after that
> 
> 	$ perf script
> 
> reports nothing.
> 
> So, in this case:
> 
> 	- run_prog's bpf program is called when current->pid == $PID2, this patch
> 	  (or any other change in trace_uprobe.c) can't help.
> 
> 	- run_prog's bpf program "steals" __uprobe_perf_func() from /usr/bin/perf

ok, there's just one call instance (struct trace_event_call) shared
by all the perf events on the uprobe tracepoint, so if there's a bpf
program attached through any of them, it will not continue and send the
perf event for any other event (one without the bpf program attached)

I think there might be similar issue with tracepoints and kprobes

> 
> and to me this is yet another indication that we need some changes in the
> bpf_prog_run_array_uprobe() paths, or even in the user-space bpftrace/whatever's
> code.
> 
> And. Why the "if (bpf_prog_array_valid(call))" block in __uprobe_perf_func() returns?
> Why this depends on "ret == 0" ??? I fail to understand this logic.

I'd think the intention was that if there's a bpf program we don't emit
the perf event, and we never thought about having them (an event with a bpf
program and a standard perf event) co-existing

the problem is that the perf event pid filtering is implemented through
the call->perf_events list

jirka
Andrii Nakryiko Sept. 3, 2024, 6:09 p.m. UTC | #46
On Mon, Sep 2, 2024 at 2:11 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, Aug 30, 2024 at 08:51:12AM -0700, Andrii Nakryiko wrote:
> > On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> > > > The whole discussion was very confusing (yes, I too contributed to the
> > > > confusion ;), let me try to summarise.
> > > >
> > > > > U(ret)probes are designed to be filterable using the PID, which is the
> > > > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > > > well with the filtering, but uretprobe is not affected by it.
> > > >
> > > > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > > > misunderstands the purpose of uprobe_perf_filter().
> > > >
> > > > Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> > > > does the filtering by the PID, it doesn't. We can simply kill this function
> > > > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > > > the filtering when perf_event->hw.target != NULL.
> > > >
> > > > So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid
> > > > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > > > unapply_uprobe() in handler_chain().
> > > >
> > > > Suppose you do, say,
> > > >
> > > >       $ perf probe -x /path/to/libc some_hot_function
> > > > or
> > > >       $ perf probe -x /path/to/libc some_hot_function%return
> > > > then
> > > >       $perf record -e ... -p 1
> > > >
> > > > to trace the usage of some_hot_function() in the init process. Everything
> > > > will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter().
> > > >
> > > > But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
> > > > So the child C will hit this breakpoint and cal handle_swbp/etc for no
> > > > reason every time it calls some_hot_function(), not good.
> > > >
> > > > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> > > > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> > > > call unapply_uprobe() which will remove this breakpoint from C->mm.
> > >
> > > thanks for the info, I wasn't aware this was the intention
> > >
> > > uprobe_multi does not have perf event mechanism/check, so it's using
> > > the filter function to do the process filtering.. which is not working
> > > properly as you pointed out earlier
> >
> > So this part I don't completely get. I get that using task->mm
> > comparison is wrong due to CLONE_VM, but why same_thread_group() check
> > is wrong? I.e., why task->signal comparison is wrong?
>
> the way I understand it is that we take the group leader task and
> store it in bpf_uprobe_multi_link::task
>
> but it can exit while the rest of the threads is still running so
> the uprobe_multi_link_filter won't match them (leader->mm is NULL)

Aren't we conflating two things here? Yes, from what Oleg explained,
it's clear that using task->mm is wrong. So that is what I feel is the
main issue. We shouldn't use task->mm at all, only task->signal should
be used instead. We should fix that (in bpf tree, please).

But I don't get the concern about link->task->mm or link->task->signal
becoming NULL because of a task exiting. Look at put_task_struct(), it WILL
NOT call __put_task_struct() (which then calls put_signal_struct()),
so task->signal at least will be there and valid until multi-uprobe is
detached and we call put_task_struct().

So. Can you please send fixes against the bpf tree, switching to
task->signal? And maybe also include the fix to prevent
UPROBE_HANDLER_REMOVE from being returned from the BPF program?

This thread is almost 50 emails deep now, we should break out of it.
We can argue on your actual fixes. :)

>
> Oleg suggested change below (in addition to same_thread_group change)
> to take that in account
>
> jirka
>
>
> ---
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 98e395f1baae..9e6b390aa6da 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx
>                          struct mm_struct *mm)
>  {
>         struct bpf_uprobe *uprobe;
> +       struct task_struct *task, *t;
> +       bool ret = false;
>
>         uprobe = container_of(con, struct bpf_uprobe, consumer);
> -       return uprobe->link->task->mm == mm;
> +       task = uprobe->link->task;
> +
> +       rcu_read_lock();
> +       for_each_thread(task, t) {
> +               struct mm_struct *t_mm = READ_ONCE(t->mm);
> +               if (t_mm) {
> +                       ret = t_mm == mm;
> +                       break;
> +               }
> +       }
> +       rcu_read_unlock();
> +
> +       return ret;
>  }
>
>  static int
Andrii Nakryiko Sept. 3, 2024, 6:11 p.m. UTC | #47
On Tue, Sep 3, 2024 at 11:09 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Mon, Sep 2, 2024 at 2:11 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, Aug 30, 2024 at 08:51:12AM -0700, Andrii Nakryiko wrote:
> > > On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > >
> > > > On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> > > > > The whole discussion was very confusing (yes, I too contributed to the
> > > > > confusion ;), let me try to summarise.
> > > > >
> > > > > > U(ret)probes are designed to be filterable using the PID, which is the
> > > > > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > > > > well with the filtering, but uretprobe is not affected by it.
> > > > >
> > > > > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > > > > misunderstands the purpose of uprobe_perf_filter().
> > > > >
> > > > > Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> > > > > does the filtering by the PID, it doesn't. We can simply kill this function
> > > > > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > > > > the filtering when perf_event->hw.target != NULL.
> > > > >
> > > > > So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid
> > > > > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > > > > unapply_uprobe() in handler_chain().
> > > > >
> > > > > Suppose you do, say,
> > > > >
> > > > >       $ perf probe -x /path/to/libc some_hot_function
> > > > > or
> > > > >       $ perf probe -x /path/to/libc some_hot_function%return
> > > > > then
> > > > >       $perf record -e ... -p 1
> > > > >
> > > > > to trace the usage of some_hot_function() in the init process. Everything
> > > > > will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter().
> > > > >
> > > > > But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
> > > > > So the child C will hit this breakpoint and cal handle_swbp/etc for no
> > > > > reason every time it calls some_hot_function(), not good.
> > > > >
> > > > > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> > > > > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> > > > > call unapply_uprobe() which will remove this breakpoint from C->mm.
> > > >
> > > > thanks for the info, I wasn't aware this was the intention
> > > >
> > > > uprobe_multi does not have the perf event mechanism/check, so it's using
> > > > the filter function to do the process filtering, which is not working
> > > > properly, as you pointed out earlier
> > >
> > > So this part I don't completely get. I get that using task->mm
> > > comparison is wrong due to CLONE_VM, but why same_thread_group() check
> > > is wrong? I.e., why task->signal comparison is wrong?
> >
> > the way I understand it is that we take the group leader task and
> > store it in bpf_uprobe_multi_link::task
> >
> > but it can exit while the rest of the threads is still running so
> > the uprobe_multi_link_filter won't match them (leader->mm is NULL)
>
> Aren't we conflating two things here? Yes, from what Oleg explained,
> it's clear that using task->mm is wrong. So that is what I feel is the
> main issue. We shouldn't use task->mm at all, only task->signal should
> be used instead. We should fix that (in bpf tree, please).
>
> But I don't get the concern about link->task->mm or link->task->signal becoming

correction, we shouldn't worry about *link->task->signal* becoming NULL.
link->task->mm can become NULL, but we don't care about that (once we fix
filtering logic in multi-uprobe).

> NULL because of a task exiting. Look at put_task_struct(), it WILL
> NOT call __put_task_struct() (which then calls put_signal_struct()),
> so task->signal at least will be there and valid until multi-uprobe is
> detached and we call put_task().
>
> So. Can you please send fixes against the bpf tree, switching to
> task->signal? And maybe also include the fix to prevent
> UPROBE_HANDLER_REMOVE to be returned from the BPF program?
>
> This thread is almost 50 emails deep now, we should break out of it.
> We can argue on your actual fixes. :)
>
> >
> > Oleg suggested change below (in addition to same_thread_group change)
> > to take that in account
> >
> > jirka
> >
> >
> > ---
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 98e395f1baae..9e6b390aa6da 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx
> >                          struct mm_struct *mm)
> >  {
> >         struct bpf_uprobe *uprobe;
> > +       struct task_struct *task, *t;
> > +       bool ret = false;
> >
> >         uprobe = container_of(con, struct bpf_uprobe, consumer);
> > -       return uprobe->link->task->mm == mm;
> > +       task = uprobe->link->task;
> > +
> > +       rcu_read_lock();
> > +       for_each_thread(task, t) {
> > +               struct mm_struct *t_mm = READ_ONCE(t->mm);
> > +               if (t_mm) {
> > +                       ret = t_mm == mm;
> > +                       break;
> > +               }
> > +       }
> > +       rcu_read_unlock();
> > +
> > +       return ret;
> >  }
> >
> >  static int
Jiri Olsa Sept. 3, 2024, 7:15 p.m. UTC | #48
On Tue, Sep 03, 2024 at 11:11:06AM -0700, Andrii Nakryiko wrote:

SNIP

> > Aren't we conflating two things here? Yes, from what Oleg explained,
> > it's clear that using task->mm is wrong. So that is what I feel is the
> > main issue. We shouldn't use task->mm at all, only task->signal should
> > be used instead. We should fix that (in bpf tree, please).
> >
> > But I don't get the concern about link->task->mm or link->task->signal becoming
> 
> correction, we shouldn't worry about *link->task->signal* becoming NULL.
> link->task->mm can become NULL, but we don't care about that (once we fix
> filtering logic in multi-uprobe).
> 
> > NULL because of a task exiting. Look at put_task_struct(), it WILL
> > NOT call __put_task_struct() (which then calls put_signal_struct()),
> > so task->signal at least will be there and valid until multi-uprobe is
> > detached and we call put_task().
> >
> > So. Can you please send fixes against the bpf tree, switching to
> > task->signal? And maybe also include the fix to prevent
> > UPROBE_HANDLER_REMOVE to be returned from the BPF program?

ok, it's uprobe-multi specific, let's discuss that over the change
itself, I'll try to send it soon

jirka


> >
> > This thread is almost 50 emails deep now, we should break out of it.
> > We can argue on your actual fixes. :)
> >
> > >
> > > Oleg suggested change below (in addition to same_thread_group change)
> > > to take that in account
> > >
> > > jirka
> > >
> > >
> > > ---
> > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > index 98e395f1baae..9e6b390aa6da 100644
> > > --- a/kernel/trace/bpf_trace.c
> > > +++ b/kernel/trace/bpf_trace.c
> > > @@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx
> > >                          struct mm_struct *mm)
> > >  {
> > >         struct bpf_uprobe *uprobe;
> > > +       struct task_struct *task, *t;
> > > +       bool ret = false;
> > >
> > >         uprobe = container_of(con, struct bpf_uprobe, consumer);
> > > -       return uprobe->link->task->mm == mm;
> > > +       task = uprobe->link->task;
> > > +
> > > +       rcu_read_lock();
> > > +       for_each_thread(task, t) {
> > > +               struct mm_struct *t_mm = READ_ONCE(t->mm);
> > > +               if (t_mm) {
> > > +                       ret = t_mm == mm;
> > > +                       break;
> > > +               }
> > > +       }
> > > +       rcu_read_unlock();
> > > +
> > > +       return ret;
> > >  }
> > >
> > >  static int
Jiri Olsa Sept. 6, 2024, 10:43 a.m. UTC | #49
On Mon, Sep 02, 2024 at 03:22:25AM +0800, Tianyi Liu wrote:
> On Fri, Aug 30, 2024 at 18:12:41PM +0800, Oleg Nesterov wrote:
> > The whole discussion was very confusing (yes, I too contributed to the
> > confusion ;), let me try to summarise.
> > 
> > > U(ret)probes are designed to be filterable using the PID, which is the
> > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > well with the filtering, but uretprobe is not affected by it.
> > 
> > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > misunderstands the purpose of uprobe_perf_filter().
> > 
> > Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> > does the filtering by the PID, it doesn't. We can simply kill this function
> > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > the filtering when perf_event->hw.target != NULL.
> > 
> > So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
> > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > unapply_uprobe() in handler_chain().
> > 
> > Suppose you do, say,
> > 
> > 	$ perf probe -x /path/to/libc some_hot_function
> > or
> > 	$ perf probe -x /path/to/libc some_hot_function%return
> > then
> > 	$ perf record -e ... -p 1
> > 
> > to trace the usage of some_hot_function() in the init process. Everything
> > will work just fine if we kill uprobe_perf_func()->uprobe_perf_filter().
> > 
> > But. If INIT forks a child C, dup_mm() will copy the int3 installed by perf.
> > So the child C will hit this breakpoint and call handle_swbp/etc for no
> > reason every time it calls some_hot_function(), not good.
> > 
> > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> > call unapply_uprobe() which will remove this breakpoint from C->mm.
> > 
> > > We found that the filter function was not invoked when uretprobe was
> > > initially implemented, and this has existed for ten years.
> > 
> > See above, this is correct.
> > 
> > Note also that if you only use perf-probe + perf-record, no matter how
> > many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
> > uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().
> > 
> 
> Thanks for the detailed explanation above, I can understand the code now. 
> Yes, I completely misunderstood the purpose of uprobe_perf_filter, 
> sorry for the confusion.
> 
> > ---------------------------------------------------------------------------
> > Now lets return to BPF and this particular problem. I won't really argue
> > with this patch, but
> > 
> > 	- Please change the subject and update the changelog,
> > 	  "Fixes: c1ae5c75e103" and the whole reasoning is misleading
> > 	  and wrong, IMO.
> > 
> > 	- This patch won't fix all problems because uprobe_perf_filter()
> > 	  filters by mm, not by pid. The changelog/patch assumes that it
> > 	  is a "PID filter", but it is not.
> > 
> > 	  See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
> > 	  If the traced process does clone(CLONE_VM), bpftrace will hit the
> > 	  similar problem, with uprobe or uretprobe.
> > 
> > 	- So I still think that the "right" fix should change the
> > 	  bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> > 	  about bpf.
> 
> I agree that this patch does not address the issue correctly. 
> The PID filter should be implemented within bpf_prog_run_array_uprobe, 
> or alternatively, bpf_prog_run_array_uprobe should be called after 
> perf_tp_event_match to reuse the filtering mechanism provided by perf.
> 
> Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe.
> 
> For now, please forget the original patch as we need a new solution ;)

hi,
any chance we could go with your fix until we find a better solution?

it's simple and it fixes most of the cases for the return uprobe pid filter
for events with bpf programs.. I know during the discussion we found
that the standard perf record path won't work if there's a bpf program
attached on the same event, but I think that likely needs more
changes, and also it's not a common use case

would you consider sending another version addressing Oleg's points
for changelog above?

thanks,
jirka
Oleg Nesterov Sept. 6, 2024, 7:18 p.m. UTC | #50
On 09/06, Jiri Olsa wrote:
>
> On Mon, Sep 02, 2024 at 03:22:25AM +0800, Tianyi Liu wrote:
> >
> > For now, please forget the original patch as we need a new solution ;)
>
> hi,
> any chance we could go with your fix until we find better solution?

Well, as I said from the very beginning I won't really argue even if
I obviously don't like this change very much. As long as the changelog /
comments clearly explain this change. I understand that sometimes an
ugly/incomplete/whatever workaround is better than nothing.

> it's simple and it fixes most of the cases for return uprobe pid filter
> for events with bpf programs..

But as a reminder, it doesn't even fix all the filtering problems with uprobes,
let alone uretprobes,

> I know during the discussion we found
> that standard perf record path won't work if there's bpf program
> attached on the same event,

Ah. Yes, this is another problem I tried to point out. But if we discuss
the filtering we can forget about /usr/bin/perf.

Again, again, again, I know nothing about bpf. But it seems to me that
perf_event_attach_bpf_prog() allows attaching up to BPF_TRACE_MAX_PROGS
progs to event->tp_event->prog_array, and then bpf_prog_run_array_uprobe()
should run them all. Right?

So I think that if you run 2 instances of run_prog from my last test-case
with $PID1 and $PID2, the filtering will be broken again. Both instances
will share the same trace_event_call and the same trace_uprobe_filter.

> and also it's not a common use case

OK.

And btw... Can bpftrace attach to the uprobe tp?

	# perf probe -x ./test -a func
	Added new event:
	  probe_test:func      (on func in /root/TTT/test)

	You can now use it in all perf tools, such as:

		perf record -e probe_test:func -aR sleep 1

	# bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
	Attaching 1 probe...
	ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
	ERROR: Error attaching probe: tracepoint:probe_test:func

Oleg.
Tianyi Liu Sept. 7, 2024, 7:19 p.m. UTC | #51
On Mon, Sep 06, 2024 at 18:43:00AM +0800, Jiri Olsa wrote:

SNIP

> > > ---------------------------------------------------------------------------
> > > Now lets return to BPF and this particular problem. I won't really argue
> > > with this patch, but
> > > 
> > > 	- Please change the subject and update the changelog,
> > > 	  "Fixes: c1ae5c75e103" and the whole reasoning is misleading
> > > 	  and wrong, IMO.
> > > 
> > > 	- This patch won't fix all problems because uprobe_perf_filter()
> > > 	  filters by mm, not by pid. The changelog/patch assumes that it
> > > 	  is a "PID filter", but it is not.
> > > 
> > > 	  See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
> > > 	  If the traced process does clone(CLONE_VM), bpftrace will hit the
> > > 	  similar problem, with uprobe or uretprobe.
> > > 
> > > 	- So I still think that the "right" fix should change the
> > > 	  bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> > > 	  about bpf.
> > 
> > I agree that this patch does not address the issue correctly. 
> > The PID filter should be implemented within bpf_prog_run_array_uprobe, 
> > or alternatively, bpf_prog_run_array_uprobe should be called after 
> > perf_tp_event_match to reuse the filtering mechanism provided by perf.
> > 
> > Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe.
> > 
> > For now, please forget the original patch as we need a new solution ;)
> 
> hi,
> any chance we could go with your fix until we find better solution?
> 
> it's simple and it fixes most of the cases for return uprobe pid filter
> for events with bpf programs.. I know during the discussion we found
> that standard perf record path won't work if there's bpf program
> attached on the same event, but I think that likely it needs more
> changes and also it's not a common use case
> 
> would you consider sending another version addressing Oleg's points
> for changelog above?

My pleasure, I'll resend the updated patch in a new thread.

Based on previous discussions, `uprobe_perf_filter` acts as a preliminary
filter that removes breakpoints when they are no longer needed.
More complex filtering mechanisms related to perf are implemented in
perf-specific paths.

From my understanding, the original patch attempted to partially implement
UPROBE_HANDLER_REMOVE (since it didn't actually remove the breakpoint but
only prevented the event from reaching the BPF-related code).
I'm trying to provide a complete implementation, i.e., removing the
breakpoint when `uprobe_perf_filter` returns false, similar to how the uprobe
side works. However, this would require merging the following functions,
because they will almost be the same:

uprobe_perf_func / uretprobe_perf_func
uprobe_dispatcher / uretprobe_dispatcher
handler_chain / handle_uretprobe_chain

I'm not sure why the paths of uprobe and uretprobe are entirely different.
I suspect that uretprobe might have been implemented later than uprobe and
was only partially implemented.

Oleg, do you have more insights on this?
In your opinion, does uretprobe need UPROBE_HANDLER_REMOVE?
If so, is it a good idea to merge the paths of uprobe and uretprobe?

Regardless, if we only need a temporary and incomplete fix, I will modify
only the commit message according to Oleg's suggestions and resend it.

I'm aware that using `uprobe_perf_filter` in `uretprobe_perf_func` is not
the solution for BPF filtering. I'm just trying to alleviate the issue
in some simple cases.

Sorry for the late reply as I spent a long time looking at the details
discussed above.

Thanks,
Oleg Nesterov Sept. 8, 2024, 1:15 p.m. UTC | #52
On 09/08, Tianyi Liu wrote:
>
> On Mon, Sep 06, 2024 at 18:43:00AM +0800, Jiri Olsa wrote:
>
> > would you consider sending another version addressing Oleg's points
> > for changelog above?
>
> My pleasure, I'll resend the updated patch in a new thread.
>
> Based on previous discussions, `uprobe_perf_filter` acts as a preliminary
> filter that removes breakpoints when they are no longer needed.

Well. Not only. See the usage of consumer_filter() and filter_chain() in
register_for_each_vma().

> More complex filtering mechanisms related to perf are implemented in
> perf-specific paths.

The perf paths in __uprobe_perf_func() do the filtering based on
perf_event->hw.target, that is all.

But uprobe_perf_filter() or any other consumer->filter() simply can't rely
on pid/task, it has to check ->mm.

> From my understanding, the original patch attempted to partially implement
> UPROBE_HANDLER_REMOVE (since it didn't actually remove the breakpoint but
> only prevented it from entering the BPF-related code).

Confused...

Your patch can help bpftrace although it (or any other change in
trace_uprobe.c) can't actually fix all the problems with bpf/filtering
even if we forget about ret-probes.

And I don't understand how this relates to UPROBE_HANDLER_REMOVE...

> I'm trying to provide a complete implementation, i.e., removing the
> breakpoint when `uprobe_perf_filter` returns false, similar to how uprobe
> functions. However, this would require merging the following functions,
> because they will almost be the same:
>
> uprobe_perf_func / uretprobe_perf_func
> uprobe_dispatcher / uretprobe_dispatcher
> handler_chain / handle_uretprobe_chain

Sorry, I don't understand... Yes, uprobe_dispatcher and uretprobe_dispatcher
can share more code or even unified, but

> I suspect that uretprobe might have been implemented later than uprobe

Yes,

> and was only partially implemented.

what do you mean?

But whatever you meant, I agree that this code doesn't look pretty and can
be cleaned up.

> In your opinion, does uretprobe need UPROBE_HANDLER_REMOVE?

Probably. But this has absolutely nothing to do with the filtering problem?
Can we discuss this separately?

> I'm aware that using `uprobe_perf_filter` in `uretprobe_perf_func` is not
> the solution for BPF filtering. I'm just trying to alleviate the issue
> in some simple cases.

Agreed.

-------------------------------------------------------------------------------
To summarise.

This code is very old, and it was written for /usr/bin/perf which attaches
to the tracepoint. So multiple instances of perf-record will share the same
consumer/trace_event_call/filter. uretprobe_perf_func() doesn't call
uprobe_perf_filter() because (if /usr/bin/perf is the only user) in the likely
case it would burn CPU and return true. Quite possibly this design was not
optimal from the very beginning, I simply can't recall why the is_ret_probe()
consumer has ->handler != NULL, but it was not buggy.

Now we have bpf, create_local_trace_uprobe(), etc. So lets add another
uprobe_perf_filter() into uretprobe_perf_func() as your patch did.

Then we can probably change uprobe_handle_trampoline() to do
unapply_uprobe() if all the ret-handlers return UPROBE_HANDLER_REMOVE, like
handler_chain() does.

Then we can probably clean up/simplify trace_uprobe.c, in particular we can
change alloc_trace_uprobe()

-	tu->consumer.handler = uprobe_dispatcher;
-	if (is_ret)
-		tu->consumer.ret_handler = uretprobe_dispatcher;
+	if (is_ret)
+		tu->consumer.ret_handler = uretprobe_dispatcher;
+	else
+		tu->consumer.handler = uprobe_dispatcher;

and do more (including unrelated) cleanups.

But lets do this step-by-step.

And lets not mix the filtering issues with the UPROBE_HANDLER_REMOVE logic,
to me this adds the unnecessary confusion.

Oleg.
Andrii Nakryiko Sept. 9, 2024, 1:16 a.m. UTC | #53
On Sun, Sep 8, 2024 at 6:15 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/08, Tianyi Liu wrote:
> >
> > On Mon, Sep 06, 2024 at 18:43:00AM +0800, Jiri Olsa wrote:
> >
> > > would you consider sending another version addressing Oleg's points
> > > for changelog above?
> >
> > My pleasure, I'll resend the updated patch in a new thread.
> >
> > Based on previous discussions, `uprobe_perf_filter` acts as a preliminary
> > filter that removes breakpoints when they are no longer needed.
>
> Well. Not only. See the usage of consumer_filter() and filter_chain() in
> register_for_each_vma().
>
> > More complex filtering mechanisms related to perf are implemented in
> > perf-specific paths.
>
> The perf paths in __uprobe_perf_func() do the filtering based on
> perf_event->hw.target, that is all.
>
> But uprobe_perf_filter() or any other consumer->filter() simply can't rely
> on pid/task, it has to check ->mm.
>
> > From my understanding, the original patch attempted to partially implement
> > UPROBE_HANDLER_REMOVE (since it didn't actually remove the breakpoint but
> > only prevented it from entering the BPF-related code).
>
> Confused...
>
> Your patch can help bpftrace although it (or any other change in
> trace_uprobe.c) can't actually fix all the problems with bpf/filtering
> even if we forget about ret-probes.
>
> And I don't understand how this relates to UPROBE_HANDLER_REMOVE...
>
> > I'm trying to provide a complete implementation, i.e., removing the
> > breakpoint when `uprobe_perf_filter` returns false, similar to how uprobe
> > functions. However, this would require merging the following functions,
> > because they will almost be the same:
> >
> > uprobe_perf_func / uretprobe_perf_func
> > uprobe_dispatcher / uretprobe_dispatcher
> > handler_chain / handle_uretprobe_chain
>
> Sorry, I don't understand... Yes, uprobe_dispatcher and uretprobe_dispatcher
> can share more code or even unified, but
>
> > I suspect that uretprobe might have been implemented later than uprobe
>
> Yes,
>
> > and was only partially implemented.
>
> what do you mean?
>
> But whatever you meant, I agree that this code doesn't look pretty and can
> be cleaned up.
>
> > In your opinion, does uretprobe need UPROBE_HANDLER_REMOVE?
>
> Probably. But this has absolutely nothing to do with the filtering problem?
> Can we discuss this separately?
>
> > I'm aware that using `uprobe_perf_filter` in `uretprobe_perf_func` is not
> > the solution for BPF filtering. I'm just trying to alleviate the issue
> > in some simple cases.
>
> Agreed.
>
> -------------------------------------------------------------------------------
> To summarise.
>
> This code is very old, and it was written for /usr/bin/perf which attaches
> to the tracepoint. So multiple instances of perf-record will share the same
> consumer/trace_event_call/filter. uretprobe_perf_func() doesn't call
> uprobe_perf_filter() because (if /usr/bin/perf is the only user) in the likely
> case it would burn CPU and return true. Quite possibly this design was not
> optimal from the very beginning, I simply can't recall why the is_ret_probe()
> consumer has ->handler != NULL, but it was not buggy.
>
> Now we have bpf, create_local_trace_uprobe(), etc. So lets add another
> uprobe_perf_filter() into uretprobe_perf_func() as your patch did.
>
> Then we can probably change uprobe_handle_trampoline() to do
> unapply_uprobe() if all the ret-handlers return UPROBE_HANDLER_REMOVE, like
> handler_chain() does.
>
> Then we can probably clean up/simplify trace_uprobe.c, in particular we can
> change alloc_trace_uprobe()
>
> -       tu->consumer.handler = uprobe_dispatcher;
> -       if (is_ret)
> -               tu->consumer.ret_handler = uretprobe_dispatcher;
> +       if (is_ret)
> +               tu->consumer.ret_handler = uretprobe_dispatcher;
> +       else
> +               tu->consumer.handler = uprobe_dispatcher;
>
> and do more (including unrelated) cleanups.
>
> But lets do this step-by-step.
>
> And lets not mix the filtering issues with the UPROBE_HANDLER_REMOVE logic,
> to me this adds the unnecessary confusion.

+100 to all of the above. This has been a very long and winding
thread, but uretprobes are still being triggered unnecessarily :)
Let's stop the bleeding and work on improving all of the other
issues in separate efforts.

>
> Oleg.
>
Jiri Olsa Sept. 9, 2024, 10:41 a.m. UTC | #54
On Fri, Sep 06, 2024 at 09:18:15PM +0200, Oleg Nesterov wrote:
> On 09/06, Jiri Olsa wrote:
> >
> > On Mon, Sep 02, 2024 at 03:22:25AM +0800, Tianyi Liu wrote:
> > >
> > > For now, please forget the original patch as we need a new solution ;)
> >
> > hi,
> > any chance we could go with your fix until we find better solution?
> 
> Well, as I said from the very beginning I won't really argue even if
> I obviously don't like this change very much. As long as the changelog /
> comments clearly explain this change. I understand that sometimes an
> ugly/incomplete/whatever workaround is better than nothing.
> 
> > it's simple and it fixes most of the cases for return uprobe pid filter
> > for events with bpf programs..
> 
> But to remind it doesn't even fixes all the filtering problems with uprobes,
> not uretprobes,
> 
> > I know during the discussion we found
> > that standard perf record path won't work if there's bpf program
> > attached on the same event,
> 
> Ah. Yes, this is another problem I tried to point out. But if we discuss
> the filtering we can forget about /usr/bin/perf.
> 
> Again, again, again, I know nothing about bpf. But it seems to me that
> perf_event_attach_bpf_prog() allows attaching up to BPF_TRACE_MAX_PROGS
> progs to event->tp_event->prog_array, and then bpf_prog_run_array_uprobe()
> should run them all. Right?
> 
> So I think that if you run 2 instances of run_prog from my last test-case
> with $PID1 and $PID2, the filtering will be broken again. Both instances
> will share the same trace_event_call and the same trace_uprobe_filter.
> 
> > and also it's not a common use case
> 
> OK.
> 
> And btw... Can bpftrace attach to the uprobe tp?
> 
> 	# perf probe -x ./test -a func
> 	Added new event:
> 	  probe_test:func      (on func in /root/TTT/test)
> 
> 	You can now use it in all perf tools, such as:
> 
> 		perf record -e probe_test:func -aR sleep 1
> 
> 	# bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
> 	Attaching 1 probe...
> 	ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
> 	ERROR: Error attaching probe: tracepoint:probe_test:func

the problem here is that bpftrace assumes the BPF_PROG_TYPE_TRACEPOINT type
for the bpf program, but that will fail in perf_event_set_bpf_prog, where
the perf event will be identified as a uprobe, which demands the bpf program
type to be BPF_PROG_TYPE_KPROBE

there's same issue with kprobe as well:

	# perf probe -a ksys_read
	Added new event:
	  probe:ksys_read      (on ksys_read)

	You can now use it in all perf tools, such as:

		perf record -e probe:ksys_read -aR sleep 1


	# bpftrace -e 'tracepoint:probe:ksys_read { printf("%d\n", pid); }'
	Attaching 1 probe...
	ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
	ERROR: Error attaching probe: tracepoint:probe:ksys_read

I'm not sure there's an easy way to filter these events from bpftrace -l
output (or change the program type accordingly), because I don't think
there's a way to find out the tracepoint subtype (kprobe/uprobe) from
the tracefs record

jirka
Oleg Nesterov Sept. 9, 2024, 6:34 p.m. UTC | #55
On 09/09, Jiri Olsa wrote:
>
> On Fri, Sep 06, 2024 at 09:18:15PM +0200, Oleg Nesterov wrote:
> >
> > And btw... Can bpftrace attach to the uprobe tp?
> >
> > 	# perf probe -x ./test -a func
> > 	Added new event:
> > 	  probe_test:func      (on func in /root/TTT/test)
> >
> > 	You can now use it in all perf tools, such as:
> >
> > 		perf record -e probe_test:func -aR sleep 1
> >
> > 	# bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
> > 	Attaching 1 probe...
> > 	ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
> > 	ERROR: Error attaching probe: tracepoint:probe_test:func
>
> the problem here is that bpftrace assumes BPF_PROG_TYPE_TRACEPOINT type
> for bpf program, but that will fail in perf_event_set_bpf_prog where
> perf event will be identified as uprobe and demands bpf program type
> to be BPF_PROG_TYPE_KPROBE

Yes, thanks, I know,

> I don't think
> there's a way to find out the tracepoint subtype (kprobe/uprobe) from
> the tracefs record

Hmm, indeed. It seems that it is not possible to derive tp_event->flags
from tracefs...

Perhaps bpftrace could look for probe_test:func in [uk]probe_events?
Or simply retry ioctl(PERF_EVENT_IOC_SET_BPF) with BPF_PROG_TYPE_KPROBE
if BPF_PROG_TYPE_TRACEPOINT returns EINVAL? Ugly, yes.

Oleg.
Jiri Olsa Sept. 10, 2024, 8:45 a.m. UTC | #56
On Mon, Sep 09, 2024 at 08:34:36PM +0200, Oleg Nesterov wrote:
> On 09/09, Jiri Olsa wrote:
> >
> > On Fri, Sep 06, 2024 at 09:18:15PM +0200, Oleg Nesterov wrote:
> > >
> > > And btw... Can bpftrace attach to the uprobe tp?
> > >
> > > 	# perf probe -x ./test -a func
> > > 	Added new event:
> > > 	  probe_test:func      (on func in /root/TTT/test)
> > >
> > > 	You can now use it in all perf tools, such as:
> > >
> > > 		perf record -e probe_test:func -aR sleep 1
> > >
> > > 	# bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
> > > 	Attaching 1 probe...
> > > 	ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
> > > 	ERROR: Error attaching probe: tracepoint:probe_test:func
> >
> > the problem here is that bpftrace assumes BPF_PROG_TYPE_TRACEPOINT type
> > for bpf program, but that will fail in perf_event_set_bpf_prog where
> > perf event will be identified as uprobe and demands bpf program type
> > to be BPF_PROG_TYPE_KPROBE
> 
> Yes, thanks, I know,
> 
> > I don't think
> > there's a way to find out the tracepoint subtype (kprobe/uprobe) from
> > the tracefs record
> 
> Hmm, indeed. It seems that it is not possible to derive tp_event->flags
> from tracefs...
> 
> Perhaps bpftrace could look for probe_test:func in [uk]probe_events?
> Or simply retry ioctl(PERF_EVENT_IOC_SET_BPF) with BPF_PROG_TYPE_KPROBE
> if BPF_PROG_TYPE_TRACEPOINT returns EINVAL? Ugly, yes.

yep, but it will probably work. I created issue https://github.com/bpftrace/bpftrace/issues/3447
so it's not lost

jirka
diff mbox series

Patch

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..c7e2a0962928 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1443,6 +1443,9 @@  static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
 				struct pt_regs *regs,
 				struct uprobe_cpu_buffer **ucbp)
 {
+	if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
+		return;
+
 	__uprobe_perf_func(tu, func, regs, ucbp);
 }