| Message ID | ME0P300MB0416034322B9915ECD3888649D882@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM |
|---|---|
| State | Changes Requested |
| Delegated to: | Masami Hiramatsu |
| Series | [v2] tracing/uprobe: Add missing PID filter for uretprobe |
On Fri, 23 Aug 2024 21:53:00 +0800 Tianyi Liu <i.pear@outlook.com> wrote:

> U(ret)probes are designed to be filterable by PID, which is the
> second parameter of the perf_event_open syscall. Currently, uprobe works
> well with this filtering, but uretprobe is not affected by it. This often
> leads to users being disturbed by events from processes they are not
> interested in while using uretprobe.
>
> We found that the filter function was not invoked when uretprobe was
> initially implemented, and this has gone unnoticed for ten years. We have
> tested the patch under our workload, binding eBPF programs to uretprobe
> tracepoints, and confirmed that it resolved our problem.

Is this an eBPF-related problem? It seems perf record should also be
affected. Let me try.

> Following are the steps to reproduce the issue:
>
> Step 1. Compile the following reproducer program:
> ```
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int main() {
>         printf("pid: %d\n", getpid());
>         while (1) {
>                 sleep(2);
>                 void *ptr = malloc(1024);
>                 free(ptr);
>         }
> }
> ```
> We will then use uretprobe to trace the `malloc` function.

OK, and run perf probe to add an event on malloc's return.

$ sudo ~/bin/perf probe -x ./malloc-run --add malloc%return
Added new event:
  probe_malloc:malloc__return (on malloc%return in /home/mhiramat/ksrc/linux/malloc-run)

You can now use it in all perf tools, such as:

        perf record -e probe_malloc:malloc__return -aR sleep 1

> Step 2. Run two instances of the reproducer program and record their PIDs.

$ ./malloc-run & ./malloc-run &
[1] 93927
[2] 93928
pid: 93927
pid: 93928

And trace one of them;

$ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]

And dump the data;

$ sudo ~/bin/perf script
malloc-run 93928 [004] 351736.730649: raw_syscalls:sys_exit: NR 230 = 0
malloc-run 93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
malloc-run 93928 [004] 351736.730696: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
malloc-run 93928 [004] 351738.730857: raw_syscalls:sys_exit: NR 230 = 0
malloc-run 93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
malloc-run 93928 [004] 351738.730883: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
malloc-run 93928 [004] 351740.731110: raw_syscalls:sys_exit: NR 230 = 0
malloc-run 93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
malloc-run 93928 [004] 351740.731127: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,

Hmm, it seems to trace only one pid's data (without this change).
If this changes eBPF behavior, I would like to involve eBPF people to ask
whether this is OK. From the viewpoint of the perf tool, the current code
works.

But I agree that the current code is a bit strange. Oleg, do you know
anything?

Thank you,

> Step 3. Use uretprobe to trace each of the two running reproducers
> separately. We use bpftrace to make it easier to reproduce. Please run two
> instances of bpftrace simultaneously: the first instance filters events
> from PID1, and the second instance filters events from PID2.
>
> The expected behavior is that each bpftrace instance would only print
> events matching its respective PID filter. However, in practice, both
> bpftrace instances receive events from both processes; the PID filter is
> ineffective:
>
> Before:
> ```
> PID1=55256
> bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=55256
> time=2 pid=55273
> time=2 pid=55256
> time=4 pid=55273
> time=4 pid=55256
> time=6 pid=55273
> time=6 pid=55256
>
> PID2=55273
> bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=55273
> time=0 pid=55256
> time=2 pid=55273
> time=2 pid=55256
> time=4 pid=55273
> time=4 pid=55256
> time=6 pid=55273
> time=6 pid=55256
> ```
>
> After: Both bpftrace instances will show the expected behavior, printing
> only events from the PID specified by their respective filters:
> ```
> PID1=1621
> bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=1621
> time=2 pid=1621
> time=4 pid=1621
> time=6 pid=1621
>
> PID2=1633
> bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> Attaching 1 probe...
> time=0 pid=1633
> time=2 pid=1633
> time=4 pid=1633
> time=6 pid=1633
> ```
>
> Fixes: c1ae5c75e103 ("uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()")
> Cc: Alban Crequy <albancrequy@linux.microsoft.com>
> Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
> Signed-off-by: Tianyi Liu <i.pear@outlook.com>
> ---
> Changes in v2:
> - Drop cover letter and update commit message.
> - Link to v1: https://lore.kernel.org/linux-trace-kernel/ME0P300MB04166144CDF92A72B9E1BAEA9D8F2@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM/
> ---
>  kernel/trace/trace_uprobe.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index c98e3b3386ba..c7e2a0962928 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1443,6 +1443,9 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
>                                  struct pt_regs *regs,
>                                  struct uprobe_cpu_buffer **ucbp)
>  {
> +        if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
> +                return;
> +
>          __uprobe_perf_func(tu, func, regs, ucbp);
>  }
>
> --
> 2.34.1
adding Jiri and bpftrace folks

On Fri, Aug 23, 2024 at 10:44 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
>
> On Fri, 23 Aug 2024 21:53:00 +0800
> Tianyi Liu <i.pear@outlook.com> wrote:
>
> > U(ret)probes are designed to be filterable by PID, which is the
> > second parameter of the perf_event_open syscall. Currently, uprobe works
> > well with this filtering, but uretprobe is not affected by it. This often
> > leads to users being disturbed by events from processes they are not
> > interested in while using uretprobe.
> >
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has gone unnoticed for ten years. We have
> > tested the patch under our workload, binding eBPF programs to uretprobe
> > tracepoints, and confirmed that it resolved our problem.
>
> Is this an eBPF-related problem? It seems perf record should also be
> affected. Let me try.
>
> > Following are the steps to reproduce the issue:
> >
> > Step 1. Compile the following reproducer program:
> > ```
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> >
> > int main() {
> >         printf("pid: %d\n", getpid());
> >         while (1) {
> >                 sleep(2);
> >                 void *ptr = malloc(1024);
> >                 free(ptr);
> >         }
> > }
> > ```
> > We will then use uretprobe to trace the `malloc` function.
>
> OK, and run perf probe to add an event on malloc's return.
>
> $ sudo ~/bin/perf probe -x ./malloc-run --add malloc%return
> Added new event:
>   probe_malloc:malloc__return (on malloc%return in /home/mhiramat/ksrc/linux/malloc-run)
>
> You can now use it in all perf tools, such as:
>
>         perf record -e probe_malloc:malloc__return -aR sleep 1
>
> > Step 2. Run two instances of the reproducer program and record their PIDs.
>
> $ ./malloc-run & ./malloc-run &
> [1] 93927
> [2] 93928
> pid: 93927
> pid: 93928
>
> And trace one of them;
>
> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
>
> And dump the data;
>
> $ sudo ~/bin/perf script
> malloc-run 93928 [004] 351736.730649: raw_syscalls:sys_exit: NR 230 = 0
> malloc-run 93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> malloc-run 93928 [004] 351736.730696: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> malloc-run 93928 [004] 351738.730857: raw_syscalls:sys_exit: NR 230 = 0
> malloc-run 93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> malloc-run 93928 [004] 351738.730883: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> malloc-run 93928 [004] 351740.731110: raw_syscalls:sys_exit: NR 230 = 0
> malloc-run 93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> malloc-run 93928 [004] 351740.731127: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>
> Hmm, it seems to trace only one pid's data (without this change).
> If this changes eBPF behavior, I would like to involve eBPF people to ask
> whether this is OK. From the viewpoint of the perf tool, the current code
> works.
>
> But I agree that the current code is a bit strange. Oleg, do you know
> anything?
>
> Thank you,
>
> > Step 3. Use uretprobe to trace each of the two running reproducers
> > separately. We use bpftrace to make it easier to reproduce. Please run two
> > instances of bpftrace simultaneously: the first instance filters events
> > from PID1, and the second instance filters events from PID2.
> >
> > The expected behavior is that each bpftrace instance would only print
> > events matching its respective PID filter. However, in practice, both
> > bpftrace instances receive events from both processes; the PID filter is
> > ineffective:
> >
> > Before:
> > ```
> > PID1=55256
> > bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=55256
> > time=2 pid=55273
> > time=2 pid=55256
> > time=4 pid=55273
> > time=4 pid=55256
> > time=6 pid=55273
> > time=6 pid=55256
> >
> > PID2=55273
> > bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=55273
> > time=0 pid=55256
> > time=2 pid=55273
> > time=2 pid=55256
> > time=4 pid=55273
> > time=4 pid=55256
> > time=6 pid=55273
> > time=6 pid=55256
> > ```

This is a bit confusing, because even if the kernel-side uretprobe
handler doesn't do the filtering by itself, the uprobe subsystem shouldn't
install breakpoints on processes which don't have a uretprobe requested
for them (unless I'm missing something, of course).

It still needs to be fixed like you do in your patch, though. Even
more, we probably need similar UPROBE_HANDLER_REMOVE handling in
handle_uretprobe_chain() to clean up breakpoints for processes which
don't have a uretprobe attached anymore (but I think that's a separate
follow-up).

Anyway, I think fixing this won't affect BPF, in the sense that this is
clearly a bug and shouldn't happen. Furthermore, the
multi-uprobe/multi-uretprobe implementation already filters this out
correctly, so I don't think anyone does (nor should) rely on this buggy
behavior.

> > After: Both bpftrace instances will show the expected behavior, printing
> > only events from the PID specified by their respective filters:
> > ```
> > PID1=1621
> > bpftrace -p $PID1 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=1621
> > time=2 pid=1621
> > time=4 pid=1621
> > time=6 pid=1621
> >
> > PID2=1633
> > bpftrace -p $PID2 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
> > Attaching 1 probe...
> > time=0 pid=1633
> > time=2 pid=1633
> > time=4 pid=1633
> > time=6 pid=1633
> > ```
> >
> > Fixes: c1ae5c75e103 ("uprobes/tracing: Introduce is_ret_probe() and uretprobe_dispatcher()")
> > Cc: Alban Crequy <albancrequy@linux.microsoft.com>
> > Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
> > Signed-off-by: Tianyi Liu <i.pear@outlook.com>
> > ---
> > Changes in v2:
> > - Drop cover letter and update commit message.
> > - Link to v1: https://lore.kernel.org/linux-trace-kernel/ME0P300MB04166144CDF92A72B9E1BAEA9D8F2@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM/
> > ---
> >  kernel/trace/trace_uprobe.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index c98e3b3386ba..c7e2a0962928 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1443,6 +1443,9 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
> >                                  struct pt_regs *regs,
> >                                  struct uprobe_cpu_buffer **ucbp)
> >  {
> > +        if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
> > +                return;
> > +
> >          __uprobe_perf_func(tu, func, regs, ucbp);
> >  }
> >
> > --
> > 2.34.1
>
> --
> Masami Hiramatsu (Google) <mhiramat@kernel.org>
Hi Masami and Andrii:

I would like to share more information and ideas, but I may be wrong.

> > U(ret)probes are designed to be filterable by PID, which is the
> > second parameter of the perf_event_open syscall. Currently, uprobe works
> > well with this filtering, but uretprobe is not affected by it. This often
> > leads to users being disturbed by events from processes they are not
> > interested in while using uretprobe.
> >
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has gone unnoticed for ten years. We have
> > tested the patch under our workload, binding eBPF programs to uretprobe
> > tracepoints, and confirmed that it resolved our problem.
>
> Is this an eBPF-related problem? It seems perf record should also be
> affected. Let me try.

I guess it is a general issue and is not specific to BPF, because
the BPF handler is only an event "consumer".

> And trace one of them;
>
> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928

A key trigger here is that the two tracer instances (either `bpftrace` or
`perf record`) must be running *simultaneously*. One of them should use
PID1 as its filter, while the other uses PID2.

I think the reason why tracing only PID1 fails to trigger the bug is that
uprobe uses a form of copy-on-write to create independent .text pages for
the traced process. For example, if only PID1 is being traced, then only
the libc.so mapped by PID1 is patched. Other processes continue to use the
unpatched original libc.so, so the event isn't triggered. In my
reproduction example, the two bpftrace instances must be running at the
same moment.

> This is a bit confusing, because even if the kernel-side uretprobe
> handler doesn't do the filtering by itself, the uprobe subsystem shouldn't
> install breakpoints on processes which don't have a uretprobe requested
> for them (unless I'm missing something, of course).

There are two tracers, one attached to PID1 and the other attached
to PID2, as described above.

> It still needs to be fixed like you do in your patch, though. Even
> more, we probably need similar UPROBE_HANDLER_REMOVE handling in
> handle_uretprobe_chain() to clean up breakpoints for processes which
> don't have a uretprobe attached anymore (but I think that's a separate
> follow-up).

Agreed; the implementation of uretprobe should be almost the same as
uprobe, but it seems uretprobe was overlooked in previous modifications.

> $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928
> ^C[ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
>
> And dump the data;
>
> $ sudo ~/bin/perf script
> malloc-run 93928 [004] 351736.730649: raw_syscalls:sys_exit: NR 230 = 0
> malloc-run 93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> malloc-run 93928 [004] 351736.730696: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> malloc-run 93928 [004] 351738.730857: raw_syscalls:sys_exit: NR 230 = 0
> malloc-run 93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> malloc-run 93928 [004] 351738.730883: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> malloc-run 93928 [004] 351740.731110: raw_syscalls:sys_exit: NR 230 = 0
> malloc-run 93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> malloc-run 93928 [004] 351740.731127: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
>
> Hmm, it seems to trace only one pid's data (without this change).
> If this changes eBPF behavior, I would like to involve eBPF people to ask
> whether this is OK. From the viewpoint of the perf tool, the current code
> works.

I tried this and also couldn't reproduce the bug. Even when running two
perf instances simultaneously, `perf record` (or perhaps `perf trace` for
convenience) only outputs events from the corresponding PID, as expected.
I initially suspected that perf might be applying another filter in user
space, but that doesn't seem to be the case. I'll need to do further
debugging, which may take some time.

I also tried combining bpftrace with `perf trace`. Specifically, I used
`perf trace` for PID1 and bpftrace for PID2. `perf trace` still only
outputs events from PID1, but bpftrace prints events from both PIDs.
I'm not yet sure why this happens.

Thanks so much,
On Sat, 24 Aug 2024 13:49:26 +0800 Tianyi Liu <i.pear@outlook.com> wrote:

> Hi Masami and Andrii:
>
> I would like to share more information and ideas, but I may be wrong.
>
> > > U(ret)probes are designed to be filterable by PID, which is the
> > > second parameter of the perf_event_open syscall. Currently, uprobe works
> > > well with this filtering, but uretprobe is not affected by it. This often
> > > leads to users being disturbed by events from processes they are not
> > > interested in while using uretprobe.
> > >
> > > We found that the filter function was not invoked when uretprobe was
> > > initially implemented, and this has gone unnoticed for ten years. We have
> > > tested the patch under our workload, binding eBPF programs to uretprobe
> > > tracepoints, and confirmed that it resolved our problem.
> >
> > Is this an eBPF-related problem? It seems perf record should also be
> > affected. Let me try.
>
> I guess it is a general issue and is not specific to BPF, because
> the BPF handler is only an event "consumer".
>
> > And trace one of them;
> >
> > $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928
>
> A key trigger here is that the two tracer instances (either `bpftrace` or
> `perf record`) must be running *simultaneously*. One of them should use
> PID1 as its filter, while the other uses PID2.

Even if I run `perf trace record` simultaneously, it filters the events
correctly. I just ran;

  sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93927 -o trace1.data -- sleep 50 &
  sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928 -o trace2.data -- sleep 50 &

And dumped trace1.data and trace2.data with `perf script`.

> I think the reason why tracing only PID1 fails to trigger the bug is that
> uprobe uses a form of copy-on-write to create independent .text pages for
> the traced process. For example, if only PID1 is being traced, then only
> the libc.so mapped by PID1 is patched. Other processes continue to use
> the unpatched original libc.so, so the event isn't triggered. In my
> reproduction example, the two bpftrace instances must be running at the
> same moment.
>
> > This is a bit confusing, because even if the kernel-side uretprobe
> > handler doesn't do the filtering by itself, the uprobe subsystem shouldn't
> > install breakpoints on processes which don't have a uretprobe requested
> > for them (unless I'm missing something, of course).
>
> There are two tracers, one attached to PID1 and the other attached
> to PID2, as described above.

Yeah, but perf works fine. Maybe perf only enables its ring buffer for the
specific process and reads from that specific ring buffer. (Note, perf has
a per-process ring buffer, IIRC.) How does bpftrace implement it?

> > It still needs to be fixed like you do in your patch, though. Even
> > more, we probably need similar UPROBE_HANDLER_REMOVE handling in
> > handle_uretprobe_chain() to clean up breakpoints for processes which
> > don't have a uretprobe attached anymore (but I think that's a separate
> > follow-up).
>
> Agreed; the implementation of uretprobe should be almost the same as
> uprobe, but it seems uretprobe was overlooked in previous modifications.

OK, I just want a confirmation from BPF people, because I could not
reproduce the issue with the perf tool.

> > $ sudo ~/bin/perf trace record -e probe_malloc:malloc__return -p 93928
> > ^C[ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.031 MB perf.data (9 samples) ]
> >
> > And dump the data;
> >
> > $ sudo ~/bin/perf script
> > malloc-run 93928 [004] 351736.730649: raw_syscalls:sys_exit: NR 230 = 0
> > malloc-run 93928 [004] 351736.730694: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> > malloc-run 93928 [004] 351736.730696: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> > malloc-run 93928 [004] 351738.730857: raw_syscalls:sys_exit: NR 230 = 0
> > malloc-run 93928 [004] 351738.730869: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> > malloc-run 93928 [004] 351738.730883: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> > malloc-run 93928 [004] 351740.731110: raw_syscalls:sys_exit: NR 230 = 0
> > malloc-run 93928 [004] 351740.731125: probe_malloc:malloc__return: (561cfdeb30c0 <- 561cfdeb3204)
> > malloc-run 93928 [004] 351740.731127: raw_syscalls:sys_enter: NR 230 (0, 0, 7ffc7a5c5380, 7ffc7a5c5380, 561d2940f6b0,
> >
> > Hmm, it seems to trace only one pid's data (without this change).
> > If this changes eBPF behavior, I would like to involve eBPF people to ask
> > whether this is OK. From the viewpoint of the perf tool, the current code
> > works.
>
> I tried this and also couldn't reproduce the bug. Even when running two
> perf instances simultaneously, `perf record` (or perhaps `perf trace` for
> convenience) only outputs events from the corresponding PID, as expected.
> I initially suspected that perf might be applying another filter in user
> space, but that doesn't seem to be the case. I'll need to do further
> debugging, which may take some time.
>
> I also tried combining bpftrace with `perf trace`. Specifically, I used
> `perf trace` for PID1 and bpftrace for PID2. `perf trace` still only
> outputs events from PID1, but bpftrace prints events from both PIDs.
> I'm not yet sure why this happens.

Yeah, if perf only reads the specific process's ring buffer, it should
work without this filter.

Thanks,

> Thanks so much,
On 08/23, Andrii Nakryiko wrote:
>
> This is a bit confusing, because even if the kernel-side uretprobe
> handler doesn't do the filtering by itself, the uprobe subsystem shouldn't
> install breakpoints on processes which don't have a uretprobe requested
> for them (unless I'm missing something, of course).

Yes, but in this case uprobe_register/apply will install breakpoints on
both tasks, with PID1 and PID2.

> It still needs to be fixed

Agreed,

> like you do in your patch,

Not sure...

> more, we probably need similar UPROBE_HANDLER_REMOVE handling in
> handle_uretprobe_chain() to clean up breakpoints for processes which
> don't have a uretprobe attached anymore (but I think that's a separate
> follow-up).

Probably yes... but yes, this is another issue.

Oleg.
On 08/24, Tianyi Liu wrote:
>
> A key trigger here is that the two tracer instances (either `bpftrace` or
> `perf record`) must be running *simultaneously*. One of them should use
> PID1 as its filter, while the other uses PID2.

Yes.

> Agreed; the implementation of uretprobe should be almost the same as
> uprobe, but it seems uretprobe was overlooked in previous modifications.

I forgot EVERYTHING about the code in kernel/trace/trace_uprobe.c, but at
first glance I am not sure that your patch is what we need... At least I
certainly disagree with "Fixes: c1ae5c75e103" ;)

uretprobe_perf_func/etc was designed for perf, and afaics this code still
works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
time.

BPF hacks/hooks were added later, so perhaps this should be fixed in the
bpf code, but I have no idea what bpftrace does...

Oleg.
On 08/25, Oleg Nesterov wrote:
>
> At least I certainly disagree with "Fixes: c1ae5c75e103" ;)
>
> uretprobe_perf_func/etc was designed for perf, and afaics this code still
> works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
> time.

Forgot to mention... And note that in this case
uprobe_perf_func()->uprobe_perf_filter() will never return false, and this
is correct.

Oleg.
On 08/25, Oleg Nesterov wrote:
>
> At least I certainly disagree with "Fixes: c1ae5c75e103" ;)
>
> uretprobe_perf_func/etc was designed for perf, and afaics this code still
> works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
> time.
>
> BPF hacks/hooks were added later, so perhaps this should be fixed in the
> bpf code, but I have no idea what bpftrace does...

And I can't install bpftrace on my old Fedora 23 working laptop ;) Yes,
yes, I know, I should upgrade it.

For the moment, please forget about ret-probes. Could you compile this
program

        #define _GNU_SOURCE
        #include <unistd.h>
        #include <sched.h>
        #include <signal.h>

        int func(int i)
        {
                return i;
        }

        int test(void *arg)
        {
                int i;
                for (i = 0;; ++i) {
                        sleep(1);
                        func(i);
                }
                return 0;
        }

        int main(void)
        {
                static char stack[65536];

                clone(test, stack + sizeof(stack)/2, CLONE_VM|SIGCHLD, NULL);
                test(NULL);

                return 0;
        }

and then do something like

        $ ./test &
        $ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'

I hope that the syntax of the 2nd command is correct...

I _think_ that it will print 2 pids too.

But "perf-record -p" works as expected.

Oleg.
On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> On 08/25, Oleg Nesterov wrote:
> >
> > At least I certainly disagree with "Fixes: c1ae5c75e103" ;)
> >
> > uretprobe_perf_func/etc was designed for perf, and afaics this code still
> > works fine even if you run 2 perf-record's with -p PID1/PID2 at the same
> > time.
> >
> > BPF hacks/hooks were added later, so perhaps this should be fixed in the
> > bpf code, but I have no idea what bpftrace does...
>
> And I can't install bpftrace on my old Fedora 23 working laptop ;) Yes,
> yes, I know, I should upgrade it.
>
> For the moment, please forget about ret-probes. Could you compile this
> program
>
>         #define _GNU_SOURCE
>         #include <unistd.h>
>         #include <sched.h>
>         #include <signal.h>
>
>         int func(int i)
>         {
>                 return i;
>         }
>
>         int test(void *arg)
>         {
>                 int i;
>                 for (i = 0;; ++i) {
>                         sleep(1);
>                         func(i);
>                 }
>                 return 0;
>         }
>
>         int main(void)
>         {
>                 static char stack[65536];
>
>                 clone(test, stack + sizeof(stack)/2, CLONE_VM|SIGCHLD, NULL);
>                 test(NULL);
>
>                 return 0;
>         }
>
> and then do something like
>
>         $ ./test &
>         $ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
>
> I hope that the syntax of the 2nd command is correct...
>
> I _think_ that it will print 2 pids too.

yes.. but with CLONE_VM both processes share 'mm', so they are threads,
and at least uprobe_multi filters by process [1] now.. ;-)

> But "perf-record -p" works as expected.

I wonder if it's because there's the perf layer that schedules each
uprobe event only when its process (PID1/2) is scheduled in, and it will
receive events only from that cpu while the process is running on it.

jirka

[1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic

> Oleg.
On 08/26, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> > $ ./test &
> > $ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> >
> > I hope that the syntax of the 2nd command is correct...
> >
> > I _think_ that it will print 2 pids too.
>
> yes.. but with CLONE_VM both processes share 'mm'

Yes, sure,

> so they are threads,

Well, this depends on the definition ;) but the CLONE_VM child is not a
sub-thread, it has another TGID. See below.

> and at least uprobe_multi filters by process [1] now.. ;-)

OK, if you say that this behaviour is fine I won't argue, I simply do not
know. But see below.

> > But "perf-record -p" works as expected.
>
> I wonder if it's because there's the perf layer that schedules each
> uprobe event only when its process (PID1/2) is scheduled in, and it
> will receive events only from that cpu while the process is running
> on it.

Not sure I understand... The task which hits the breakpoint is always
current, it is always scheduled in.

The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that
we want to avoid __uprobe_perf_func(), although this makes sense.

The main purpose is that we want to remove the breakpoints in current->mm
when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE.
IOW, the main purpose is to not penalise user-space unnecessarily.

IIUC (but I am not sure), perf-record -p will work "correctly" even if we
remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
filtering, but I forgot everything.

And this makes me think that perhaps BPF can't rely on
uprobe_perf_filter() either, even if we forget about ret-probes.

> [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic

Looks obviously wrong... get_pid_task(PIDTYPE_TGID) can return a zombie
leader with ->mm == NULL while other threads, and thus the whole process,
are still alive.

And again, the changelog says "the intent for PID filtering it to filter by
*process*", but clone(CLONE_VM) creates another process, not a thread.

So perhaps we need

        - if (link->task && current->mm != link->task->mm)
        + if (link->task && !same_thread_group(current, link->task))

in uprobe_prog_run() to make "filter by *process*" true, but this won't
fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().

Does bpftrace use bpf_uprobe_multi_link_attach/etc? I guess not...
Then which userspace tool uses this code? ;)

Oleg.
On 08/26, Oleg Nesterov wrote:
>
> Does bpftrace use bpf_uprobe_multi_link_attach/etc? I guess not...

Given that the patch from Tianyi makes the difference, bpftrace should
trigger the "#ifdef CONFIG_BPF_EVENTS" code in __uprobe_perf_func(), and
bpf_prog_run_array_uprobe() should return zero, correct?

> Then which userspace tool uses this code? ;)

Yes ;)

Oleg.
On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> On 08/26, Jiri Olsa wrote:
> >
> > On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote:
> > > $ ./test &
> > > $ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
> > >
> > > I hope that the syntax of the 2nd command is correct...
> > >
> > > I _think_ that it will print 2 pids too.
> >
> > yes.. but with CLONE_VM both processes share 'mm'
>
> Yes, sure,
>
> > so they are threads,
>
> Well, this depends on the definition ;) but the CLONE_VM child is not a
> sub-thread, it has another TGID. See below.
>
> > and at least uprobe_multi filters by process [1] now.. ;-)
>
> OK, if you say that this behaviour is fine I won't argue, I simply do
> not know. But see below.
>
> > > But "perf-record -p" works as expected.
> >
> > I wonder if it's because there's the perf layer that schedules each
> > uprobe event only when its process (PID1/2) is scheduled in, and it
> > will receive events only from that cpu while the process is running
> > on it.
>
> Not sure I understand... The task which hits the breakpoint is always
> current, it is always scheduled in.

hum, I might be missing something, but ;-) assuming we have 2 tasks, each
with a perf uprobe event assigned: in the perf path there's
uprobe_perf_func, which calls uprobe_perf_filter, and if that returns
true it then goes:

  uprobe_perf_func
    __uprobe_perf_func
      perf_trace_buf_submit
        perf_tp_event
        {
          hlist_for_each_entry_rcu(event, head, hlist_entry) {
                  if (perf_tp_event_match(event, &data, regs)) {
                          perf_swevent_event(event, count, &data, regs);
          }

so it stores the event only to the perf event which was added for the
task that is currently scheduled in; it is not stored to the other task's
event.

In comparison, in the uprobe_multi path, uprobe_multi_link_filter will
pass and then it goes:

  handler_chain
    uprobe_multi_link_handler
      bpf_prog_run  - executes the bpf program

so the bpf program gets executed for uprobes from both tasks.

> The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that
> we want to avoid __uprobe_perf_func(), although this makes sense.
>
> The main purpose is that we want to remove the breakpoints in current->mm
> when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE.
> IOW, the main purpose is to not penalise user-space unnecessarily.
>
> IIUC (but I am not sure), perf-record -p will work "correctly" even if we
> remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
> filtering, but I forgot everything.

I think that's what I tried to describe above.

> And this makes me think that perhaps BPF can't rely on
> uprobe_perf_filter() either, even if we forget about ret-probes.
>
> > [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic
>
> Looks obviously wrong... get_pid_task(PIDTYPE_TGID) can return a zombie
> leader with ->mm == NULL while other threads, and thus the whole process,
> are still alive.
>
> And again, the changelog says "the intent for PID filtering it to filter by
> *process*", but clone(CLONE_VM) creates another process, not a thread.
>
> So perhaps we need
>
>         - if (link->task && current->mm != link->task->mm)
>         + if (link->task && !same_thread_group(current, link->task))

that seems correct, need to check

> in uprobe_prog_run() to make "filter by *process*" true, but this won't
> fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().
>
> Does bpftrace use bpf_uprobe_multi_link_attach/etc? I guess not...
> Then which userspace tool uses this code? ;)

yes, it will trigger if you attach to multiple uprobes, like with your
test example with:

  # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }'

thanks,
jirka
Hi Oleg,

> For the moment, please forget about ret-probes. Could you compile this program
>
> 	#define _GNU_SOURCE
> 	#include <unistd.h>
> 	#include <sched.h>
> 	#include <signal.h>
>
> 	int func(int i)
> 	{
> 		return i;
> 	}
>
> 	int test(void *arg)
> 	{
> 		int i;
> 		for (i = 0;; ++i) {
> 			sleep(1);
> 			func(i);
> 		}
> 		return 0;
> 	}
>
> 	int main(void)
> 	{
> 		static char stack[65536];
>
> 		clone(test, stack + sizeof(stack)/2, CLONE_VM|SIGCHLD, NULL);
> 		test(NULL);
>
> 		return 0;
> 	}
>
> and then do something like
>
> 	$ ./test &
> 	$ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }'
>
> I hope that the syntax of the 2nd command is correct...
>
> I _think_ that it will print 2 pids too.
>
> But "perf-record -p" works as expected.

Yes, the output from bpftrace and perf matches what you described:

$ ./tester &
[1] 158592
$ perf probe -x tester --add func
Added new event:
  probe_tester:func    (on func in /root/test/tester)

$ bpftrace -p 158592 -e 'uprobe:./tester:func { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }'
Attaching 1 probe...
time=0 pid=158592
time=0 pid=158594
time=1 pid=158592
time=1 pid=158594
time=2 pid=158592
time=2 pid=158594
time=3 pid=158592
time=3 pid=158594

$ perf record -e probe_tester:func -p 158592 -o 158592
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.022 MB 158592 (19 samples) ]
$ perf script -i 158592
tester 158592 [006] 246475.295762: probe_tester:func: (55a6def14149)
tester 158592 [006] 246476.295828: probe_tester:func: (55a6def14149)
tester 158592 [006] 246477.295892: probe_tester:func: (55a6def14149)
tester 158592 [006] 246478.295958: probe_tester:func: (55a6def14149)
tester 158592 [006] 246479.296024: probe_tester:func: (55a6def14149)
tester 158592 [010] 246480.296202: probe_tester:func: (55a6def14149)
tester 158592 [010] 246481.296360: probe_tester:func: (55a6def14149)
[...]

Thanks,
On 08/26, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> > On 08/26, Jiri Olsa wrote:
> > >
> > > > But "perf-record -p" works as expected.
> > >
> > > I wonder it's because there's the perf layer that schedules each
> > > uprobe event only when its process (PID1/2) is scheduled in and will
> > > receive events only from that cpu while the process is running on it
> >
> > Not sure I understand... The task which hits the breakpoint is always
> > current, it is always scheduled in.
>
> hum, I might be missing something, but ;-)

No it is me ;) at least I certainly misunderstood your "scheduled in".

> assuming we have 2 tasks, each with perf uprobe event assigned
>
> in perf path there's uprobe_perf_func which calls the uprobe_perf_filter
> and if it returns true it then goes:
>
>   uprobe_perf_func
>     __uprobe_perf_func
>       perf_trace_buf_submit
>         perf_tp_event
>         {
>           hlist_for_each_entry_rcu(event, head, hlist_entry) {
>             if (perf_tp_event_match(event, &data, regs)) {
>               perf_swevent_event(event, count, &data, regs);

Aha. So, in this particular case, when the CLONE_VM child hits the bp
and calls uprobe_perf_func(), even perf_tp_event_match() won't be called,
head is hlist_empty(), right?

Thanks!

> in comparison with uprobe_multi path, where uprobe_multi_link_filter

Yeah, this is clear.

> > IIUC (but I am not sure), perf-record -p will work "correctly" even if we
> > remove uprobe_perf_filter() altogether. IIRC the perf layer does its own
> > filtering but I forgot everything.
>
> I think that's what I tried to describe above

Yes, thanks.

Oleg.
This is offtopic, sorry for the spam, but...

On 08/26, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> >
> > Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
> > Then which userspace tool uses this code? ;)
>
> yes, it will trigger if you attach to multiple uprobes, like with your
> test example with:
>
>   # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }'

Hmm. I reserved the testing machine with fedora 40 to play with bpftrace.

dummy.c:

	#include <unistd.h>

	void func1(void) {}
	void func2(void) {}

	int main(void) { for (;;) pause(); }

If I do

	# ./dummy &
	# bpftrace -p $! -e 'uprobe:./dummy:func* { printf("%d\n", pid); }'

and run

	# bpftrace -e 'kprobe:__uprobe_register { printf("%s\n", kstack); }'

on another console I get

	Attaching 1 probe...

	__uprobe_register+1
	probe_event_enable+399
	perf_trace_event_init+440
	perf_uprobe_init+152
	perf_uprobe_event_init+74
	perf_try_init_event+71
	perf_event_alloc+1681
	__do_sys_perf_event_open+447
	do_syscall_64+130
	entry_SYSCALL_64_after_hwframe+118

	__uprobe_register+1
	probe_event_enable+399
	perf_trace_event_init+440
	perf_uprobe_init+152
	perf_uprobe_event_init+74
	perf_try_init_event+71
	perf_event_alloc+1681
	__do_sys_perf_event_open+447
	do_syscall_64+130
	entry_SYSCALL_64_after_hwframe+118

so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach()
(called by sys_bpf(BPF_LINK_CREATE) ?) in this case.

But again, this is offtopic, please forget.

Oleg.
On Mon, Aug 26, 2024 at 11:25:53PM +0200, Oleg Nesterov wrote:
> This is offtopic, sorry for the spam, but...
>
> On 08/26, Jiri Olsa wrote:
> >
> > On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> > >
> > > Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not...
> > > Then which userspace tool uses this code? ;)
> >
> > yes, it will trigger if you attach to multiple uprobes, like with your
> > test example with:
> >
> >   # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }'
>
> Hmm. I reserved the testing machine with fedora 40 to play with bpftrace.
>
> dummy.c:
>
> 	#include <unistd.h>
>
> 	void func1(void) {}
> 	void func2(void) {}
>
> 	int main(void) { for (;;) pause(); }
>
> If I do
>
> 	# ./dummy &
> 	# bpftrace -p $! -e 'uprobe:./dummy:func* { printf("%d\n", pid); }'
>
> and run
>
> 	# bpftrace -e 'kprobe:__uprobe_register { printf("%s\n", kstack); }'

did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:

	[root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
	Attaching 1 probe...

	uprobe_register+1
	bpf_uprobe_multi_link_attach+685
	__sys_bpf+9395
	__x64_sys_bpf+26
	do_syscall_64+128
	entry_SYSCALL_64_after_hwframe+118

I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:

	[root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
	uprobe_multi: yes
	[root@qemu ex]# ../bpftrace/build/src/bpftrace --version
	bpftrace v0.20.0

jirka

> on another console I get
>
> Attaching 1 probe...
>
> __uprobe_register+1
> probe_event_enable+399
> perf_trace_event_init+440
> perf_uprobe_init+152
> perf_uprobe_event_init+74
> perf_try_init_event+71
> perf_event_alloc+1681
> __do_sys_perf_event_open+447
> do_syscall_64+130
> entry_SYSCALL_64_after_hwframe+118
>
> __uprobe_register+1
> probe_event_enable+399
> perf_trace_event_init+440
> perf_uprobe_init+152
> perf_uprobe_event_init+74
> perf_try_init_event+71
> perf_event_alloc+1681
> __do_sys_perf_event_open+447
> do_syscall_64+130
> entry_SYSCALL_64_after_hwframe+118
>
> so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach()
> (called by sys_bpf(BPF_LINK_CREATE) ?) in this case.
>
> But again, this is offtopic, please forget.
>
> Oleg.
>
On Mon, Aug 26, 2024 at 3:01 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Mon, Aug 26, 2024 at 11:25:53PM +0200, Oleg Nesterov wrote: > > This is offtopic, sorry for the spam, but... > > > > On 08/26, Jiri Olsa wrote: > > > > > > On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote: > > > > > > > > Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not... > > > > Then which userspace tool uses this code? ;) > > > > > > yes, it will trigger if you attach to multiple uprobes, like with your > > > test example with: > > > > > > # bpftrace -p xxx -e 'uprobe:./ex:func* { printf("%d\n", pid); }' > > > > Hmm. I reserved the testing machine with fedora 40 to play with bpftrace. > > > > dummy.c: > > > > #include <unistd.h> > > > > void func1(void) {} > > void func2(void) {} > > > > int main(void) { for (;;) pause(); } > > > > If I do > > > > # ./dummy & > > # bpftrace -p $! -e 'uprobe:./dummy:func* { printf("%d\n", pid); }' > > > > and run > > > > # bpftrace -e 'kprobe:__uprobe_register { printf("%s\n", kstack); }' > > did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting: > > [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }' > Attaching 1 probe... > > uprobe_register+1 > bpf_uprobe_multi_link_attach+685 > __sys_bpf+9395 > __x64_sys_bpf+26 > do_syscall_64+128 > entry_SYSCALL_64_after_hwframe+118 > > > I'm not sure what's bpftrace version in fedora 40, I'm using upstream build: > > [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi > uprobe_multi: yes > [root@qemu ex]# ../bpftrace/build/src/bpftrace --version > bpftrace v0.20.0 > > So what's the conclusion for the original issue? Do we have a path forward with the fix? > jirka > > > > > on another console I get > > > > Attaching 1 probe... 
> > > > __uprobe_register+1 > > probe_event_enable+399 > > perf_trace_event_init+440 > > perf_uprobe_init+152 > > perf_uprobe_event_init+74 > > perf_try_init_event+71 > > perf_event_alloc+1681 > > __do_sys_perf_event_open+447 > > do_syscall_64+130 > > entry_SYSCALL_64_after_hwframe+118 > > > > __uprobe_register+1 > > probe_event_enable+399 > > perf_trace_event_init+440 > > perf_uprobe_init+152 > > perf_uprobe_event_init+74 > > perf_try_init_event+71 > > perf_event_alloc+1681 > > __do_sys_perf_event_open+447 > > do_syscall_64+130 > > entry_SYSCALL_64_after_hwframe+118 > > > > so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach() > > (called by sys_bpf(BPF_LINK_CREATE) ?) in this case. > > > > But again, this is offtopic, please forget. > > > > Oleg. > >
On 08/27, Jiri Olsa wrote:
>
> did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:
>
> [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
> Attaching 1 probe...
>
> 	uprobe_register+1

so I guess you are on tip/perf/core which killed uprobe_register_refctr()
and changed bpf_uprobe_multi_link_attach() to use uprobe_register

> 	bpf_uprobe_multi_link_attach+685
> 	__sys_bpf+9395
> 	__x64_sys_bpf+26
> 	do_syscall_64+128
> 	entry_SYSCALL_64_after_hwframe+118
>
>
> I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:

bpftrace v0.20.1

> [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
> uprobe_multi: yes

Aha, I get

	uprobe_multi: no

OK. So, on your setup bpftrace uses bpf_uprobe_multi_link_attach()
and this implies ->ret_handler = uprobe_multi_link_ret_handler()
which calls uprobe_prog_run() which does

	if (link->task && current->mm != link->task->mm)
		return 0;

So, can you reproduce the problem reported by Tianyi on your setup?

Oleg.
On 08/27 05:26, Oleg Nesterov wrote:
> __uprobe_register+1
> probe_event_enable+399
> perf_trace_event_init+440
> perf_uprobe_init+152
> perf_uprobe_event_init+74
> perf_try_init_event+71
> perf_event_alloc+1681
> __do_sys_perf_event_open+447
> do_syscall_64+130
> entry_SYSCALL_64_after_hwframe+118
>
> so it seems that bpftrace doesn't use bpf_uprobe_multi_link_attach()
> (called by sys_bpf(BPF_LINK_CREATE) ?) in this case.

I'm using bpftrace v0.21.2 with "uprobe_multi: no", and I got the same
stack as yours. When I use strace to trace bpftrace, I get:

```
$ strace bpftrace -p 38962 -e 'uretprobe:libc:malloc { printf("time=%llu pid=%d\n", elapsed / 1000000000, pid); }' 2>&1 | grep perf_event_open
perf_event_open({type=0x7 /* PERF_TYPE_??? */, size=0x88 /* PERF_ATTR_SIZE_??? */,
config=0x1, sample_period=1, sample_type=0, read_format=0,
precise_ip=0 /* arbitrary skid */, ...}, 38962, -1, -1, PERF_FLAG_FD_CLOEXEC) = 12

$ cat /sys/bus/event_source/devices/uprobe/type
7
```

Here bpftrace is using "uprobe" as the event type, which is registered in

```
static struct pmu perf_uprobe = {
	.task_ctx_nr	= perf_sw_context,
	.event_init	= perf_uprobe_event_init,
	.add		= perf_trace_add,
	.del		= perf_trace_del,
	.start		= perf_swevent_start,
	.stop		= perf_swevent_stop,
	.read		= perf_swevent_read,
	.attr_groups	= uprobe_attr_groups,
};

perf_pmu_register(&perf_uprobe, "uprobe", -1);
```

While I use strace for perf, I get:

```
$ strace perf trace -e probe_libc:malloc__return -p 38962 2>&1 | grep perf_event_open
[...]
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */,
config=1641, sample_period=1,
sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW,
read_format=PERF_FORMAT_ID, disabled=1, mmap=1, comm=1, task=1,
precise_ip=0 /* arbitrary skid */, sample_id_all=1, exclude_guest=1, mmap2=1,
comm_exec=1, ksymbol=1, bpf_event=1, ...}, 38962, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
```

The PERF_TYPE_TRACEPOINT is registered in:

```
static struct pmu perf_tracepoint = {
	.task_ctx_nr	= perf_sw_context,
	.event_init	= perf_tp_event_init,
	.add		= perf_trace_add,
	.del		= perf_trace_del,
	.start		= perf_swevent_start,
	.stop		= perf_swevent_stop,
	.read		= perf_swevent_read,
};

perf_pmu_register(&perf_tracepoint, "tracepoint", PERF_TYPE_TRACEPOINT);
```

So there's a slight difference between the event types being used. I haven't
found the connection between this and the filtering, just FYI.

By the way, I used bpftrace just for convenience in reproducing the issue.
This bug was originally discovered while using cilium/ebpf (a Go eBPF
library), which also uses "uprobe" as the event type [1].

[1] https://github.com/cilium/ebpf/blob/main/link/kprobe.go#L251
On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote: > On 08/26, Jiri Olsa wrote: > > > > On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote: > > > $ ./test & > > > $ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }' > > > > > > I hope that the syntax of the 2nd command is correct... > > > > > > I _think_ that it will print 2 pids too. > > > > yes.. but with CLONE_VM both processes share 'mm' > > Yes sure, > > > so they are threads, > > Well this depends on definition ;) but the CLONE_VM child is not a sub-thread, > it has another TGID. See below. > > > and at least uprobe_multi filters by process [1] now.. ;-) > > OK, if you say that this behaviour is fine I won't argue, I simply do not know. > But see below. > > > > But "perf-record -p" works as expected. > > > > I wonder it's because there's the perf layer that schedules each > > uprobe event only when its process (PID1/2) is scheduled in and will > > receive events only from that cpu while the process is running on it > > Not sure I understand... The task which hits the breakpoint is always > current, it is always scheduled in. > > The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that > we want to avoid __uprobe_perf_func() although this makes sense. > > The main purpose is that we want to remove the breakpoints in current->mm > when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE. > IOW, the main purpose is not penalise user-space unnecessarily. > > IIUC (but I am not sure), perf-record -p will work "correctly" even if we > remove uprobe_perf_filter() altogether. IIRC the perf layer does its own > filtering but I forgot everything. > > And this makes me think that perhaps BPF can't rely on uprobe_perf_filter() > either, even we forget about ret-probes. > > > [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic > > Looks obviously wrong... 
get_pid_task(PIDTYPE_TGID) can return a zombie > leader with ->mm == NULL while other threads and thus the whole process > is still alive. > > And again, the changelog says "the intent for PID filtering it to filter by > *process*", but clone(CLONE_VM) creates another process, not a thread. > > So perhaps we need > > - if (link->task && current->mm != link->task->mm) > + if (link->task && !same_thread_group(current, link->task)) > > in uprobe_prog_run() to make "filter by *process*" true, but this won't > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter(). would the same_thread_group(current, link->task) work in such case? (zombie leader with other alive threads) jirka > > > Does bpftrace use bpf_uprobe_multi_link_attach/etc ? I guess not... > Then which userspace tool uses this code? ;) > > Oleg. >
On Tue, Aug 27, 2024 at 12:08:39PM +0200, Jiri Olsa wrote: > On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote: > > On 08/26, Jiri Olsa wrote: > > > > > > On Mon, Aug 26, 2024 at 12:40:18AM +0200, Oleg Nesterov wrote: > > > > $ ./test & > > > > $ bpftrace -p $! -e 'uprobe:./test:func { printf("%d\n", pid); }' > > > > > > > > I hope that the syntax of the 2nd command is correct... > > > > > > > > I _think_ that it will print 2 pids too. > > > > > > yes.. but with CLONE_VM both processes share 'mm' > > > > Yes sure, > > > > > so they are threads, > > > > Well this depends on definition ;) but the CLONE_VM child is not a sub-thread, > > it has another TGID. See below. > > > > > and at least uprobe_multi filters by process [1] now.. ;-) > > > > OK, if you say that this behaviour is fine I won't argue, I simply do not know. > > But see below. > > > > > > But "perf-record -p" works as expected. > > > > > > I wonder it's because there's the perf layer that schedules each > > > uprobe event only when its process (PID1/2) is scheduled in and will > > > receive events only from that cpu while the process is running on it > > > > Not sure I understand... The task which hits the breakpoint is always > > current, it is always scheduled in. > > > > The main purpose of uprobe_perf_func()->uprobe_perf_filter() is NOT that > > we want to avoid __uprobe_perf_func() although this makes sense. > > > > The main purpose is that we want to remove the breakpoints in current->mm > > when uprobe_perf_filter() returns false, that is why UPROBE_HANDLER_REMOVE. > > IOW, the main purpose is not penalise user-space unnecessarily. > > > > IIUC (but I am not sure), perf-record -p will work "correctly" even if we > > remove uprobe_perf_filter() altogether. IIRC the perf layer does its own > > filtering but I forgot everything. > > > > And this makes me think that perhaps BPF can't rely on uprobe_perf_filter() > > either, even we forget about ret-probes. 
> > > > > [1] 46ba0e49b642 bpf: fix multi-uprobe PID filtering logic > > > > Looks obviously wrong... get_pid_task(PIDTYPE_TGID) can return a zombie > > leader with ->mm == NULL while other threads and thus the whole process > > is still alive. > > > > And again, the changelog says "the intent for PID filtering it to filter by > > *process*", but clone(CLONE_VM) creates another process, not a thread. > > > > So perhaps we need > > > > - if (link->task && current->mm != link->task->mm) > > + if (link->task && !same_thread_group(current, link->task)) > > > > in uprobe_prog_run() to make "filter by *process*" true, but this won't > > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter(). > > would the same_thread_group(current, link->task) work in such case? > (zombie leader with other alive threads) should uprobe_perf_filter use same_thread_group as well instead of mm pointers check? jirka
On 08/27, Jiri Olsa wrote:
>
> On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote:
> >
> > So perhaps we need
> >
> > -	if (link->task && current->mm != link->task->mm)
> > +	if (link->task && !same_thread_group(current, link->task))
> >
> > in uprobe_prog_run() to make "filter by *process*" true, but this won't
> > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter().
>
> would the same_thread_group(current, link->task) work in such case?
> (zombie leader with other alive threads)

Why not? task_struct->signal is stable, it can't be changed.

But again, uprobe_multi_link_filter() won't work if the leader,
uprobe->link->task, exits or it has already exited.

Perhaps something like the additional change below...

Oleg.

--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3322,13 +3322,28 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe,
 	return err;
 }
 
+
 static bool
 uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm)
 {
 	struct bpf_uprobe *uprobe;
+	struct task_struct *task, *t;
+	bool ret = false;
 
 	uprobe = container_of(con, struct bpf_uprobe, consumer);
-	return uprobe->link->task->mm == mm;
+	task = uprobe->link->task;
+
+	rcu_read_lock();
+	for_each_thread(task, t) {
+		struct mm_struct *t_mm = READ_ONCE(t->mm);
+		if (t_mm) {
+			ret = t_mm == mm;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int
On 08/27, Jiri Olsa wrote: > > On Tue, Aug 27, 2024 at 12:08:39PM +0200, Jiri Olsa wrote: > > > > > > - if (link->task && current->mm != link->task->mm) > > > + if (link->task && !same_thread_group(current, link->task)) > > > > > > in uprobe_prog_run() to make "filter by *process*" true, but this won't > > > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter(). > > > > would the same_thread_group(current, link->task) work in such case? > > (zombie leader with other alive threads) > > should uprobe_perf_filter use same_thread_group as well instead > of mm pointers check? uprobe_perf_filter or uprobe_multi_link_filter ? In any case I don't think same_thread_group(current, whatever) can work. For example, uc->filter() can be called from uprobe_register() paths. In this case "current" is the unrelated task which does, say, perf-record, etc. Even if uc->filter() was only called from handler_chain(), it couldn't work, think of UPROBE_HANDLER_REMOVE. See also another email I sent a minute ago. Oleg.
On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote:
> On 08/27, Jiri Olsa wrote:
> >
> > did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting:
> >
> > [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }'
> > Attaching 1 probe...
> >
> > uprobe_register+1
>
> so I guess you are on tip/perf/core which killed uprobe_register_refctr()
> and changed bpf_uprobe_multi_link_attach() to use uprobe_register
>
> > bpf_uprobe_multi_link_attach+685
> > __sys_bpf+9395
> > __x64_sys_bpf+26
> > do_syscall_64+128
> > entry_SYSCALL_64_after_hwframe+118
> >
> >
> > I'm not sure what's bpftrace version in fedora 40, I'm using upstream build:
>
> bpftrace v0.20.1
>
> > [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi
> > uprobe_multi: yes
>
> Aha, I get
>
> 	uprobe_multi: no
>
> OK. So, on your setup bpftrace uses bpf_uprobe_multi_link_attach()
> and this implies ->ret_handler = uprobe_multi_link_ret_handler()
> which calls uprobe_prog_run() which does
>
> 	if (link->task && current->mm != link->task->mm)
> 		return 0;
>
> So, can you reproduce the problem reported by Tianyi on your setup?

yes, I can reproduce the issue with uretprobe on top of perf event uprobe

running 2 tasks of the test code:

	int func() {
		return 0;
	}

	int main() {
		printf("pid: %d\n", getpid());
		while (1) {
			sleep(2);
			func();
		}
	}

and running 2 instances of bpftrace (each with separate pid):

	[root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1018 -e 'uretprobe:./test:func { printf("%d\n", pid); }'
	Attaching 1 probe...
	1018
	1017
	1018
	1017

	[root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1017 -e 'uretprobe:./test:func { printf("%d\n", pid); }'
	Attaching 1 probe...
	1017
	1018
	1017
	1018

will execute bpf program twice for each bpftrace instance, like:

	sched-in 1018
	  perf_trace_add

	-> uprobe-hit
	   handle_swbp
	     handler_chain
	     {
	       for_each_uprobe_consumer {

	         // consumer for task 1019
	         uprobe_dispatcher
	           uprobe_perf_func
	             uprobe_perf_filter returns false

	         // consumer for task 1018
	         uprobe_dispatcher
	           uprobe_perf_func
	             uprobe_perf_filter returns true
	               -> could run bpf program, but none is configured
	       }

	       prepare_uretprobe
	     }

	-> uretprobe-hit
	   handle_swbp
	     uprobe_handle_trampoline
	       handle_uretprobe_chain
	       {
	         for_each_uprobe_consumer {

	           // consumer for task 1019
	           uretprobe_dispatcher
	             uretprobe_perf_func
	               -> runs bpf program

	           // consumer for task 1018
	           uretprobe_dispatcher
	             uretprobe_perf_func
	               -> runs bpf program
	         }
	       }

	sched-out 1019
	  perf_trace_del

and I think the same will happen for perf record in this case where instead of
running the program we will execute perf_tp_event

I think the uretprobe_dispatcher could call filter as suggested in the original
patch.. but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
like we do in handler_chain.. maybe just to save the next uprobe hit which would
remove the uprobe?

jirka
On Tue, Aug 27, 2024 at 12:40:52PM +0200, Oleg Nesterov wrote: > On 08/27, Jiri Olsa wrote: > > > > On Mon, Aug 26, 2024 at 01:57:52PM +0200, Oleg Nesterov wrote: > > > > > > So perhaps we need > > > > > > - if (link->task && current->mm != link->task->mm) > > > + if (link->task && !same_thread_group(current, link->task)) > > > > > > in uprobe_prog_run() to make "filter by *process*" true, but this won't > > > fix the problem with link->task->mm == NULL in uprobe_multi_link_filter(). > > > > would the same_thread_group(current, link->task) work in such case? > > (zombie leader with other alive threads) > > Why not? task_struct->signal is stable, it can't be changed. > > But again, uprobe_multi_link_filter() won't work if the leader, > uprobe->link->task, exits or it has already exited. > > Perhaps something like the additional change below... > > Oleg. > > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -3322,13 +3322,28 @@ static int uprobe_prog_run(struct bpf_uprobe *uprobe, > return err; > } > > + > static bool > uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm) > { > struct bpf_uprobe *uprobe; > + struct task_struct *task, *t; > + bool ret = false; > > uprobe = container_of(con, struct bpf_uprobe, consumer); > - return uprobe->link->task->mm == mm; > + task = uprobe->link->task; > + > + rcu_read_lock(); > + for_each_thread(task, t) { > + struct mm_struct *mm = READ_ONCE(t->mm); > + if (mm) { > + ret = t->mm == mm; > + break; > + } > + } > + rcu_read_unlock(); that seems expensive if there's many threads could we check the leader first and only if it's gone fallback to this? thanks, jirka
On Tue, Aug 27, 2024 at 03:07:01PM +0200, Jiri Olsa wrote: > On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote: > > On 08/27, Jiri Olsa wrote: > > > > > > did you just bpftrace-ed bpftrace? ;-) on my setup I'm getting: > > > > > > [root@qemu ex]# ../bpftrace/build/src/bpftrace -e 'kprobe:uprobe_register { printf("%s\n", kstack); }' > > > Attaching 1 probe... > > > > > > uprobe_register+1 > > > > so I guess you are on tip/perf/core which killed uprobe_register_refctr() > > and changed bpf_uprobe_multi_link_attach() to use uprobe_register > > > > > bpf_uprobe_multi_link_attach+685 > > > __sys_bpf+9395 > > > __x64_sys_bpf+26 > > > do_syscall_64+128 > > > entry_SYSCALL_64_after_hwframe+118 > > > > > > > > > I'm not sure what's bpftrace version in fedora 40, I'm using upstream build: > > > > bpftrace v0.20.1 > > > > > [root@qemu ex]# ../bpftrace/build/src/bpftrace --info 2>&1 | grep uprobe_multi > > > uprobe_multi: yes > > > > Aha, I get > > > > uprobe_multi: no > > > > OK. So, on your setup bpftrace uses bpf_uprobe_multi_link_attach() > > and this implies ->ret_handler = uprobe_multi_link_ret_handler() > > which calls uprobe_prog_run() which does > > > > if (link->task && current->mm != link->task->mm) > > return 0; > > > > So, can you reproduce the problem reported by Tianyi on your setup? > > yes, I can repduce the issue with uretprobe on top of perf event uprobe > > running 2 tasks of the test code: > > int func() { > return 0; > } > > int main() { > printf("pid: %d\n", getpid()); > while (1) { > sleep(2); > func(); > } > } > > and running 2 instances of bpftrace (each with separate pid): > > [root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1018 -e 'uretprobe:./test:func { printf("%d\n", pid); }' > Attaching 1 probe... > 1018 > 1017 > 1018 > 1017 > > [root@qemu ex]# ../bpftrace/build/src/bpftrace -p 1017 -e 'uretprobe:./test:func { printf("%d\n", pid); }' > Attaching 1 probe... 
> 1017
> 1018
> 1017
> 1018
>
> will execute bpf program twice for each bpftrace instance, like:
>
> sched-in 1018
>   perf_trace_add
>
> -> uprobe-hit
>    handle_swbp
>      handler_chain
>      {
>        for_each_uprobe_consumer {
>
>          // consumer for task 1019
>          uprobe_dispatcher
>            uprobe_perf_func
>              uprobe_perf_filter returns false
>
>          // consumer for task 1018
>          uprobe_dispatcher
>            uprobe_perf_func
>              uprobe_perf_filter returns true
>                -> could run bpf program, but none is configured
>        }
>
>        prepare_uretprobe
>      }
>
> -> uretprobe-hit
>    handle_swbp
>      uprobe_handle_trampoline
>        handle_uretprobe_chain
>        {
>          for_each_uprobe_consumer {
>
>            // consumer for task 1019
>            uretprobe_dispatcher
>              uretprobe_perf_func
>                -> runs bpf program
>
>            // consumer for task 1018
>            uretprobe_dispatcher
>              uretprobe_perf_func
>                -> runs bpf program
>          }
>        }
>
> sched-out 1019

ugh... should be 'sched-out 1018'

jirka

>   perf_trace_del
>
>
> and I think the same will happen for perf record in this case where instead of
> running the program we will execute perf_tp_event
>
> I think the uretprobe_dispatcher could call filter as suggested in the original
> patch.. but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
> like we do in handler_chain.. maybe just to save the next uprobe hit which would
> remove the uprobe?
>
> jirka
On 08/27, Jiri Olsa wrote: > > On Tue, Aug 27, 2024 at 12:40:52PM +0200, Oleg Nesterov wrote: > > static bool > > uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm) > > { > > struct bpf_uprobe *uprobe; > > + struct task_struct *task, *t; > > + bool ret = false; > > > > uprobe = container_of(con, struct bpf_uprobe, consumer); > > - return uprobe->link->task->mm == mm; > > + task = uprobe->link->task; > > + > > + rcu_read_lock(); > > + for_each_thread(task, t) { > > + struct mm_struct *mm = READ_ONCE(t->mm); > > + if (mm) { > > + ret = t->mm == mm; > > + break; > > + } > > + } > > + rcu_read_unlock(); > > that seems expensive if there's many threads many threads with ->mm == NULL? In the likely case for_each_thread() stops after the first t->mm check. > could we check the leader first and only if it's gone fallback to this? up to you.. Oleg.
On Tue, Aug 27, 2024 at 04:26:08PM +0200, Oleg Nesterov wrote: > On 08/27, Jiri Olsa wrote: > > > > On Tue, Aug 27, 2024 at 12:40:52PM +0200, Oleg Nesterov wrote: > > > static bool > > > uprobe_multi_link_filter(struct uprobe_consumer *con, struct mm_struct *mm) > > > { > > > struct bpf_uprobe *uprobe; > > > + struct task_struct *task, *t; > > > + bool ret = false; > > > > > > uprobe = container_of(con, struct bpf_uprobe, consumer); > > > - return uprobe->link->task->mm == mm; > > > + task = uprobe->link->task; > > > + > > > + rcu_read_lock(); > > > + for_each_thread(task, t) { > > > + struct mm_struct *mm = READ_ONCE(t->mm); > > > + if (mm) { > > > + ret = t->mm == mm; > > > + break; > > > + } > > > + } > > > + rcu_read_unlock(); > > > > that seems expensive if there's many threads > > many threads with ->mm == NULL? In the likely case for_each_thread() > stops after the first t->mm check. aah the mm will be the same.. right, nice thanks, jirka
On 08/27, Jiri Olsa wrote:
>
> On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote:
> >
> > So, can you reproduce the problem reported by Tianyi on your setup?
>
> yes, I can reproduce the issue with uretprobe on top of perf event uprobe

...

> -> uretprobe-hit
>    handle_swbp
>      uprobe_handle_trampoline
>        handle_uretprobe_chain
>        {
>          for_each_uprobe_consumer {
>
>            // consumer for task 1019
>            uretprobe_dispatcher
>              uretprobe_perf_func
>                -> runs bpf program
>
>            // consumer for task 1018
>            uretprobe_dispatcher
>              uretprobe_perf_func
>                -> runs bpf program

Confused...

I naively thought that if bpftrace uses bpf_uprobe_multi_link_attach() then
it won't use perf/trace_uprobe, and uretprobe-hit will result in

	// current->pid == 1018

	for_each_uprobe_consumer {
		// consumer for task 1019
		uprobe_multi_link_ret_handler
		  uprobe_prog_run
		    -> current->mm != link->task->mm, return

		// consumer for task 1018
		uprobe_multi_link_ret_handler
		  uprobe_prog_run
		    -> current->mm == link->task->mm, run bpf
	}

> I think the uretprobe_dispatcher could call filter as suggested in the original
> patch..

OK, agreed.

> but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
> like we do in handler_chain..

Me too. In any case this is another issue.

Oleg.
Sorry for another reply, I just noticed I missed one part of your email...

On 08/27, Jiri Olsa wrote:
>
> -> uretprobe-hit
>    handle_swbp
>      uprobe_handle_trampoline
>        handle_uretprobe_chain
>        {
>
>          for_each_uprobe_consumer {
>
>            // consumer for task 1019
>            uretprobe_dispatcher
>              uretprobe_perf_func
>                -> runs bpf program
>
>            // consumer for task 1018
>            uretprobe_dispatcher
>              uretprobe_perf_func
>                -> runs bpf program
>
>          }
>        }
>
> and I think the same will happen for perf record in this case where instead of
> running the program we will execute perf_tp_event

Hmm. Really? In this case these 2 different consumers will have the different
trace_event_call's, so

	// consumer for task 1019
	uretprobe_dispatcher
	  uretprobe_perf_func
	    __uprobe_perf_func
	      perf_tp_event

won't store the event because this_cpu_ptr(call->perf_events) should be
hlist_empty() on this CPU, the perf_event for task 1019 wasn't scheduled in
on this CPU...

No?

Ok, looks like I'm totally confused ;)

Oleg.
On Tue, Aug 27, 2024 at 06:45:45PM +0200, Oleg Nesterov wrote:
> On 08/27, Jiri Olsa wrote:
> >
> > On Tue, Aug 27, 2024 at 12:29:38AM +0200, Oleg Nesterov wrote:
> > >
> > > So, can you reproduce the problem reported by Tianyi on your setup?
> >
> > yes, I can reproduce the issue with uretprobe on top of perf event uprobe
>
> ...
>
> > -> uretprobe-hit
> >    handle_swbp
> >      uprobe_handle_trampoline
> >        handle_uretprobe_chain
> >        {
> >
> >          for_each_uprobe_consumer {
> >
> >            // consumer for task 1019
> >            uretprobe_dispatcher
> >              uretprobe_perf_func
> >                -> runs bpf program
> >
> >            // consumer for task 1018
> >            uretprobe_dispatcher
> >              uretprobe_perf_func
> >                -> runs bpf program
>
> Confused...
>
> I naively thought that if bpftrace uses bpf_uprobe_multi_link_attach() then
> it won't use perf/trace_uprobe, and uretprobe-hit will result in

right, this path is for the case when bpftrace attaches to a single uprobe,
but there are 2 instances of bpftrace

jirka

>
> 	// current->pid == 1018
>
> 	for_each_uprobe_consumer {
> 		// consumer for task 1019
> 		uprobe_multi_link_ret_handler
> 		  uprobe_prog_run
> 		    -> current->mm != link->task->mm, return
>
> 		// consumer for task 1018
> 		uprobe_multi_link_ret_handler
> 		  uprobe_prog_run
> 		    -> current->mm == link->task->mm, run bpf
> 	}
>
> > I think the uretprobe_dispatcher could call filter as suggested in the original
> > patch..
>
> OK, agreed.
>
> > but I'm not sure we need to remove the uprobe from handle_uretprobe_chain
> > like we do in handler_chain..
>
> Me too. In any case this is another issue.
>
> Oleg.
>
On Tue, Aug 27, 2024 at 10:19:26PM +0200, Oleg Nesterov wrote:
> Sorry for another reply, I just noticed I missed one part of your email...
>
> On 08/27, Jiri Olsa wrote:
> >
> > -> uretprobe-hit
> >    handle_swbp
> >      uprobe_handle_trampoline
> >        handle_uretprobe_chain
> >        {
> >
> >          for_each_uprobe_consumer {
> >
> >            // consumer for task 1019
> >            uretprobe_dispatcher
> >              uretprobe_perf_func
> >                -> runs bpf program
> >
> >            // consumer for task 1018
> >            uretprobe_dispatcher
> >              uretprobe_perf_func
> >                -> runs bpf program
> >
> >          }
> >        }
> >
> > and I think the same will happen for perf record in this case where instead of
> > running the program we will execute perf_tp_event
>
> Hmm. Really? In this case these 2 different consumers will have the different
> trace_event_call's, so
>
> 	// consumer for task 1019
> 	uretprobe_dispatcher
> 	  uretprobe_perf_func
> 	    __uprobe_perf_func
> 	      perf_tp_event
>
> won't store the event because this_cpu_ptr(call->perf_events) should be
> hlist_empty() on this CPU, the perf_event for task 1019 wasn't scheduled in
> on this CPU...

I'll double check on that, but because there's no filter for uretprobe
I think it will be stored under 1018 event

> No?
>
> Ok, looks like I'm totally confused ;)

I'm working on bpf selftests for above (uprobe and uprobe_multi paths)

I plan to send them together with fixes we discussed earlier,
I'm hoping this will make it more clear

jirka
On 08/28, Jiri Olsa wrote:
>
> On Tue, Aug 27, 2024 at 10:19:26PM +0200, Oleg Nesterov wrote:
> > Hmm. Really? In this case these 2 different consumers will have the different
> > trace_event_call's, so
> >
> > 	// consumer for task 1019
> > 	uretprobe_dispatcher
> > 	  uretprobe_perf_func
> > 	    __uprobe_perf_func
> > 	      perf_tp_event
> >
> > won't store the event because this_cpu_ptr(call->perf_events) should be
> > hlist_empty() on this CPU, the perf_event for task 1019 wasn't scheduled in
> > on this CPU...
>
> I'll double check on that,

Yes, please.

> but because there's no filter for uretprobe
> I think it will be stored under 1018 event

...

> I'm working on bpf selftests for above (uprobe and uprobe_multi paths)

Meanwhile, I decided to try to test this case too ;)

test.c:

	#include <unistd.h>

	int func(int i)
	{
		return i;
	}

	int main(void)
	{
		int i;
		for (i = 0;; ++i) {
			sleep(1);
			func(i);
		}
		return 0;
	}

run_probe.c:

	#include "./include/uapi/linux/perf_event.h"

	#define _GNU_SOURCE
	#include <sys/syscall.h>
	#include <sys/ioctl.h>
	#include <assert.h>
	#include <unistd.h>
	#include <stdlib.h>
	#include <stdio.h>

	// cat /sys/bus/event_source/devices/uprobe/type
	#define UPROBE_TYPE	9

	void run_probe(const char *file, unsigned offset, int pid)
	{
		struct perf_event_attr attr = {
			.type = UPROBE_TYPE,
			.config = 1,	// ret-probe
			.uprobe_path = (unsigned long)file,
			.probe_offset = offset,
			.size = sizeof(struct perf_event_attr),
		};

		int fd = syscall(__NR_perf_event_open, &attr, pid, 0, -1, 0);
		assert(fd >= 0);
		assert(ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0);

		for (;;)
			pause();
	}

	int main(int argc, const char *argv[])
	{
		int pid = atoi(argv[1]);
		run_probe("./test", 0x536, pid);
		return 0;
	}

Now, with the kernel patch below applied, I do

	$ ./test &
	$ PID1=$!
	$ ./test &
	$ PID2=$!
	$ ./run_probe $PID1 &
	$ ./run_probe $PID2 &

and the kernel prints:

	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1
	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1
	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1
	CHAIN
	trace_uprobe: HANDLER pid=46 consumers_target=46 stored=1
	trace_uprobe: HANDLER pid=46 consumers_target=45 stored=0
	CHAIN
	trace_uprobe: HANDLER pid=45 consumers_target=46 stored=0
	trace_uprobe: HANDLER pid=45 consumers_target=45 stored=1

and so on. As you can see, perf_trace_buf_submit/etc is never called for the
"unfiltered" consumer, so I still think that perf is fine wrt filtering.

But I can be easily wrong, please check.

Oleg.
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index acc73c1bc54c..14aa92a78d6d 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -2150,6 +2150,8 @@ handle_uretprobe_chain(struct return_instance *ri, struct pt_regs *regs)
 	struct uprobe *uprobe = ri->uprobe;
 	struct uprobe_consumer *uc;
 
+	pr_crit("CHAIN\n");
+
 	rcu_read_lock_trace();
 	list_for_each_entry_rcu(uc, &uprobe->consumers, cons_node, rcu_read_lock_trace_held()) {
 		if (uc->ret_handler)
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f7443e996b1b..e4eaa0363742 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1364,7 +1364,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
 	return ret;
 }
 
-static void __uprobe_perf_func(struct trace_uprobe *tu,
+static int __uprobe_perf_func(struct trace_uprobe *tu,
 			       unsigned long func, struct pt_regs *regs,
 			       struct uprobe_cpu_buffer **ucbp)
 {
@@ -1375,6 +1375,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	void *data;
 	int size, esize;
 	int rctx;
+	int ret = 0;
 
 #ifdef CONFIG_BPF_EVENTS
 	if (bpf_prog_array_valid(call)) {
@@ -1382,7 +1383,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 
 		ret = bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run);
 		if (!ret)
-			return;
+			return -1;
 	}
 #endif /* CONFIG_BPF_EVENTS */
 
@@ -1392,12 +1393,13 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	size = esize + ucb->dsize;
 	size = ALIGN(size + sizeof(u32), sizeof(u64)) - sizeof(u32);
 	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough"))
-		return;
+		return -1;
 
 	preempt_disable();
 	head = this_cpu_ptr(call->perf_events);
 	if (hlist_empty(head))
 		goto out;
+	ret = 1;
 
 	entry = perf_trace_buf_alloc(size, NULL, &rctx);
 	if (!entry)
@@ -1421,6 +1423,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 				  head, NULL);
  out:
 	preempt_enable();
+	return ret;
 }
 
 /* uprobe profile handler */
@@ -1439,7 +1442,15 @@ static void
 uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
 		    struct pt_regs *regs,
 		    struct uprobe_cpu_buffer **ucbp)
 {
-	__uprobe_perf_func(tu, func, regs, ucbp);
+	struct trace_uprobe_filter *filter = tu->tp.event->filter;
+	struct perf_event *event = list_first_entry(&filter->perf_events,
+					struct perf_event, hw.tp_list);
+
+	int r = __uprobe_perf_func(tu, func, regs, ucbp);
+
+	pr_crit("HANDLER pid=%d consumers_target=%d stored=%d\n",
+		current->pid, event->hw.target ? event->hw.target->pid : -1, r);
+
 }
 
 int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
On Thu, Aug 29, 2024 at 05:20:33PM +0200, Oleg Nesterov wrote:

SNIP

> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index f7443e996b1b..e4eaa0363742 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -1364,7 +1364,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc, struct mm_struct *mm)
>  	return ret;
>  }
> 
> -static void __uprobe_perf_func(struct trace_uprobe *tu,
> +static int __uprobe_perf_func(struct trace_uprobe *tu,
>  			       unsigned long func, struct pt_regs *regs,
>  			       struct uprobe_cpu_buffer **ucbp)
>  {
> @@ -1375,6 +1375,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
>  	void *data;
>  	int size, esize;
>  	int rctx;
> +	int ret = 0;
> 
>  #ifdef CONFIG_BPF_EVENTS
>  	if (bpf_prog_array_valid(call)) {
> @@ -1382,7 +1383,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
> 
>  		ret = bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run);
>  		if (!ret)
> -			return;
> +			return -1;
>  	}
>  #endif /* CONFIG_BPF_EVENTS */
> 
> @@ -1392,12 +1393,13 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
>  	size = esize + ucb->dsize;
>  	size = ALIGN(size + sizeof(u32), sizeof(u64)) - sizeof(u32);
>  	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE, "profile buffer not large enough"))
> -		return;
> +		return -1;
> 
>  	preempt_disable();
>  	head = this_cpu_ptr(call->perf_events);
>  	if (hlist_empty(head))
>  		goto out;

right..
if the event is not added by perf_trace_add on this cpu it won't go past
this point, so no problem for perf

but the issue is with bpf program triggered earlier by return uprobe
created via perf event and [1] patch seems to fix that

I sent out the bpf selftest that triggers the issue [2]

thanks,
jirka

[1] https://lore.kernel.org/linux-trace-kernel/ME0P300MB0416034322B9915ECD3888649D882@ME0P300MB0416.AUSP300.PROD.OUTLOOK.COM/
[2] https://lore.kernel.org/bpf/20240829194505.402807-1-jolsa@kernel.org/T/#u

> +	ret = 1;
> 
>  	entry = perf_trace_buf_alloc(size, NULL, &rctx);
>  	if (!entry)
> @@ -1421,6 +1423,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
>  				  head, NULL);
>   out:
>  	preempt_enable();
> +	return ret;
>  }
> 
>  /* uprobe profile handler */
> @@ -1439,7 +1442,15 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
>  					struct pt_regs *regs,
>  					struct uprobe_cpu_buffer **ucbp)
>  {
> -	__uprobe_perf_func(tu, func, regs, ucbp);
> +	struct trace_uprobe_filter *filter = tu->tp.event->filter;
> +	struct perf_event *event = list_first_entry(&filter->perf_events,
> +					struct perf_event, hw.tp_list);
> +
> +	int r = __uprobe_perf_func(tu, func, regs, ucbp);
> +
> +	pr_crit("HANDLER pid=%d consumers_target=%d stored=%d\n",
> +		current->pid, event->hw.target ? event->hw.target->pid : -1, r);
> +
>  }
> 
>  int bpf_get_uprobe_info(const struct perf_event *event, u32 *fd_type,
>
Ah, we certainly misunderstand each other.

On 08/29, Jiri Olsa wrote:
>
> On Thu, Aug 29, 2024 at 05:20:33PM +0200, Oleg Nesterov wrote:
>
> SNIP

SNIP

> right.. if the event is not added by perf_trace_add on this cpu
> it won't go past this point, so no problem for perf

Yes, and this is what I tried to verify. In your previous email you said

	and I think the same will happen for perf record in this case where instead of
	running the program we will execute perf_tp_event

and I tried to verify this can't happen. So no problem for perf ;)

> but the issue is with bpf program triggered earlier by return uprobe

Well, the issue with bpf program (with the bpf_prog_array_valid(call) code
in __uprobe_perf_func) was clear from the very beginning, no questions.

> and [1] patch seems to fix that

I'd say this patch fixes the symptoms, and it doesn't fix all the problems.
But I can't suggest anything better for bpf code, so I won't really argue.
However the changelog and even the subject are wrong.

> I sent out the bpf selftest that triggers the issue [2]

Thanks, I'll try to take a look tomorrow.

Oleg.
On Thu, Aug 29, 2024 at 11:12:41PM +0200, Oleg Nesterov wrote:
> Ah, we certainly misunderstand each other.
>
> On 08/29, Jiri Olsa wrote:
> >
> > On Thu, Aug 29, 2024 at 05:20:33PM +0200, Oleg Nesterov wrote:
> >
> > SNIP
>
> SNIP
>
> > right.. if the event is not added by perf_trace_add on this cpu
> > it won't go past this point, so no problem for perf
>
> Yes, and this is what I tried to verify. In your previous email you said
>
> 	and I think the same will happen for perf record in this case where instead of
> 	running the program we will execute perf_tp_event
>
> and I tried to verify this can't happen. So no problem for perf ;)

yea, I was wrong, you should be used to it by now ;-)

> > but the issue is with bpf program triggered earlier by return uprobe
>
> Well, the issue with bpf program (with the bpf_prog_array_valid(call) code
> in __uprobe_perf_func) was clear from the very beginning, no questions.
>
> > and [1] patch seems to fix that
>
> I'd say this patch fixes the symptoms, and it doesn't fix all the problems.
> But I can't suggest anything better for bpf code, so I won't really argue.
> However the changelog and even the subject are wrong.
>
> > I sent out the bpf selftest that triggers the issue [2]
>
> Thanks, I'll try to take a look tomorrow.

thanks,
jirka
The whole discussion was very confusing (yes, I too contributed to the
confusion ;), let me try to summarise.

> U(ret)probes are designed to be filterable using the PID, which is the
> second parameter in the perf_event_open syscall. Currently, uprobe works
> well with the filtering, but uretprobe is not affected by it.

And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
misunderstands the purpose of uprobe_perf_filter().

Let's forget about BPF for the moment. It is not that uprobe_perf_filter()
does the filtering by the PID, it doesn't. We can simply kill this function
and perf will work correctly. The perf layer in __uprobe_perf_func() does
the filtering when perf_event->hw.target != NULL.

So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
unapply_uprobe() in handler_chain().

Suppose you do, say,

	$ perf probe -x /path/to/libc some_hot_function
or
	$ perf probe -x /path/to/libc some_hot_function%return
then
	$ perf record -e ... -p 1

to trace the usage of some_hot_function() in the init process. Everything
will work just fine if we kill uprobe_perf_func()->uprobe_perf_filter().

But. If INIT forks a child C, dup_mm() will copy the int3 installed by perf.
So the child C will hit this breakpoint and call handle_swbp/etc for no
reason every time it calls some_hot_function(), not good.

That is why uprobe_perf_func() calls uprobe_perf_filter(), which returns
UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
call unapply_uprobe() which will remove this breakpoint from C->mm.

> We found that the filter function was not invoked when uretprobe was
> initially implemented, and this has been existing for ten years.

See above, this is correct.

Note also that if you only use perf-probe + perf-record, no matter how
many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
uretprobe_perf_func().
IIRC, perf doesn't use create_local_trace_uprobe().

---------------------------------------------------------------------------
Now let's return to BPF and this particular problem. I won't really argue
with this patch, but

  - Please change the subject and update the changelog,
    "Fixes: c1ae5c75e103" and the whole reasoning is misleading
    and wrong, IMO.

  - This patch won't fix all problems because uprobe_perf_filter()
    filters by mm, not by pid. The changelog/patch assumes that it
    is a "PID filter", but it is not.

    See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
    If the traced process does clone(CLONE_VM), bpftrace will hit the
    similar problem, with uprobe or uretprobe.

  - So I still think that the "right" fix should change the
    bpf_prog_run_array_uprobe() paths somehow, but I know nothing
    about bpf.

Oleg.
On 08/30, Oleg Nesterov wrote:
>
> - This patch won't fix all problems because uprobe_perf_filter()
>   filters by mm, not by pid. The changelog/patch assumes that it
>   is a "PID filter", but it is not.
>
>   See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/

Let me provide another example.

spawn.c:

	#include <spawn.h>
	#include <unistd.h>

	int main(int argc, char *argv[])
	{
		int pid;

		sleep(3);	// wait for bpftrace to start
		posix_spawn(&pid, argv[1], NULL, NULL, argv+1, NULL);
		for (;;)
			pause();
	}

Now,

	$ ./spawn /usr/bin/true & bpftrace -p $! -e 'uprobe:/usr/lib64/libc.so.6:sigprocmask { printf("%d\n", pid); }'
	[2] 212851
	Attaching 1 probe...
	212855
	^C

	$ ./spawn /xxxxxxxxxxxx & bpftrace -p $! -e 'uprobe:/usr/lib64/libc.so.6:_exit { printf("%d\n", pid); }'
	[3] 212860
	Attaching 1 probe...
	212865
	^C

This patch can't help in this case.

Oleg.
On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> The whole discussion was very confusing (yes, I too contributed to the
> confusion ;), let me try to summarise.
>
> > U(ret)probes are designed to be filterable using the PID, which is the
> > second parameter in the perf_event_open syscall. Currently, uprobe works
> > well with the filtering, but uretprobe is not affected by it.
>
> And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> misunderstands the purpose of uprobe_perf_filter().
>
> Let's forget about BPF for the moment. It is not that uprobe_perf_filter()
> does the filtering by the PID, it doesn't. We can simply kill this function
> and perf will work correctly. The perf layer in __uprobe_perf_func() does
> the filtering when perf_event->hw.target != NULL.
>
> So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
> the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> unapply_uprobe() in handler_chain().
>
> Suppose you do, say,
>
> 	$ perf probe -x /path/to/libc some_hot_function
> or
> 	$ perf probe -x /path/to/libc some_hot_function%return
> then
> 	$ perf record -e ... -p 1
>
> to trace the usage of some_hot_function() in the init process. Everything
> will work just fine if we kill uprobe_perf_func()->uprobe_perf_filter().
>
> But. If INIT forks a child C, dup_mm() will copy the int3 installed by perf.
> So the child C will hit this breakpoint and call handle_swbp/etc for no
> reason every time it calls some_hot_function(), not good.
>
> That is why uprobe_perf_func() calls uprobe_perf_filter(), which returns
> UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> call unapply_uprobe() which will remove this breakpoint from C->mm.

thanks for the info, I wasn't aware this was the intention

uprobe_multi does not have perf event mechanism/check, so it's using
the filter function to do the process filtering..
which is not working properly as you pointed out earlier

> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has been existing for ten years.
>
> See above, this is correct.
>
> Note also that if you only use perf-probe + perf-record, no matter how
> many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
> uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().
>
> ---------------------------------------------------------------------------
> Now let's return to BPF and this particular problem. I won't really argue
> with this patch, but
>
>   - Please change the subject and update the changelog,
>     "Fixes: c1ae5c75e103" and the whole reasoning is misleading
>     and wrong, IMO.
>
>   - This patch won't fix all problems because uprobe_perf_filter()
>     filters by mm, not by pid. The changelog/patch assumes that it
>     is a "PID filter", but it is not.
>
>     See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
>     If the traced process does clone(CLONE_VM), bpftrace will hit the
>     similar problem, with uprobe or uretprobe.
>
>   - So I still think that the "right" fix should change the
>     bpf_prog_run_array_uprobe() paths somehow, but I know nothing
>     about bpf.

to follow the perf event filter properly, bpf_prog_run_array_uprobe should
be called in perf_tp_event after the perf_tp_event_match call, but that's
already under preempt_disable, so that's why it's called before that

jirka
On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> > The whole discussion was very confusing (yes, I too contributed to the
> > confusion ;), let me try to summarise.
> >
> > > U(ret)probes are designed to be filterable using the PID, which is the
> > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > well with the filtering, but uretprobe is not affected by it.
> >
> > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > misunderstands the purpose of uprobe_perf_filter().
> >
> > Let's forget about BPF for the moment. It is not that uprobe_perf_filter()
> > does the filtering by the PID, it doesn't. We can simply kill this function
> > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > the filtering when perf_event->hw.target != NULL.
> >
> > So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
> > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > unapply_uprobe() in handler_chain().
> >
> > Suppose you do, say,
> >
> > 	$ perf probe -x /path/to/libc some_hot_function
> > or
> > 	$ perf probe -x /path/to/libc some_hot_function%return
> > then
> > 	$ perf record -e ... -p 1
> >
> > to trace the usage of some_hot_function() in the init process. Everything
> > will work just fine if we kill uprobe_perf_func()->uprobe_perf_filter().
> >
> > But. If INIT forks a child C, dup_mm() will copy the int3 installed by perf.
> > So the child C will hit this breakpoint and call handle_swbp/etc for no
> > reason every time it calls some_hot_function(), not good.
> >
> > That is why uprobe_perf_func() calls uprobe_perf_filter(), which returns
> > UPROBE_HANDLER_REMOVE when C hits the breakpoint.
>
> thanks for the info, I wasn't aware this was the intention
>
> uprobe_multi does not have perf event mechanism/check, so it's using
> the filter function to do the process filtering.. which is not working
> properly as you pointed out earlier

So this part I don't completely get. I get that using task->mm
comparison is wrong due to CLONE_VM, but why is the same_thread_group()
check wrong? I.e., why is the task->signal comparison wrong?

Sorry, I tried to follow the thread, but there were a lot of emails with
a bunch of confusion, so I'm not sure I got all the details.

> > > > We found that the filter function was not invoked when uretprobe was
> > > > initially implemented, and this has been existing for ten years.
> > >
> > > See above, this is correct.
> > >
> > > Note also that if you only use perf-probe + perf-record, no matter how
> > > many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
> > > uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().
> > >
> > > ---------------------------------------------------------------------------
> > > Now let's return to BPF and this particular problem. I won't really argue
> > > with this patch, but
> > >
> > >   - Please change the subject and update the changelog,
> > >     "Fixes: c1ae5c75e103" and the whole reasoning is misleading
> > >     and wrong, IMO.
> > >
> > >   - This patch won't fix all problems because uprobe_perf_filter()
> > >     filters by mm, not by pid. The changelog/patch assumes that it
> > >     is a "PID filter", but it is not.
> > >
> > >     See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
> > >     If the traced process does clone(CLONE_VM), bpftrace will hit the
> > >     similar problem, with uprobe or uretprobe.
> > >
> > >   - So I still think that the "right" fix should change the
> > >     bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> > >     about bpf.
>
> to follow the perf event filter properly, bpf_prog_run_array_uprobe should
> be called in perf_tp_event after the perf_tp_event_match call, but that's
> already under preempt_disable, so that's why it's called before that
>
> jirka
On Fri, Aug 30, 2024 at 18:12:41PM +0800, Oleg Nesterov wrote:
> The whole discussion was very confusing (yes, I too contributed to the
> confusion ;), let me try to summarise.
>
> > U(ret)probes are designed to be filterable using the PID, which is the
> > second parameter in the perf_event_open syscall. Currently, uprobe works
> > well with the filtering, but uretprobe is not affected by it.
>
> And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> misunderstands the purpose of uprobe_perf_filter().
>
> Let's forget about BPF for the moment. It is not that uprobe_perf_filter()
> does the filtering by the PID, it doesn't. We can simply kill this function
> and perf will work correctly. The perf layer in __uprobe_perf_func() does
> the filtering when perf_event->hw.target != NULL.
>
> So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
> the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> unapply_uprobe() in handler_chain().
>
> Suppose you do, say,
>
> 	$ perf probe -x /path/to/libc some_hot_function
> or
> 	$ perf probe -x /path/to/libc some_hot_function%return
> then
> 	$ perf record -e ... -p 1
>
> to trace the usage of some_hot_function() in the init process. Everything
> will work just fine if we kill uprobe_perf_func()->uprobe_perf_filter().
>
> But. If INIT forks a child C, dup_mm() will copy the int3 installed by perf.
> So the child C will hit this breakpoint and call handle_swbp/etc for no
> reason every time it calls some_hot_function(), not good.
>
> That is why uprobe_perf_func() calls uprobe_perf_filter(), which returns
> UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> call unapply_uprobe() which will remove this breakpoint from C->mm.
>
> > We found that the filter function was not invoked when uretprobe was
> > initially implemented, and this has been existing for ten years.
>
> See above, this is correct.
>
> Note also that if you only use perf-probe + perf-record, no matter how
> many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into
> uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe().
>

Thanks for the detailed explanation above, I can understand the code now.
Yes, I completely misunderstood the purpose of uprobe_perf_filter, sorry
for the confusion.

> ---------------------------------------------------------------------------
> Now let's return to BPF and this particular problem. I won't really argue
> with this patch, but
>
>   - Please change the subject and update the changelog,
>     "Fixes: c1ae5c75e103" and the whole reasoning is misleading
>     and wrong, IMO.
>
>   - This patch won't fix all problems because uprobe_perf_filter()
>     filters by mm, not by pid. The changelog/patch assumes that it
>     is a "PID filter", but it is not.
>
>     See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/
>     If the traced process does clone(CLONE_VM), bpftrace will hit the
>     similar problem, with uprobe or uretprobe.
>
>   - So I still think that the "right" fix should change the
>     bpf_prog_run_array_uprobe() paths somehow, but I know nothing
>     about bpf.

I agree that this patch does not address the issue correctly.
The PID filter should be implemented within bpf_prog_run_array_uprobe,
or alternatively, bpf_prog_run_array_uprobe should be called after
perf_tp_event_match to reuse the filtering mechanism provided by perf.

Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe.

For now, please forget the original patch as we need a new solution ;)

Thanks for your time,
On 09/02, Tianyi Liu wrote:
>
> On Fri, Aug 30, 2024 at 18:12:41PM +0800, Oleg Nesterov wrote:
> >
> > - So I still think that the "right" fix should change the
> >   bpf_prog_run_array_uprobe() paths somehow, but I know nothing
> >   about bpf.
>
> I agree that this patch does not address the issue correctly.
> The PID filter should be implemented within bpf_prog_run_array_uprobe,

OK,

> or alternatively, bpf_prog_run_array_uprobe should be called after
> perf_tp_event_match to reuse the filtering mechanism provided by perf.

No, no, perf_tp_event_match() has nothing to do with pid/mm filtering in
this case, afaics. See

	https://lore.kernel.org/all/20240829152032.GA23996@redhat.com/

and perf_trace_add() which adds the event to this_cpu_ptr(call->perf_events),
see also the hlist_empty(head) check in __uprobe_perf_func().

Again, again, I can be easily wrong, I forgot everything I knew (not too
much) about perf, but at least in the case above perf_tp_event_match() is
not even called.

> Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe.

May be... Not sure. But why do you think we want it?

And... I think that BPF has even more problems with filtering. Not sure,
I'll try to write another test-case tomorrow.

Oleg.
On Fri, Aug 30, 2024 at 08:51:12AM -0700, Andrii Nakryiko wrote:
> On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote:
> > > The whole discussion was very confusing (yes, I too contributed to the
> > > confusion ;), let me try to summarise.
> > >
> > > > U(ret)probes are designed to be filterable using the PID, which is the
> > > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > > well with the filtering, but uretprobe is not affected by it.
> > >
> > > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > > misunderstands the purpose of uprobe_perf_filter().
> > >
> > > Let's forget about BPF for the moment. It is not that uprobe_perf_filter()
> > > does the filtering by the PID, it doesn't. We can simply kill this function
> > > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > > the filtering when perf_event->hw.target != NULL.
> > >
> > > So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
> > > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > > unapply_uprobe() in handler_chain().
> > >
> > > Suppose you do, say,
> > >
> > > 	$ perf probe -x /path/to/libc some_hot_function
> > > or
> > > 	$ perf probe -x /path/to/libc some_hot_function%return
> > > then
> > > 	$ perf record -e ... -p 1
> > >
> > > to trace the usage of some_hot_function() in the init process. Everything
> > > will work just fine if we kill uprobe_perf_func()->uprobe_perf_filter().
> > >
> > > But. If INIT forks a child C, dup_mm() will copy the int3 installed by perf.
> > > So the child C will hit this breakpoint and call handle_swbp/etc for no
> > > reason every time it calls some_hot_function(), not good.
> > >
> > > That is why uprobe_perf_func() calls uprobe_perf_filter(), which returns
> > > UPROBE_HANDLER_REMOVE when C hits the breakpoint.
handler_chain() will
> > > call unapply_uprobe() which will remove this breakpoint from C->mm.
> >
> > thanks for the info, I wasn't aware this was the intention
> >
> > uprobe_multi does not have perf event mechanism/check, so it's using
> > the filter function to do the process filtering.. which is not working
> > properly as you pointed out earlier
>
> So this part I don't completely get. I get that using task->mm
> comparison is wrong due to CLONE_VM, but why same_thread_group() check
> is wrong? I.e., why task->signal comparison is wrong?

the way I understand it is that we take the group leader task and
store it in bpf_uprobe_multi_link::task

but it can exit while the rest of the threads is still running so
the uprobe_multi_link_filter won't match them (leader->mm is NULL)

Oleg suggested change below (in addition to same_thread_group change)
to take that in account

jirka


---
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 98e395f1baae..9e6b390aa6da 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx
 			 struct mm_struct *mm)
 {
 	struct bpf_uprobe *uprobe;
+	struct task_struct *task, *t;
+	bool ret = false;
 
 	uprobe = container_of(con, struct bpf_uprobe, consumer);
-	return uprobe->link->task->mm == mm;
+	task = uprobe->link->task;
+
+	rcu_read_lock();
+	for_each_thread(task, t) {
+		struct mm_struct *mm = READ_ONCE(t->mm);
+		if (mm) {
+			ret = t->mm == mm;
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return ret;
 }
 
 static int
On 09/02, Oleg Nesterov wrote:
>
> And... I think that BPF has even more problems with filtering. Not sure,
> I'll try to write another test-case tomorrow.

See below. This test-case needs a one-liner patch at the end, but this is only
because I have no idea how to add BPF_EMIT_CALL(BPF_FUNC_trace_printk) into
"struct bpf_insn insns[]" correctly. Is there a simple-to-use user-space tool
which can translate 'bpf_trace_printk("Hello world\n", 13)' into bpf_insn[] ???

So. The CONFIG_BPF_EVENTS code in __uprobe_perf_func() assumes that it "owns"
tu->consumer and uprobe_perf_filter(), but this is not true in general.

test.c:

#include <unistd.h>

int func(int i)
{
	return i;
}

int main(void)
{
	int i;
	for (i = 0;; ++i) {
		sleep(1);
		func(i);
	}
	return 0;
}

run_prog.c:

// cc -I./tools/include -I./tools/include/uapi -Wall
#include "./include/generated/uapi/linux/version.h"
#include <linux/perf_event.h>
#include <linux/filter.h>
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <assert.h>
#include <unistd.h>
#include <stdlib.h>

int prog_load(void)
{
	struct bpf_insn insns[] = {
		BPF_MOV64_IMM(BPF_REG_0, 0),
		BPF_EXIT_INSN(),
	};

	union bpf_attr attr = {
		.prog_type = BPF_PROG_TYPE_KPROBE,
		.insns = (unsigned long)insns,
		.insn_cnt = sizeof(insns) / sizeof(insns[0]),
		.license = (unsigned long)"GPL",
		.kern_version = LINUX_VERSION_CODE, // unneeded
	};

	return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

void run_probe(int eid, int pid)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_TRACEPOINT,
		.config = eid,
		.size = sizeof(struct perf_event_attr),
	};

	int fd = syscall(__NR_perf_event_open, &attr, pid, 0, -1, 0);
	assert(fd >= 0);

	int pfd = prog_load();
	assert(pfd >= 0);

	assert(ioctl(fd, PERF_EVENT_IOC_SET_BPF, pfd) == 0);
	assert(ioctl(fd, PERF_EVENT_IOC_ENABLE, 0) == 0);

	for (;;)
		pause();
}

int main(int argc, const char *argv[])
{
	int eid = atoi(argv[1]);
	int pid = atoi(argv[2]);
	run_probe(eid, pid);
	return 0;
}

Now,

$ ./test &
$ PID1=$!
$ ./test &
$ PID2=$!
$ perf probe -x ./test -a func
$ ./run_prog `cat /sys/kernel/debug/tracing/events/probe_test/func/id` $PID1 &

dmesg -c:
  trace_uprobe: BPF_FUNC: pid=50 ret=0
  trace_uprobe: BPF_FUNC: pid=50 ret=0
  trace_uprobe: BPF_FUNC: pid=50 ret=0
  trace_uprobe: BPF_FUNC: pid=50 ret=0
  ...

So far so good. Now,

$ perf record -e probe_test:func -p $PID2 -- sleep 10 &

This creates another PERF_TYPE_TRACEPOINT perf_event which shares
trace_uprobe/consumer/filter with the perf_event created by run_prog.

dmesg -c:
  trace_uprobe: BPF_FUNC: pid=51 ret=0
  trace_uprobe: BPF_FUNC: pid=50 ret=0
  trace_uprobe: BPF_FUNC: pid=51 ret=0
  trace_uprobe: BPF_FUNC: pid=50 ret=0
  ...

until perf-record exits. and after that

$ perf script

reports nothing.

So, in this case:

- run_prog's bpf program is called when current->pid == $PID2, this patch
  (or any other change in trace_uprobe.c) can't help.

- run_prog's bpf program "steals" __uprobe_perf_func() from /usr/bin/perf
  and to me this is yet another indication that we need some changes in
  the bpf_prog_run_array_uprobe() paths, or even in the user-space
  bpftrace/whatever's code.

And. Why does the "if (bpf_prog_array_valid(call))" block in
__uprobe_perf_func() return? Why does this depend on "ret == 0" ???
I fail to understand this logic.

Oleg.

--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1381,6 +1381,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 		u32 ret;
 
 		ret = bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run);
+		pr_crit("BPF_FUNC: pid=%d ret=%d\n", current->pid, ret);
 		if (!ret)
 			return;
 	}
On Mon, Sep 02, 2024 at 07:17:45PM +0200, Oleg Nesterov wrote: > On 09/02, Oleg Nesterov wrote: > > > > And... I think that BPF has even more problems with filtering. Not sure, > > I'll try to write another test-case tomorrow. > > See below. This test-case needs a one-liner patch at the end, but this is only > because I have no idea how to add BPF_EMIT_CALL(BPF_FUNC_trace_printk) into > "struct bpf_insn insns[]" correctly. Is there a simple-to-use user-space tool > which can translate 'bpf_trace_printk("Hello world\n", 13)' into bpf_insn[] ??? > > So. The CONFIG_BPF_EVENTS code in __uprobe_perf_func() assumes that it "owns" > tu->consumer and uprobe_perf_filter(), but this is not true in general. > > test.c: > #include <unistd.h> > > int func(int i) > { > return i; > } > > int main(void) > { > int i; > for (i = 0;; ++i) { > sleep(1); > func(i); > } > return 0; > } > > run_prog.c > // cc -I./tools/include -I./tools/include/uapi -Wall > #include "./include/generated/uapi/linux/version.h" > #include <linux/perf_event.h> > #include <linux/filter.h> > #define _GNU_SOURCE > #include <sys/syscall.h> > #include <sys/ioctl.h> > #include <assert.h> > #include <unistd.h> > #include <stdlib.h> > > int prog_load(void) > { > struct bpf_insn insns[] = { > BPF_MOV64_IMM(BPF_REG_0, 0), > BPF_EXIT_INSN(), > }; > > union bpf_attr attr = { > .prog_type = BPF_PROG_TYPE_KPROBE, > .insns = (unsigned long)insns, > .insn_cnt = sizeof(insns) / sizeof(insns[0]), > .license = (unsigned long)"GPL", > .kern_version = LINUX_VERSION_CODE, // unneeded > }; > > return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr)); > } > > void run_probe(int eid, int pid) > { > struct perf_event_attr attr = { > .type = PERF_TYPE_TRACEPOINT, > .config = eid, > .size = sizeof(struct perf_event_attr), > }; > > int fd = syscall(__NR_perf_event_open, &attr, pid, 0, -1, 0); > assert(fd >= 0); > > int pfd = prog_load(); > assert(pfd >= 0); > > assert(ioctl(fd, PERF_EVENT_IOC_SET_BPF, pfd) == 0); > assert(ioctl(fd, 
PERF_EVENT_IOC_ENABLE, 0) == 0); > > for (;;) > pause(); > } > > int main(int argc, const char *argv[]) > { > int eid = atoi(argv[1]); > int pid = atoi(argv[2]); > run_probe(eid, pid); > return 0; > } > > Now, > > $ ./test & > $ PID1=$! > $ ./test & > $ PID2=$! > $ perf probe -x ./test -a func > $ ./run_prog `cat /sys/kernel/debug/tracing/events/probe_test/func/id` $PID1 & > > dmesg -c: > trace_uprobe: BPF_FUNC: pid=50 ret=0 > trace_uprobe: BPF_FUNC: pid=50 ret=0 > trace_uprobe: BPF_FUNC: pid=50 ret=0 > trace_uprobe: BPF_FUNC: pid=50 ret=0 > ... > > So far so good. Now, > > $ perf record -e probe_test:func -p $PID2 -- sleep 10 & > > This creates another PERF_TYPE_TRACEPOINT perf_event which shares > trace_uprobe/consumer/filter with the perf_event created by run_prog. > > dmesg -c: > trace_uprobe: BPF_FUNC: pid=51 ret=0 > trace_uprobe: BPF_FUNC: pid=50 ret=0 > trace_uprobe: BPF_FUNC: pid=51 ret=0 > trace_uprobe: BPF_FUNC: pid=50 ret=0 > ... > > until perf-record exits. and after that > > $ perf script > > reports nothing. > > So, in this case: > > - run_prog's bpf program is called when current->pid == $PID2, this patch > (or any other change in trace_uprobe.c) can't help. > > - run_prog's bpf program "steals" __uprobe_perf_func() from /usr/bin/perf ok, there's just one call instance (struct trace_event_call) shared with all the uprobe tracepoints, so if there's bpf program attached to any uprobe tracepoint, it will not continue and send the perf event for any other uprobe tracepoint (without the bpf program attached) I think there might be similar issue with tracepoints and kprobes > > and to me this is yet another indication that we need some changes in the > bpf_prog_run_array_uprobe() paths, or even in the user-space bpftrace/whatever's > code. > > And. Why the "if (bpf_prog_array_valid(call))" block in __uprobe_perf_func() returns? > Why this depends on "ret == 0" ??? I fail to understand this logic. 
I'd think the intention was that if there's a bpf program we don't emit
the perf event.. and we never thought about having them (an event with a
bpf program and a standard perf event) co-existing together.

The problem is that the perf event pid filtering is implemented through
the call->perf_events list.

jirka
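To make the ordering concrete, a paraphrased sketch of the relevant part of
__uprobe_perf_func() in kernel/trace/trace_uprobe.c (condensed from the
discussion above, not the literal source):

```c
static void __uprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
			       struct pt_regs *regs, ...)
{
	struct trace_event_call *call = trace_probe_event_call(&tu->tp);
	struct hlist_head *head;

	/* Any attached BPF program runs first, and a 0 return value
	 * short-circuits the whole function -- suppressing the perf
	 * sample for every consumer sharing this trace_event_call. */
	if (bpf_prog_array_valid(call)) {
		if (!bpf_prog_run_array_uprobe(call->prog_array, regs, bpf_prog_run))
			return;
	}

	/* Only here does the perf-side pid filtering happen: an event
	 * opened with a pid sits on this per-cpu list (populated by
	 * perf_trace_add()) only while its target task is scheduled in. */
	head = this_cpu_ptr(call->perf_events);
	if (hlist_empty(head))
		return;

	/* ... fill and submit the sample to the events on @head ... */
}
```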
On Mon, Sep 2, 2024 at 2:11 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Fri, Aug 30, 2024 at 08:51:12AM -0700, Andrii Nakryiko wrote: > > On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote: > > > > The whole discussion was very confusing (yes, I too contributed to the > > > > confusion ;), let me try to summarise. > > > > > > > > > U(ret)probes are designed to be filterable using the PID, which is the > > > > > second parameter in the perf_event_open syscall. Currently, uprobe works > > > > > well with the filtering, but uretprobe is not affected by it. > > > > > > > > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func() > > > > misunderstands the purpose of uprobe_perf_filter(). > > > > > > > > Lets forget about BPF for the moment. It is not that uprobe_perf_filter() > > > > does the filtering by the PID, it doesn't. We can simply kill this function > > > > and perf will work correctly. The perf layer in __uprobe_perf_func() does > > > > the filtering when perf_event->hw.target != NULL. > > > > > > > > So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid > > > > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger > > > > unapply_uprobe() in handler_chain(). > > > > > > > > Suppose you do, say, > > > > > > > > $ perf probe -x /path/to/libc some_hot_function > > > > or > > > > $ perf probe -x /path/to/libc some_hot_function%return > > > > then > > > > $perf record -e ... -p 1 > > > > > > > > to trace the usage of some_hot_function() in the init process. Everything > > > > will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter(). > > > > > > > > But. If INIT forks a child C, dup_mm() will copy int3 installed by perf. > > > > So the child C will hit this breakpoint and cal handle_swbp/etc for no > > > > reason every time it calls some_hot_function(), not good. 
> > > > > > > > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns > > > > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will > > > > call unapply_uprobe() which will remove this breakpoint from C->mm. > > > > > > thanks for the info, I wasn't aware this was the intention > > > > > > uprobe_multi does not have perf event mechanism/check, so it's using > > > the filter function to do the process filtering.. which is not working > > > properly as you pointed out earlier > > > > So this part I don't completely get. I get that using task->mm > > comparison is wrong due to CLONE_VM, but why same_thread_group() check > > is wrong? I.e., why task->signal comparison is wrong? > > the way I understand it is that we take the group leader task and > store it in bpf_uprobe_multi_link::task > > but it can exit while the rest of the threads is still running so > the uprobe_multi_link_filter won't match them (leader->mm is NULL) Aren't we conflating two things here? Yes, from what Oleg explained, it's clear that using task->mm is wrong. So that is what I feel is the main issue. We shouldn't use task->mm at all, only task->signal should be used instead. We should fix that (in bpf tree, please). But I don't get the concern about linux->mm or linux->signal becoming NULL because of a task existing. Look at put_task_struct(), it WILL NOT call __put_task_struct() (which then calls put_signal_struct()), so task->signal at least will be there and valid until multi-uprobe is detached and we call put_task(). So. Can you please send fixes against the bpf tree, switching to task->signal? And maybe also include the fix to prevent UPROBE_HANDLER_REMOVE to be returned from the BPF program? This thread is almost 50 emails deep now, we should break out of it. We can argue on your actual fixes. 
:) > > Oleg suggested change below (in addition to same_thread_group change) > to take that in account > > jirka > > > --- > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 98e395f1baae..9e6b390aa6da 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx > struct mm_struct *mm) > { > struct bpf_uprobe *uprobe; > + struct task_struct *task, *t; > + bool ret = false; > > uprobe = container_of(con, struct bpf_uprobe, consumer); > - return uprobe->link->task->mm == mm; > + task = uprobe->link->task; > + > + rcu_read_lock(); > + for_each_thread(task, t) { > + struct mm_struct *mm = READ_ONCE(t->mm); > + if (mm) { > + ret = t->mm == mm; > + break; > + } > + } > + rcu_read_unlock(); > + > + return ret; > } > > static int
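For reference, the direction suggested here could look roughly like the
following in the multi-uprobe run path in kernel/trace/bpf_trace.c — a
sketch against the bpf tree, with placement and naming assumed, not the
actual committed fix:

```c
static int uprobe_prog_run(struct bpf_uprobe *uprobe, unsigned long entry_ip,
			   struct pt_regs *regs)
{
	struct bpf_uprobe_multi_link *link = uprobe->link;

	/* Compare thread groups instead of mm pointers: link->task->signal
	 * stays valid for as long as the link holds its task reference
	 * (put_task_struct() defers put_signal_struct()), whereas
	 * link->task->mm goes NULL once the leader exits, and a
	 * clone(CLONE_VM) child would wrongly match on mm anyway. */
	if (link->task && !same_thread_group(current, link->task))
		return 0;

	/* ... run the BPF program as before ... */
}
```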
On Tue, Sep 3, 2024 at 11:09 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Mon, Sep 2, 2024 at 2:11 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Fri, Aug 30, 2024 at 08:51:12AM -0700, Andrii Nakryiko wrote: > > > On Fri, Aug 30, 2024 at 6:34 AM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > > > On Fri, Aug 30, 2024 at 12:12:09PM +0200, Oleg Nesterov wrote: > > > > > The whole discussion was very confusing (yes, I too contributed to the > > > > > confusion ;), let me try to summarise. > > > > > > > > > > > U(ret)probes are designed to be filterable using the PID, which is the > > > > > > second parameter in the perf_event_open syscall. Currently, uprobe works > > > > > > well with the filtering, but uretprobe is not affected by it. > > > > > > > > > > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func() > > > > > misunderstands the purpose of uprobe_perf_filter(). > > > > > > > > > > Lets forget about BPF for the moment. It is not that uprobe_perf_filter() > > > > > does the filtering by the PID, it doesn't. We can simply kill this function > > > > > and perf will work correctly. The perf layer in __uprobe_perf_func() does > > > > > the filtering when perf_event->hw.target != NULL. > > > > > > > > > > So why does uprobe_perf_filter() call uprobe_perf_filter()? Not to avoid > > > > > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger > > > > > unapply_uprobe() in handler_chain(). > > > > > > > > > > Suppose you do, say, > > > > > > > > > > $ perf probe -x /path/to/libc some_hot_function > > > > > or > > > > > $ perf probe -x /path/to/libc some_hot_function%return > > > > > then > > > > > $perf record -e ... -p 1 > > > > > > > > > > to trace the usage of some_hot_function() in the init process. Everything > > > > > will work just fine if we kill uprobe_perf_filter()->uprobe_perf_filter(). > > > > > > > > > > But. If INIT forks a child C, dup_mm() will copy int3 installed by perf. 
> > > > > So the child C will hit this breakpoint and cal handle_swbp/etc for no > > > > > reason every time it calls some_hot_function(), not good. > > > > > > > > > > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns > > > > > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will > > > > > call unapply_uprobe() which will remove this breakpoint from C->mm. > > > > > > > > thanks for the info, I wasn't aware this was the intention > > > > > > > > uprobe_multi does not have perf event mechanism/check, so it's using > > > > the filter function to do the process filtering.. which is not working > > > > properly as you pointed out earlier > > > > > > So this part I don't completely get. I get that using task->mm > > > comparison is wrong due to CLONE_VM, but why same_thread_group() check > > > is wrong? I.e., why task->signal comparison is wrong? > > > > the way I understand it is that we take the group leader task and > > store it in bpf_uprobe_multi_link::task > > > > but it can exit while the rest of the threads is still running so > > the uprobe_multi_link_filter won't match them (leader->mm is NULL) > > Aren't we conflating two things here? Yes, from what Oleg explained, > it's clear that using task->mm is wrong. So that is what I feel is the > main issue. We shouldn't use task->mm at all, only task->signal should > be used instead. We should fix that (in bpf tree, please). > > But I don't get the concern about linux->mm or linux->signal becoming correction, we shouldn't worry about *linux->signal* becoming NULL. linux->mm can become NULL, but we don't care about that (once we fix filtering logic in multi-uprobe). > NULL because of a task existing. Look at put_task_struct(), it WILL > NOT call __put_task_struct() (which then calls put_signal_struct()), > so task->signal at least will be there and valid until multi-uprobe is > detached and we call put_task(). > > So. 
Can you please send fixes against the bpf tree, switching to > task->signal? And maybe also include the fix to prevent > UPROBE_HANDLER_REMOVE to be returned from the BPF program? > > This thread is almost 50 emails deep now, we should break out of it. > We can argue on your actual fixes. :) > > > > > Oleg suggested change below (in addition to same_thread_group change) > > to take that in account > > > > jirka > > > > > > --- > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > index 98e395f1baae..9e6b390aa6da 100644 > > --- a/kernel/trace/bpf_trace.c > > +++ b/kernel/trace/bpf_trace.c > > @@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx > > struct mm_struct *mm) > > { > > struct bpf_uprobe *uprobe; > > + struct task_struct *task, *t; > > + bool ret = false; > > > > uprobe = container_of(con, struct bpf_uprobe, consumer); > > - return uprobe->link->task->mm == mm; > > + task = uprobe->link->task; > > + > > + rcu_read_lock(); > > + for_each_thread(task, t) { > > + struct mm_struct *mm = READ_ONCE(t->mm); > > + if (mm) { > > + ret = t->mm == mm; > > + break; > > + } > > + } > > + rcu_read_unlock(); > > + > > + return ret; > > } > > > > static int
On Tue, Sep 03, 2024 at 11:11:06AM -0700, Andrii Nakryiko wrote: SNIP > > Aren't we conflating two things here? Yes, from what Oleg explained, > > it's clear that using task->mm is wrong. So that is what I feel is the > > main issue. We shouldn't use task->mm at all, only task->signal should > > be used instead. We should fix that (in bpf tree, please). > > > > But I don't get the concern about linux->mm or linux->signal becoming > > correction, we shouldn't worry about *linux->signal* becoming NULL. > linux->mm can become NULL, but we don't care about that (once we fix > filtering logic in multi-uprobe). > > > NULL because of a task existing. Look at put_task_struct(), it WILL > > NOT call __put_task_struct() (which then calls put_signal_struct()), > > so task->signal at least will be there and valid until multi-uprobe is > > detached and we call put_task(). > > > > So. Can you please send fixes against the bpf tree, switching to > > task->signal? And maybe also include the fix to prevent > > UPROBE_HANDLER_REMOVE to be returned from the BPF program? ok, it's uprobe-multi specific, let's discuss that over the change itself, I'll try to send it soon jirka > > > > This thread is almost 50 emails deep now, we should break out of it. > > We can argue on your actual fixes. 
:) > > > > > > > > Oleg suggested change below (in addition to same_thread_group change) > > > to take that in account > > > > > > jirka > > > > > > > > > --- > > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > > > index 98e395f1baae..9e6b390aa6da 100644 > > > --- a/kernel/trace/bpf_trace.c > > > +++ b/kernel/trace/bpf_trace.c > > > @@ -3235,9 +3235,23 @@ uprobe_multi_link_filter(struct uprobe_consumer *con, enum uprobe_filter_ctx ctx > > > struct mm_struct *mm) > > > { > > > struct bpf_uprobe *uprobe; > > > + struct task_struct *task, *t; > > > + bool ret = false; > > > > > > uprobe = container_of(con, struct bpf_uprobe, consumer); > > > - return uprobe->link->task->mm == mm; > > > + task = uprobe->link->task; > > > + > > > + rcu_read_lock(); > > > + for_each_thread(task, t) { > > > + struct mm_struct *mm = READ_ONCE(t->mm); > > > + if (mm) { > > > + ret = t->mm == mm; > > > + break; > > > + } > > > + } > > > + rcu_read_unlock(); > > > + > > > + return ret; > > > } > > > > > > static int
On Mon, Sep 02, 2024 at 03:22:25AM +0800, Tianyi Liu wrote:
> On Fri, Aug 30, 2024 at 18:12:41PM +0800, Oleg Nesterov wrote:
> > The whole discussion was very confusing (yes, I too contributed to the
> > confusion ;), let me try to summarise.
> >
> > > U(ret)probes are designed to be filterable using the PID, which is the
> > > second parameter in the perf_event_open syscall. Currently, uprobe works
> > > well with the filtering, but uretprobe is not affected by it.
> >
> > And this is correct. But the CONFIG_BPF_EVENTS code in __uprobe_perf_func()
> > misunderstands the purpose of uprobe_perf_filter().
> >
> > Lets forget about BPF for the moment. It is not that uprobe_perf_filter()
> > does the filtering by the PID, it doesn't. We can simply kill this function
> > and perf will work correctly. The perf layer in __uprobe_perf_func() does
> > the filtering when perf_event->hw.target != NULL.
> >
> > So why does uprobe_perf_func() call uprobe_perf_filter()? Not to avoid
> > the __uprobe_perf_func() call (as the BPF code assumes), but to trigger
> > unapply_uprobe() in handler_chain().
> >
> > Suppose you do, say,
> >
> > 	$ perf probe -x /path/to/libc some_hot_function
> > or
> > 	$ perf probe -x /path/to/libc some_hot_function%return
> > then
> > 	$ perf record -e ... -p 1
> >
> > to trace the usage of some_hot_function() in the init process. Everything
> > will work just fine if we kill the uprobe_perf_func()->uprobe_perf_filter()
> > call.
> >
> > But. If INIT forks a child C, dup_mm() will copy int3 installed by perf.
> > So the child C will hit this breakpoint and call handle_swbp/etc for no
> > reason every time it calls some_hot_function(), not good.
> >
> > That is why uprobe_perf_func() calls uprobe_perf_filter() which returns
> > UPROBE_HANDLER_REMOVE when C hits the breakpoint. handler_chain() will
> > call unapply_uprobe() which will remove this breakpoint from C->mm.
> > > > > We found that the filter function was not invoked when uretprobe was > > > initially implemented, and this has been existing for ten years. > > > > See above, this is correct. > > > > Note also that if you only use perf-probe + perf-record, no matter how > > many instances, you can even add BUG_ON(!uprobe_perf_filter(...)) into > > uretprobe_perf_func(). IIRC, perf doesn't use create_local_trace_uprobe(). > > > > Thanks for the detailed explanation above, I can understand the code now. > Yes, I completely misunderstood the purpose of uprobe_perf_filter, > sorry for the confusion. > > > --------------------------------------------------------------------------- > > Now lets return to BPF and this particular problem. I won't really argue > > with this patch, but > > > > - Please change the subject and update the changelog, > > "Fixes: c1ae5c75e103" and the whole reasoning is misleading > > and wrong, IMO. > > > > - This patch won't fix all problems because uprobe_perf_filter() > > filters by mm, not by pid. The changelog/patch assumes that it > > is a "PID filter", but it is not. > > > > See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/ > > If the traced process does clone(CLONE_VM), bpftrace will hit the > > similar problem, with uprobe or uretprobe. > > > > - So I still think that the "right" fix should change the > > bpf_prog_run_array_uprobe() paths somehow, but I know nothing > > about bpf. > > I agree that this patch does not address the issue correctly. > The PID filter should be implemented within bpf_prog_run_array_uprobe, > or alternatively, bpf_prog_run_array_uprobe should be called after > perf_tp_event_match to reuse the filtering mechanism provided by perf. > > Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe. > > For now, please forget the original patch as we need a new solution ;) hi, any chance we could go with your fix until we find better solution? 
it's simple and it fixes most of the cases for the return uprobe pid filter
for events with bpf programs.. I know during the discussion we found that
the standard perf record path won't work if there's a bpf program attached
on the same event, but I think that likely needs more changes, and it's
also not a common use case.

would you consider sending another version addressing Oleg's points for
the changelog above?

thanks,
jirka
On 09/06, Jiri Olsa wrote:
>
> On Mon, Sep 02, 2024 at 03:22:25AM +0800, Tianyi Liu wrote:
> >
> > For now, please forget the original patch as we need a new solution ;)
>
> hi,
> any chance we could go with your fix until we find better solution?

Well, as I said from the very beginning I won't really argue even if I
obviously don't like this change very much. As long as the changelog /
comments clearly explain this change. I understand that sometimes an
ugly/incomplete/whatever workaround is better than nothing.

> it's simple and it fixes most of the cases for return uprobe pid filter
> for events with bpf programs..

But to remind, it doesn't even fix all the filtering problems with
uprobes, let alone uretprobes.

> I know during the discussion we found
> that standard perf record path won't work if there's bpf program
> attached on the same event,

Ah. Yes, this is another problem I tried to point out. But if we discuss
the filtering we can forget about /usr/bin/perf.

Again, again, again, I know nothing about bpf. But it seems to me that
perf_event_attach_bpf_prog() allows attaching up to BPF_TRACE_MAX_PROGS
progs to event->tp_event->prog_array, and then bpf_prog_run_array_uprobe()
should run them all. Right?

So I think that if you run 2 instances of run_prog from my last test-case
with $PID1 and $PID2, the filtering will be broken again. Both instances
will share the same trace_event_call and the same trace_uprobe_filter.

> and also it's not a common use case

OK.

And btw... Can bpftrace attach to the uprobe tp?

	# perf probe -x ./test -a func
	Added new event:
	  probe_test:func      (on func in /root/TTT/test)

	You can now use it in all perf tools, such as:

		perf record -e probe_test:func -aR sleep 1

	# bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
	Attaching 1 probe...
	ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
	ERROR: Error attaching probe: tracepoint:probe_test:func

Oleg.
On Mon, Sep 06, 2024 at 18:43:00AM +0800, Jiri Olsa wrote: SNIP > > > --------------------------------------------------------------------------- > > > Now lets return to BPF and this particular problem. I won't really argue > > > with this patch, but > > > > > > - Please change the subject and update the changelog, > > > "Fixes: c1ae5c75e103" and the whole reasoning is misleading > > > and wrong, IMO. > > > > > > - This patch won't fix all problems because uprobe_perf_filter() > > > filters by mm, not by pid. The changelog/patch assumes that it > > > is a "PID filter", but it is not. > > > > > > See https://lore.kernel.org/linux-trace-kernel/20240825224018.GD3906@redhat.com/ > > > If the traced process does clone(CLONE_VM), bpftrace will hit the > > > similar problem, with uprobe or uretprobe. > > > > > > - So I still think that the "right" fix should change the > > > bpf_prog_run_array_uprobe() paths somehow, but I know nothing > > > about bpf. > > > > I agree that this patch does not address the issue correctly. > > The PID filter should be implemented within bpf_prog_run_array_uprobe, > > or alternatively, bpf_prog_run_array_uprobe should be called after > > perf_tp_event_match to reuse the filtering mechanism provided by perf. > > > > Also, uretprobe may need UPROBE_HANDLER_REMOVE, similar to uprobe. > > > > For now, please forget the original patch as we need a new solution ;) > > hi, > any chance we could go with your fix until we find better solution? > > it's simple and it fixes most of the cases for return uprobe pid filter > for events with bpf programs.. I know during the discussion we found > that standard perf record path won't work if there's bpf program > attached on the same event, but I think that likely it needs more > changes and also it's not a common use case > > would you consider sending another version addressing Oleg's points > for changelog above? My pleasure, I'll resend the updated patch in a new thread. 
Based on previous discussions, `uprobe_perf_filter` acts as a preliminary
filter that removes breakpoints when they are no longer needed. More
complex filtering mechanisms related to perf are implemented in
perf-specific paths.

From my understanding, the original patch attempted to partially implement
UPROBE_HANDLER_REMOVE (since it didn't actually remove the breakpoint but
only prevented it from entering the BPF-related code). I'm trying to
provide a complete implementation, i.e., removing the breakpoint when
`uprobe_perf_filter` returns false, similar to how uprobe functions.
However, this would require merging the following functions, because they
would be almost the same:

	uprobe_perf_func / uretprobe_perf_func
	uprobe_dispatcher / uretprobe_dispatcher
	handler_chain / handle_uretprobe_chain

I'm not sure why the paths of uprobe and uretprobe are entirely different.
I suspect that uretprobe might have been implemented later than uprobe and
was only partially implemented. Oleg, do you have more insights on this?
In your opinion, does uretprobe need UPROBE_HANDLER_REMOVE? If so, is it a
good idea to merge the paths of uprobe and uretprobe?

Regardless, if we only need a temporary and incomplete fix, I will modify
only the commit message according to Oleg's suggestions and resend it.
I'm aware that using `uprobe_perf_filter` in `uretprobe_perf_func` is not
the solution for BPF filtering; I'm just trying to alleviate the issue in
some simple cases.

Sorry for the late reply, as I spent a long time looking at the details
discussed above.

Thanks,
On 09/08, Tianyi Liu wrote:
>
> On Mon, Sep 06, 2024 at 18:43:00AM +0800, Jiri Olsa wrote:
>
> > would you consider sending another version addressing Oleg's points
> > for changelog above?
>
> My pleasure, I'll resend the updated patch in a new thread.
>
> Based on previous discussions, `uprobe_perf_filter` acts as a preliminary
> filter that removes breakpoints when they are no longer needed.

Well. Not only. See the usage of consumer_filter() and filter_chain() in
register_for_each_vma().

> More complex filtering mechanisms related to perf are implemented in
> perf-specific paths.

The perf paths in __uprobe_perf_func() do the filtering based on
perf_event->hw.target, that is all.

But uprobe_perf_filter() or any other consumer->filter() simply can't rely
on pid/task, it has to check ->mm.

> From my understanding, the original patch attempted to partially implement
> UPROBE_HANDLER_REMOVE (since it didn't actually remove the breakpoint but
> only prevented it from entering the BPF-related code).

Confused...

Your patch can help bpftrace, although it (or any other change in
trace_uprobe.c) can't actually fix all the problems with bpf/filtering,
even if we forget about ret-probes.

And I don't understand how this relates to UPROBE_HANDLER_REMOVE...

> I'm trying to provide a complete implementation, i.e., removing the
> breakpoint when `uprobe_perf_filter` returns false, similar to how uprobe
> functions. However, this would require merging the following functions,
> because they will almost be the same:
>
> uprobe_perf_func / uretprobe_perf_func
> uprobe_dispatcher / uretprobe_dispatcher
> handler_chain / handle_uretprobe_chain

Sorry, I don't understand... Yes, uprobe_dispatcher and uretprobe_dispatcher
can share more code or even be unified, but

> I suspect that uretprobe might have been implemented later than uprobe

Yes,

> and was only partially implemented.

what do you mean?

But whatever you meant, I agree that this code doesn't look pretty and can
be cleaned up.
> In your opinion, does uretprobe need UPROBE_HANDLER_REMOVE?

Probably. But this has absolutely nothing to do with the filtering problem?
Can we discuss this separately?

> I'm aware that using `uprobe_perf_filter` in `uretprobe_perf_func` is not
> the solution for BPF filtering. I'm just trying to alleviate the issue
> in some simple cases.

Agreed.

-------------------------------------------------------------------------------
To summarise.

This code is very old, and it was written for /usr/bin/perf which attaches
to the tracepoint. So multiple instances of perf-record will share the same
consumer/trace_event_call/filter. uretprobe_perf_func() doesn't call
uprobe_perf_filter() because (if /usr/bin/perf is the only user) in the
likely case it would burn CPU and return true. Quite possibly this design
was not optimal from the very beginning, I simply can't recall why the
is_ret_probe() consumer has ->handler != NULL, but it was not buggy.

Now we have bpf, create_local_trace_uprobe(), etc. So let's add another
uprobe_perf_filter() call into uretprobe_perf_func() as your patch did.

Then we can probably change uprobe_handle_trampoline() to do
unapply_uprobe() if all the ret-handlers return UPROBE_HANDLER_REMOVE, like
handler_chain() does.

Then we can probably cleanup/simplify trace_uprobe.c; in particular we can
change alloc_trace_uprobe():

-	tu->consumer.handler = uprobe_dispatcher;
-	if (is_ret)
-		tu->consumer.ret_handler = uretprobe_dispatcher;
+	if (is_ret)
+		tu->consumer.ret_handler = uretprobe_dispatcher;
+	else
+		tu->consumer.handler = uprobe_dispatcher;

and do more (including unrelated) cleanups.

But let's do this step-by-step.

And let's not mix the filtering issues with the UPROBE_HANDLER_REMOVE
logic; to me this adds unnecessary confusion.

Oleg.
On Sun, Sep 8, 2024 at 6:15 AM Oleg Nesterov <oleg@redhat.com> wrote:
>
> On 09/08, Tianyi Liu wrote:
> >
> > On Mon, Sep 06, 2024 at 18:43:00AM +0800, Jiri Olsa wrote:
> >
> > > would you consider sending another version addressing Oleg's points
> > > for changelog above?
> >
> > My pleasure, I'll resend the updated patch in a new thread.
> >
> > Based on previous discussions, `uprobe_perf_filter` acts as a preliminary
> > filter that removes breakpoints when they are no longer needed.
>
> Well. Not only. See the usage of consumer_filter() and filter_chain() in
> register_for_each_vma().
>
> > More complex filtering mechanisms related to perf are implemented in
> > perf-specific paths.
>
> The perf paths in __uprobe_perf_func() do the filtering based on
> perf_event->hw.target, that is all.
>
> But uprobe_perf_filter() or any other consumer->filter() simply can't rely
> on pid/task, it has to check ->mm.
>
> > From my understanding, the original patch attempted to partially implement
> > UPROBE_HANDLER_REMOVE (since it didn't actually remove the breakpoint but
> > only prevented it from entering the BPF-related code).
>
> Confused...
>
> Your patch can help bpftrace although it (or any other change in
> trace_uprobe.c) can't actually fix all the problems with bpf/filtering
> even if we forget about ret-probes.
>
> And I don't understand how this relates to UPROBE_HANDLER_REMOVE...
>
> > I'm trying to provide a complete implementation, i.e., removing the
> > breakpoint when `uprobe_perf_filter` returns false, similar to how uprobe
> > functions. However, this would require merging the following functions,
> > because they will almost be the same:
> >
> > uprobe_perf_func / uretprobe_perf_func
> > uprobe_dispatcher / uretprobe_dispatcher
> > handler_chain / handle_uretprobe_chain
>
> Sorry, I don't understand...
> Yes, uprobe_dispatcher and uretprobe_dispatcher
> can share more code or even be unified, but
>
> > I suspect that uretprobe might have been implemented later than uprobe
>
> Yes,
>
> > and was only partially implemented.
>
> what do you mean?
>
> But whatever you meant, I agree that this code doesn't look pretty and can
> be cleaned up.
>
> > In your opinion, does uretprobe need UPROBE_HANDLER_REMOVE?
>
> Probably. But this has absolutely nothing to do with the filtering problem?
> Can we discuss this separately?
>
> > I'm aware that using `uprobe_perf_filter` in `uretprobe_perf_func` is not
> > the solution for BPF filtering. I'm just trying to alleviate the issue
> > in some simple cases.
>
> Agreed.
>
> -------------------------------------------------------------------------------
> To summarise.
>
> This code is very old, and it was written for /usr/bin/perf which attaches
> to the tracepoint. So multiple instances of perf-record will share the same
> consumer/trace_event_call/filter. uretprobe_perf_func() doesn't call
> uprobe_perf_filter() because (if /usr/bin/perf is the only user) in the
> likely case it would burn CPU and return true. Quite possibly this design
> was not optimal from the very beginning, I simply can't recall why the
> is_ret_probe() consumer has ->handler != NULL, but it was not buggy.
>
> Now we have bpf, create_local_trace_uprobe(), etc. So let's add another
> uprobe_perf_filter() call into uretprobe_perf_func() as your patch did.
>
> Then we can probably change uprobe_handle_trampoline() to do
> unapply_uprobe() if all the ret-handlers return UPROBE_HANDLER_REMOVE, like
> handler_chain() does.
>
> Then we can probably cleanup/simplify trace_uprobe.c; in particular we can
> change alloc_trace_uprobe():
>
> -	tu->consumer.handler = uprobe_dispatcher;
> -	if (is_ret)
> -		tu->consumer.ret_handler = uretprobe_dispatcher;
> +	if (is_ret)
> +		tu->consumer.ret_handler = uretprobe_dispatcher;
> +	else
> +		tu->consumer.handler = uprobe_dispatcher;
>
> and do more (including unrelated) cleanups.
>
> But let's do this step-by-step.
>
> And let's not mix the filtering issues with the UPROBE_HANDLER_REMOVE
> logic; to me this adds unnecessary confusion.

+100 to all of the above. This has been a very long and winding thread, but
uretprobes are still being triggered unnecessarily :) Let's fix the bleeding
and talk and work on improving all of the other issues in separate efforts.

>
> Oleg.
>
On Fri, Sep 06, 2024 at 09:18:15PM +0200, Oleg Nesterov wrote:
> On 09/06, Jiri Olsa wrote:
> >
> > On Mon, Sep 02, 2024 at 03:22:25AM +0800, Tianyi Liu wrote:
> > >
> > > For now, please forget the original patch as we need a new solution ;)
> >
> > hi,
> > any chance we could go with your fix until we find better solution?
>
> Well, as I said from the very beginning I won't really argue even if
> I obviously don't like this change very much. As long as the changelog /
> comments clearly explain this change. I understand that sometimes an
> ugly/incomplete/whatever workaround is better than nothing.
>
> > it's simple and it fixes most of the cases for return uprobe pid filter
> > for events with bpf programs..
>
> But to remind, it doesn't even fix all the filtering problems with uprobes,
> let alone uretprobes,
>
> > I know during the discussion we found
> > that standard perf record path won't work if there's bpf program
> > attached on the same event,
>
> Ah. Yes, this is another problem I tried to point out. But if we discuss
> the filtering we can forget about /usr/bin/perf.
>
> Again, again, again, I know nothing about bpf. But it seems to me that
> perf_event_attach_bpf_prog() allows to attach up to BPF_TRACE_MAX_PROGS
> progs to event->tp_event->prog_array, and then bpf_prog_run_array_uprobe()
> should run them all. Right?
>
> So I think that if you run 2 instances of run_prog from my last test-case
> with $PID1 and $PID2, the filtering will be broken again. Both instances
> will share the same trace_event_call and the same trace_uprobe_filter.
>
> > and also it's not a common use case
>
> OK.
>
> And btw... Can bpftrace attach to the uprobe tp?
>
> # perf probe -x ./test -a func
> Added new event:
>   probe_test:func (on func in /root/TTT/test)
>
> You can now use it in all perf tools, such as:
>
>   perf record -e probe_test:func -aR sleep 1
>
> # bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
> Attaching 1 probe...
> ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
> ERROR: Error attaching probe: tracepoint:probe_test:func

the problem here is that bpftrace assumes BPF_PROG_TYPE_TRACEPOINT type
for the bpf program, but that will fail in perf_event_set_bpf_prog, where
the perf event will be identified as a uprobe, which demands the bpf
program type to be BPF_PROG_TYPE_KPROBE

there's the same issue with kprobe as well:

# perf probe -a ksys_read
Added new event:
  probe:ksys_read (on ksys_read)

You can now use it in all perf tools, such as:

  perf record -e probe:ksys_read -aR sleep 1

# bpftrace -e 'tracepoint:probe:ksys_read { printf("%d\n", pid); }'
Attaching 1 probe...
ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
ERROR: Error attaching probe: tracepoint:probe:ksys_read

I'm not sure there's an easy way to filter these events from the bpftrace -l
output (or change the program type accordingly), because I don't think
there's a way to find out the tracepoint subtype (kprobe/uprobe) from the
tracefs record

jirka
On 09/09, Jiri Olsa wrote:
>
> On Fri, Sep 06, 2024 at 09:18:15PM +0200, Oleg Nesterov wrote:
> >
> > And btw... Can bpftrace attach to the uprobe tp?
> >
> > # perf probe -x ./test -a func
> > Added new event:
> >   probe_test:func (on func in /root/TTT/test)
> >
> > You can now use it in all perf tools, such as:
> >
> >   perf record -e probe_test:func -aR sleep 1
> >
> > # bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
> > Attaching 1 probe...
> > ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
> > ERROR: Error attaching probe: tracepoint:probe_test:func
>
> the problem here is that bpftrace assumes BPF_PROG_TYPE_TRACEPOINT type
> for bpf program, but that will fail in perf_event_set_bpf_prog where
> perf event will be identified as uprobe and demands bpf program type
> to be BPF_PROG_TYPE_KPROBE

Yes, thanks, I know,

> I don't think
> there's a way to find out the tracepoint subtype (kprobe/uprobe) from
> the tracefs record

Hmm, indeed. It seems that it is not possible to derive tp_event->flags
from tracefs...

Perhaps bpftrace could look for probe_test:func in [uk]probe_events?
Or simply retry ioctl(PERF_EVENT_IOC_SET_BPF) with BPF_PROG_TYPE_KPROBE
if BPF_PROG_TYPE_TRACEPOINT returns EINVAL? Ugly, yes.

Oleg.
On Mon, Sep 09, 2024 at 08:34:36PM +0200, Oleg Nesterov wrote:
> On 09/09, Jiri Olsa wrote:
> >
> > On Fri, Sep 06, 2024 at 09:18:15PM +0200, Oleg Nesterov wrote:
> > >
> > > And btw... Can bpftrace attach to the uprobe tp?
> > >
> > > # perf probe -x ./test -a func
> > > Added new event:
> > >   probe_test:func (on func in /root/TTT/test)
> > >
> > > You can now use it in all perf tools, such as:
> > >
> > >   perf record -e probe_test:func -aR sleep 1
> > >
> > > # bpftrace -e 'tracepoint:probe_test:func { printf("%d\n", pid); }'
> > > Attaching 1 probe...
> > > ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
> > > ERROR: Error attaching probe: tracepoint:probe_test:func
> >
> > the problem here is that bpftrace assumes BPF_PROG_TYPE_TRACEPOINT type
> > for bpf program, but that will fail in perf_event_set_bpf_prog where
> > perf event will be identified as uprobe and demands bpf program type
> > to be BPF_PROG_TYPE_KPROBE
>
> Yes, thanks, I know,
>
> > I don't think
> > there's a way to find out the tracepoint subtype (kprobe/uprobe) from
> > the tracefs record
>
> Hmm, indeed. It seems that it is not possible to derive tp_event->flags
> from tracefs...
>
> Perhaps bpftrace could look for probe_test:func in [uk]probe_events?
> Or simply retry ioctl(PERF_EVENT_IOC_SET_BPF) with BPF_PROG_TYPE_KPROBE
> if BPF_PROG_TYPE_TRACEPOINT returns EINVAL? Ugly, yes.

yep, but will probably work, I created an issue
https://github.com/bpftrace/bpftrace/issues/3447 so it's not lost

jirka
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index c98e3b3386ba..c7e2a0962928 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1443,6 +1443,9 @@ static void uretprobe_perf_func(struct trace_uprobe *tu, unsigned long func,
 				struct pt_regs *regs,
 				struct uprobe_cpu_buffer **ucbp)
 {
+	if (!uprobe_perf_filter(&tu->consumer, 0, current->mm))
+		return;
+
 	__uprobe_perf_func(tu, func, regs, ucbp);
 }